For lack of a better place to put them, this file will contain
notes on some of the more intricate details of geom.

-----------------------------------------------------------------------
Locking of bio_children and bio_inbed

bio_children is used by g_std_done() and g_clone_bio() to keep track
of children cloned off a request.  g_clone_bio() increments the
bio_children counter each time it is called, and g_std_done()
increments bio_inbed on every call and, if the two counters are
equal, calls g_io_deliver() on the parent bio.

The general assumption is that g_clone_bio() is called only in
the g_down thread, and g_std_done() only in the g_up thread, and
therefore the two fields do not generally need locking.  These
restrictions are not enforced by the code, but only with great
care should they be violated.

It is the responsibility of the class implementation to avoid the
following race condition:  A class intends to split a bio into two
children.  It clones the bio and requests I/O on the first child.
This I/O operation completes before the second child is cloned,
and g_std_done() sees both counters equal to 1 and finishes off
the bio.

There is no race present in the common case where the bio is split
into multiple parts in the class start method and the I/O is requested
on another GEOM class below:  There is only one g_down thread and
the class below will not get its start method run until we return
from our start method, and consequently the I/O cannot complete
prematurely.

In all other cases, this race needs to be mitigated, for instance
by cloning all children before I/O is requested on any of them.

Notice that cloning an "extra" child and calling g_std_done() on
it directly opens another race, since the assumption is that
g_std_done() is called only in the g_up thread.
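
The counter race described in this section can be illustrated with a
small userland model.  This is only a sketch: struct model_bio and the
model_*() and split_*() functions are hypothetical stand-ins invented
for illustration, not the kernel's struct bio, g_clone_bio() or
g_std_done(), and everything runs in a single thread so that the
ordering is explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two counters kept in the parent bio. */
struct model_bio {
	unsigned children;	/* models bio_children */
	unsigned inbed;		/* models bio_inbed */
	bool delivered;		/* set where g_io_deliver() would be called */
};

/* Models the accounting done when a child is cloned. */
static void
model_clone(struct model_bio *parent)
{
	parent->children++;
}

/* Models the accounting done when a child completes. */
static void
model_done(struct model_bio *parent)
{
	parent->inbed++;
	if (parent->inbed == parent->children) {
		assert(!parent->delivered);	/* must only happen once */
		parent->delivered = true;
	}
}

/*
 * Racy order: the first child completes before the second one is
 * cloned.  Returns true if the parent was finished off prematurely.
 */
static bool
split_interleaved(void)
{
	struct model_bio parent = { 0, 0, false };

	model_clone(&parent);		/* first child cloned and issued */
	model_done(&parent);		/* it completes: counters are 1 and 1 */
	return (parent.delivered);	/* parent is already gone */
}

/*
 * Mitigated order: all children are cloned before any completion can
 * be seen.  Returns true if the parent was delivered only at the end.
 */
static bool
split_clone_first(void)
{
	struct model_bio parent = { 0, 0, false };
	bool early;

	model_clone(&parent);
	model_clone(&parent);
	model_done(&parent);		/* counters are 2 and 1 */
	early = parent.delivered;
	model_done(&parent);		/* counters are 2 and 2 */
	return (!early && parent.delivered);
}
```

In this model split_interleaved() reports a premature delivery, while
split_clone_first() delivers the parent only after the last child has
completed.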
-----------------------------------------------------------------------
Statistics collection

Statistics collection can run at three levels controlled by the
"kern.geom.collectstats" sysctl.

At level zero, only the number of transactions started and completed
is counted, and this is only because GEOM internally uses the difference
between these two as a sanity check.

At level one we collect the full statistics.  Higher levels are
reserved for future use.  Statistics are collected independently
on both the provider and the consumer, because multiple consumers
can be active against the same provider at the same time.

The statistics collection falls into two parts:

The first and simpler part consists of g_io_request() timestamping
the struct bio when the request is first started and g_io_deliver()
updating the consumer's and provider's statistics based on fields in
the bio when it is completed.  There are no concurrency or locking
concerns in this part.  The statistics collected consist of the number
of requests, number of bytes, number of ENOMEM errors, number of
other errors and duration of the request for each of the three
major request types: BIO_READ, BIO_WRITE and BIO_DELETE.

The second part is trying to keep track of the "busy%".

If in g_io_request() we find that there are no outstanding requests
(based on the counters for scheduled and completed requests being
equal), we set a timestamp in the "wentbusy" field.  Since there
are no outstanding requests, and as long as there is only one thread
pushing the g_down queue, we cannot possibly conflict with
g_io_deliver() until we ship the current request down.

In g_io_deliver() we calculate the delta-T from wentbusy and add this
to the "bt" field, and set wentbusy to the current timestamp.  We
take care to do this before we increment the "requests completed"
counter, since that prevents g_io_request() from touching the
"wentbusy" timestamp concurrently.

The statistics data is made available to userland through the use
of a special allocator (in geom_stats.c) which through a device
allows userland to mmap(2) the pages containing the statistics data.
In order to indicate to userland when the data in a statistics
structure might be inconsistent, g_io_deliver() atomically sets a
flag "updating" and resets it when the structure is again consistent.
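
The "busy%" bookkeeping described in this section can be sketched as a
single-threaded userland model.  The struct and function names here
are hypothetical; only the field names "wentbusy" and "bt" come from
the text.  Timestamps are plain integers supplied by the caller, and
the "updating" flag is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-provider or per-consumer counters. */
struct model_stat {
	uint64_t started;	/* requests scheduled */
	uint64_t completed;	/* requests completed */
	uint64_t wentbusy;	/* time of last idle->busy change or completion */
	uint64_t bt;		/* accumulated busy time */
};

/* Models the request side: stamp wentbusy only when nothing is outstanding. */
static void
model_request(struct model_stat *sp, uint64_t now)
{
	if (sp->started == sp->completed)
		sp->wentbusy = now;
	sp->started++;
}

/* Models the delivery side: account the delta-T, then count the completion. */
static void
model_deliver(struct model_stat *sp, uint64_t now)
{
	sp->bt += now - sp->wentbusy;
	sp->wentbusy = now;
	sp->completed++;	/* last, so the request side stays away until now */
}

/* Two overlapping requests, an idle gap, then a third request. */
static uint64_t
model_busy_example(void)
{
	struct model_stat s = { 0, 0, 0, 0 };

	model_request(&s, 10);	/* idle -> busy at t=10 */
	model_request(&s, 12);	/* already busy, wentbusy untouched */
	model_deliver(&s, 15);	/* bt += 15 - 10 */
	model_deliver(&s, 20);	/* bt += 20 - 15, now idle */
	model_request(&s, 30);	/* the idle gap 20..30 is not counted */
	model_deliver(&s, 34);	/* bt += 34 - 30 */
	return (s.bt);
}
```

In this example the model accumulates 14 time units of busy time over
the 24 units between t=10 and t=34.  In a single thread the position
of the completed++ statement makes no difference; it is placed last
only to mirror the ordering the text calls for.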
-----------------------------------------------------------------------
maxsize, stripesize and stripeoffset

maxsize is the biggest request we are willing to handle.  If not
set there is no upper bound on the size of a request and the code
is responsible for chopping it up.  Only hardware methods should
set an upper bound in this field.  Geom_disk will inherit the upper
bound set by the device driver.

stripesize is the width of any natural request boundaries for the
device.  This would be the optimal width of a stripe on a RAID unit.
The idea with this field is to hint to clustering type code to not
trivially overrun these boundaries.

stripeoffset is the amount of the first stripe which lies before the
device's beginning.

If we have a device with 64k stripes:
	[0...64k[
	[64k...128k[
	[128k...192k[
Then it will have stripesize = 64k and stripeoffset = 0.

If we put an MBR on this device, where slice#1 starts on sector#63,
then this slice will have: stripesize = 64k, stripeoffset = 63 * sectorsize.

If the clustering code wants to widen a request which writes to
sector#53 of the slice, it can calculate how many bytes remain until
the end of the stripe as:
	stripesize - (53 * sectorsize + stripeoffset) % stripesize.
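
As a worked example of the calculation above, the following sketch
evaluates it for the MBR slice case.  The function names are
hypothetical, and the 512-byte sector size is an assumption made for
the example; the text does not fix a sector size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes from byte offset "offset" within a provider to the end of
 * the stripe containing that offset.
 */
static uint64_t
bytes_to_stripe_end(uint64_t offset, uint64_t stripesize,
    uint64_t stripeoffset)
{
	assert(stripesize > 0);
	return (stripesize - (offset + stripeoffset) % stripesize);
}

/* Sector 53 of a slice starting at sector 63 on a 64k-stripe device. */
static uint64_t
stripe_example(void)
{
	const uint64_t sectorsize = 512;	/* assumed for this example */
	const uint64_t stripesize = 64 * 1024;
	const uint64_t stripeoffset = 63 * sectorsize;

	return (bytes_to_stripe_end(53 * sectorsize, stripesize,
	    stripeoffset));
}
```

With these numbers the result is 6144 bytes, or 12 sectors: sector 53
of the slice is sector 116 of the device, and the first stripe ends at
sector 128.  An offset that falls exactly on a stripe boundary yields
a full stripesize.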
-----------------------------------------------------------------------
#include file usage:

                |geom.h|geom_int.h|geom_ext.h|geom_ctl.h|libgeom.h|
----------------+------+----------+----------+----------+---------+
geom class      |      |          |          |          |         |
implementation  |  X   |          |          |          |         |
----------------+------+----------+----------+----------+---------+
geom kernel     |      |          |          |          |         |
infrastructure  |  X   |    X     |    X     |    X     |         |
----------------+------+----------+----------+----------+---------+
libgeom         |      |          |          |          |         |
implementation  |      |          |    X     |    X     |    X    |
----------------+------+----------+----------+----------+---------+
geom aware      |      |          |          |          |         |
application     |      |          |          |    X     |    X    |
----------------+------+----------+----------+----------+---------+

geom_slice.h is special in that it documents a "library" for implementing
a specific kind of class, and consequently does not appear in the above
matrix.
-----------------------------------------------------------------------
Removable media

In general, the theory is that a drive creates the provider when it has
media and destroys it when the media disappears.

In a more realistic world, we will allow a provider to be opened medialess
(set any sectorsize and a mediasize==0) in order to allow operations like
open/close tray etc.