$FreeBSD$

For lack of a better place to put them, this file will contain
notes on some of the more intricate details of geom.

-----------------------------------------------------------------------
Locking of bio_children and bio_inbed

bio_children and bio_inbed are used by g_clone_bio() and g_std_done()
to keep track of children cloned off a request.  g_clone_bio()
increments the parent's bio_children counter each time it is called,
and g_std_done() increments bio_inbed on every call; if the two
counters are then equal, it calls g_io_deliver() on the parent bio.
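
The counting scheme above can be sketched in userland as follows.
This is only a toy model, not the kernel code: struct toy_bio,
toy_clone() and toy_done() are hypothetical stand-ins for struct bio,
g_clone_bio() and g_std_done(), and the "delivered" flag marks the
point where g_io_deliver() would be called on the parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct bio: a parent link and the two counters. */
struct toy_bio {
	struct toy_bio *parent;	/* NULL for the original request */
	unsigned children;	/* models bio_children */
	unsigned inbed;		/* models bio_inbed */
	bool delivered;		/* set where g_io_deliver() would run */
};

/* Models g_clone_bio(): account for one more child of the parent. */
static void
toy_clone(struct toy_bio *parent, struct toy_bio *child)
{
	child->parent = parent;
	child->children = 0;
	child->inbed = 0;
	child->delivered = false;
	parent->children++;
}

/* Models g_std_done(): one child finished; deliver when all have. */
static void
toy_done(struct toy_bio *child)
{
	struct toy_bio *parent = child->parent;

	assert(parent != NULL && parent->inbed < parent->children);
	parent->inbed++;
	if (parent->inbed == parent->children)
		parent->delivered = true;
}
```

With two children cloned before either completes, the parent is
marked delivered only after the second toy_done() call.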

The general assumption is that g_clone_bio() is called only in
the g_down thread, and g_std_done() only in the g_up thread and
therefore the two fields do not generally need locking.  These
restrictions are not enforced by the code, but only with great
care should they be violated.

It is the responsibility of the class implementation to avoid the
following race condition:  A class intends to split a bio into two
children.  It clones the bio, and requests I/O on the first child.
This I/O operation completes before the second child is cloned,
so g_std_done() sees both counters equal to 1 and finishes off
the parent bio prematurely.

There is no race present in the common case where the bio is split
into multiple parts in the class start method and the I/O is requested
on another GEOM class below:  There is only one g_down thread and
the class below will not get its start method run until we return
from our start method, and consequently the I/O cannot complete
prematurely.

In all other cases, this race needs to be mitigated, for instance
by cloning all children before I/O is requested on any of them.
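
The effect of the ordering can be shown with a small deterministic
simulation.  This is a sketch under simplifying assumptions, not
kernel code: struct toy_parent, toy_clone(), toy_done() and
premature_deliveries() are hypothetical names, and "the I/O
completes" is modelled by calling toy_done() directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the parent bio's bookkeeping. */
struct toy_parent {
	unsigned children;	/* models bio_children */
	unsigned inbed;		/* models bio_inbed */
	unsigned deliveries;	/* times g_io_deliver() would have run */
};

/* Models the counter update done by g_clone_bio(). */
static void
toy_clone(struct toy_parent *p)
{
	p->children++;
}

/* Models g_std_done() for one completed child. */
static void
toy_done(struct toy_parent *p)
{
	assert(p->inbed < p->children);
	p->inbed++;
	if (p->inbed == p->children)
		p->deliveries++;
}

/*
 * Split a request into two children and report how many times the
 * parent has been delivered once the first child completes.  Any
 * value other than 0 means the parent was finished too early.
 */
static unsigned
premature_deliveries(bool clone_all_first)
{
	struct toy_parent p = { 0, 0, 0 };

	toy_clone(&p);
	if (clone_all_first)
		toy_clone(&p);	/* mitigation: clone before any I/O */
	toy_done(&p);		/* the first child's I/O completes */
	return (p.deliveries);
}
```

Without the mitigation the counters are both 1 when the first child
completes and the parent is delivered early; with both clones made
up front the counters are 2 and 1, so nothing is delivered yet.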

Notice that cloning an "extra" child and calling g_std_done() on
it directly opens another race, since the assumption is that
g_std_done() is only called in the g_up thread.

-----------------------------------------------------------------------
Statistics collection

Statistics collection can run at three levels controlled by the
"kern.geom.collectstats" sysctl.

At level zero, only the numbers of transactions started and completed
are counted, and this is only because GEOM internally uses the
difference between these two as a sanity check.

At level one we collect the full statistics.  Higher levels are
reserved for future use.  Statistics are collected independently
on both the provider and the consumer, because multiple consumers
can be active against the same provider at the same time.

The statistics collection falls into two parts:

The first and simpler part consists of g_io_request() timestamping
the struct bio when the request is first started and g_io_deliver()
updating the consumer's and provider's statistics based on fields in
the bio when it is completed.  There are no concurrency or locking
concerns in this part.  The statistics collected consist of the number
of requests, number of bytes, number of ENOMEM errors, number of
other errors and duration of the request for each of the three
major request types: BIO_READ, BIO_WRITE and BIO_DELETE.

The second part tries to keep track of the "busy%".

If in g_io_request() we find that there are no outstanding requests
(based on the counters for scheduled and completed requests being
equal), we set a timestamp in the "wentbusy" field.  Since there
are no outstanding requests, and as long as there is only one thread
pushing the g_down queue, we cannot possibly conflict with
g_io_deliver() until we ship the current request down.

In g_io_deliver() we calculate the delta-T from wentbusy and add this
to the "bt" field, and set wentbusy to the current timestamp.  We
take care to do this before we increment the "requests completed"
counter, since that prevents g_io_request() from touching the
"wentbusy" timestamp concurrently.
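
The busy-time bookkeeping described above can be sketched as follows.
This is a single-threaded toy model, not the kernel code: struct
toy_stat, toy_request() and toy_deliver() are hypothetical names,
timestamps are plain integers in arbitrary units, and the memory
ordering a real two-thread implementation depends on is not
modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the fields involved in tracking busy time. */
struct toy_stat {
	uint64_t nstart;	/* requests started */
	uint64_t nend;		/* requests completed */
	uint64_t wentbusy;	/* time of the last busy-time update */
	uint64_t bt;		/* accumulated busy time */
};

/* Models the busy-time part of g_io_request(). */
static void
toy_request(struct toy_stat *s, uint64_t now)
{
	if (s->nstart == s->nend)
		s->wentbusy = now;	/* idle: a busy period starts */
	s->nstart++;
}

/* Models the busy-time part of g_io_deliver(). */
static void
toy_deliver(struct toy_stat *s, uint64_t now)
{
	assert(s->nend < s->nstart);
	s->bt += now - s->wentbusy;	/* add the delta-T since wentbusy */
	s->wentbusy = now;
	s->nend++;			/* last: see the paragraph above */
}
```

For example, requests started at times 10 and 12 and completed at
15 and 20, followed by a request started at 30 and completed at 34,
accumulate bt = 14; the idle gap from 20 to 30 is not counted.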

The statistics data is made available to userland through the use
of a special allocator (in geom_stats.c) which through a device
allows userland to mmap(2) the pages containing the statistics data.
In order to indicate to userland when the data in a statistics
structure might be inconsistent, g_io_deliver() atomically sets a
flag "updating" and resets it when the structure is again consistent.

-----------------------------------------------------------------------
maxsize, stripesize and stripeoffset

maxsize is the biggest request we are willing to handle.  If not
set, there is no upper bound on the size of a request and the code
is responsible for chopping it up.  Only hardware methods should
set an upper bound in this field.  Geom_disk will inherit the upper
bound set by the device driver.

stripesize is the width of any natural request boundaries for the
device.  This would be the width of a stripe on a RAID-5 unit or
one zone in GBDE.  The idea with this field is to hint to
clustering-type code not to trivially overrun these boundaries.

stripeoffset is the amount of the first stripe which lies before the
device's beginning.

If we have a device with 64k stripes:
	[0...64k[
	[64k...128k[
	[128k...192k[
Then it will have stripesize = 64k and stripeoffset = 0.

If we put an MBR on this device, where slice#1 starts on sector#63,
then this slice will have: stripesize = 64k, stripeoffset = 63 * sectorsize.

If the clustering code wants to widen a request which writes to
sector#53 of the slice, it can calculate how many bytes remain until
the end of the stripe as:
	stripesize - (53 * sectorsize + stripeoffset) % stripesize.
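
As a worked example, the calculation can be written as a small
helper.  This is only a sketch: bytes_to_stripe_end() is a
hypothetical name, and the figures below assume 512-byte sectors,
which the text above does not specify.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes from byte offset "offset" (relative to the start of the
 * provider) to the end of the stripe containing it.  All arguments
 * are in bytes.  An offset exactly on a stripe boundary yields a
 * full stripesize.
 */
static uint64_t
bytes_to_stripe_end(uint64_t offset, uint64_t stripesize,
    uint64_t stripeoffset)
{
	assert(stripesize > 0);
	return (stripesize - (offset + stripeoffset) % stripesize);
}
```

With sectorsize = 512, stripesize = 65536 and stripeoffset =
63 * 512 = 32256, a write to sector#53 of the slice starts at byte
27136, and the helper returns 65536 - (27136 + 32256) % 65536 = 6144
bytes, i.e. 12 sectors.  This agrees with the device view: slice
sector 53 is device sector 116, and the next stripe boundary is at
device sector 128.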