$FreeBSD$

For the lack of a better place to put them, this file will contain
notes on some of the more intricate details of geom.

-----------------------------------------------------------------------
Locking of bio_children and bio_inbed

bio_children is used by g_std_done() and g_clone_bio() to keep track
of children cloned off a request.  g_clone_bio() increments the
bio_children counter each time it is called, and g_std_done()
increments bio_inbed on every call; if the two counters are then
equal, it calls g_io_deliver() on the parent bio.

The general assumption is that g_clone_bio() is called only in
the g_down thread, and g_std_done() only in the g_up thread and
therefore the two fields do not generally need locking.  These
restrictions are not enforced by the code, but only with great
care should they be violated.

It is the responsibility of the class implementation to avoid the
following race condition:  A class intends to split a bio in two
children.  It clones the bio, and requests I/O on the child.
This I/O operation completes before the second child is cloned
and g_std_done() sees the counters both equal 1 and finishes off
the bio.
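
The counter protocol and the race described above can be illustrated
with a small userland model.  This is only a sketch: struct model_bio
and the model_*() functions are hypothetical stand-ins and not the
kernel's struct bio, g_clone_bio() or g_std_done(); they reproduce
nothing but the counting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for the two counters in struct bio. */
struct model_bio {
	unsigned children;	/* models bio_children */
	unsigned inbed;		/* models bio_inbed */
	bool delivered;		/* set where g_io_deliver() would be called */
};

/* Models the accounting of g_clone_bio(): one more child exists. */
static void
model_clone(struct model_bio *parent)
{
	parent->children++;
}

/* Models the accounting of g_std_done(): one more child has finished. */
static void
model_done(struct model_bio *parent)
{
	assert(!parent->delivered);	/* a finished parent must not be touched */
	parent->inbed++;
	if (parent->inbed == parent->children)
		parent->delivered = true;
}

/* Racy order: the first child completes before the second is cloned. */
bool
model_racy_split(void)
{
	struct model_bio parent = { 0, 0, false };

	model_clone(&parent);
	model_done(&parent);		/* counters are 1 and 1 */
	return (parent.delivered);	/* true: parent finished too early */
}

/* One safe order: both children exist before either can complete. */
bool
model_safe_split(void)
{
	struct model_bio parent = { 0, 0, false };
	bool early;

	model_clone(&parent);
	model_clone(&parent);
	model_done(&parent);		/* counters are 2 and 1 */
	early = parent.delivered;
	model_done(&parent);		/* counters are 2 and 2 */
	return (!early && parent.delivered);
}
```

In the racy order the parent is marked delivered while the class still
intends to clone a second child off it.  In the second order the parent
is delivered only after both children have completed.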

There is no race present in the common case where the bio is split
in multiple parts in the class start method and the I/O is requested
on another GEOM class below:  There is only one g_down thread and
the class below will not get its start method run until we return
from our start method, and consequently the I/O cannot complete
prematurely.

In all other cases, this race needs to be mitigated, for instance
by cloning all children before I/O is requested on any of them.

Notice that cloning an "extra" child and calling g_std_done() on
it directly opens another race since the assumption is that
g_std_done() is only called in the g_up thread.

-----------------------------------------------------------------------
Statistics collection

Statistics collection can run at three levels controlled by the
"kern.geom.collectstats" sysctl.

At level zero, only the number of transactions started and completed
are counted, and this is only because GEOM internally uses the difference
between these two as sanity checks.

At level one we collect the full statistics.  Higher levels are
reserved for future use.
Statistics are collected independently
on both the provider and the consumer, because multiple consumers
can be active against the same provider at the same time.

The statistics collection falls in two parts:

The first and simpler part consists of g_io_request() timestamping
the struct bio when the request is first started and g_io_deliver()
updating the consumer's and provider's statistics based on fields in
the bio when it is completed.  There are no concurrency or locking
concerns in this part.  The statistics collected consist of number
of requests, number of bytes, number of ENOMEM errors, number of
other errors and duration of the request for each of the three
major request types: BIO_READ, BIO_WRITE and BIO_DELETE.

The second part is trying to keep track of the "busy%".

If in g_io_request() we find that there are no outstanding requests,
(based on the counters for scheduled and completed requests being
equal), we set a timestamp in the "wentbusy" field.  Since there
are no outstanding requests, and as long as there is only one thread
pushing the g_down queue, we cannot possibly conflict with
g_io_deliver() until we ship the current request down.
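
The request-side step just described can be sketched as a userland
model.  The structure and its field names (nstart, nend) and the
function model_request() are hypothetical and chosen for illustration;
only "wentbusy" is a name taken from the text above, and timestamps
are plain integers here rather than whatever the kernel uses.

```c
#include <stdint.h>

/* Hypothetical model of the counters consulted on the request side. */
struct model_stats {
	uint64_t nstart;	/* requests scheduled so far */
	uint64_t nend;		/* requests completed so far */
	uint64_t wentbusy;	/* time of the last idle-to-busy transition */
};

/*
 * Models the busy-tracking step on the request side: if nothing is
 * outstanding, remember when we became busy, then count the request
 * as scheduled.
 */
void
model_request(struct model_stats *st, uint64_t now)
{
	if (st->nstart == st->nend)
		st->wentbusy = now;
	st->nstart++;
}
```

A request arriving while another is still outstanding leaves wentbusy
alone; only a request arriving on an idle object resets it.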

In g_io_deliver() we calculate the delta-T from wentbusy and add this
to the "bt" field, and set wentbusy to the current timestamp.  We
take care to do this before we increment the "requests completed"
counter, since that prevents g_io_request() from touching the
"wentbusy" timestamp concurrently.

The statistics data is made available to userland through the use
of a special allocator (in geom_stats.c) which through a device
allows userland to mmap(2) the pages containing the statistics data.
In order to indicate to userland when the data in a statistics
structure might be inconsistent, g_io_deliver() atomically sets a
flag "updating" and resets it when the structure is again consistent.

-----------------------------------------------------------------------
maxsize, stripesize and stripeoffset

maxsize is the biggest request we are willing to handle.  If not
set there is no upper bound on the size of a request and the code
is responsible for chopping it up.  Only hardware methods should
set an upper bound in this field.  Geom_disk will inherit the upper
bound set by the device driver.

stripesize is the width of any natural request boundaries for the
device.  This would be the width of a stripe on a raid-5 unit or
one zone in GBDE.
The idea with this field is to hint to clustering
type code to not trivially overrun these boundaries.

stripeoffset is the amount of the first stripe which lies before the
device's beginning.

If we have a device with 64k stripes:
	[0...64k[
	[64k...128k[
	[128k...192k[
Then it will have stripesize = 64k and stripeoffset = 0.

If we put an MBR on this device, where slice#1 starts on sector#63,
then this slice will have: stripesize = 64k, stripeoffset = 63 * sectorsize.

If the clustering code wants to widen a request which writes to
sector#53 of the slice, it can calculate how many bytes till the end of
the stripe as:
	stripesize - (53 * sectorsize + stripeoffset) % stripesize.
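
The calculation above can be written as a small helper.  This is a
sketch only: the function name bytes_to_stripe_end() is hypothetical,
the offset is passed in bytes rather than sectors, and stripesize is
assumed to be non-zero.

```c
#include <stdint.h>

/*
 * Number of bytes from a byte offset within the provider to the end
 * of the stripe containing that offset.  Assumes stripesize > 0.
 */
uint64_t
bytes_to_stripe_end(uint64_t offset, uint64_t stripesize,
    uint64_t stripeoffset)
{
	return (stripesize - (offset + stripeoffset) % stripesize);
}
```

Assuming 512 byte sectors, the slice above gives
bytes_to_stripe_end(53 * 512, 65536, 63 * 512) = 65536 - 116 * 512 =
6144 bytes, that is, 12 sectors remain before the stripe boundary.
An offset which falls exactly on a boundary yields a full stripesize.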