For lack of a better place to put them, this file contains
notes on some of the more intricate details of geom.

-----------------------------------------------------------------------
Locking of bio_children and bio_inbed

bio_children is used by g_std_done() and g_clone_bio() to keep track
of children cloned off a request.  g_clone_bio() increments the
bio_children counter each time it is called, and g_std_done()
increments bio_inbed on every call; if the two counters are then
equal, it calls g_io_deliver() on the parent bio.

The general assumption is that g_clone_bio() is called only in
the g_down thread, and g_std_done() only in the g_up thread, and
therefore the two fields do not generally need locking.  These
restrictions are not enforced by the code, but only with great
care should they be violated.

It is the responsibility of the class implementation to avoid the
following race condition:  A class intends to split a bio into two
children.  It clones the bio, and requests I/O on the first child.
This I/O operation completes before the second child is cloned,
and g_std_done() sees that both counters equal 1 and finishes off
the bio.
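
The counter protocol and the race above can be replayed in a small
userland model.  This is only a sketch: struct model_bio and the
model_*() and split_*() functions are hypothetical stand-ins, not the
kernel's struct bio, g_clone_bio() or g_std_done(), and the two threads
are replaced by a fixed, single-threaded ordering of calls.

```c
/* Hypothetical userland stand-in for the two counters in struct bio. */
struct model_bio {
	unsigned children;	/* models bio_children */
	unsigned inbed;		/* models bio_inbed */
	unsigned delivered;	/* times the parent was "delivered" */
};

/* Models g_clone_bio(): one more child is outstanding. */
void
model_clone(struct model_bio *parent)
{
	parent->children++;
}

/* Models g_std_done(): one child came back; deliver if all are in. */
void
model_done(struct model_bio *parent)
{
	parent->inbed++;
	if (parent->children == parent->inbed)
		parent->delivered++;	/* stands in for g_io_deliver() */
}

/* Racy order: clone, complete, clone, complete.  Returns deliveries. */
unsigned
split_racy(void)
{
	struct model_bio p = { 0, 0, 0 };

	model_clone(&p);
	model_done(&p);		/* counters are 1 == 1: premature delivery */
	model_clone(&p);
	model_done(&p);		/* 2 == 2: parent delivered a second time */
	return (p.delivered);
}

/* Safe order: every child is cloned before any completion is seen. */
unsigned
split_safe(void)
{
	struct model_bio p = { 0, 0, 0 };

	model_clone(&p);
	model_clone(&p);
	model_done(&p);		/* 2 != 1: nothing delivered yet */
	model_done(&p);		/* 2 == 2: delivered exactly once */
	return (p.delivered);
}
```

In this model split_racy() delivers the parent twice, the first time
while half of the request is still outstanding, whereas split_safe()
delivers it exactly once.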

There is no race present in the common case where the bio is split
into multiple parts in the class start method and the I/O is requested
on another GEOM class below:  There is only one g_down thread and
the class below will not get its start method run until we return
from our start method, and consequently the I/O cannot complete
prematurely.

In all other cases, this race needs to be mitigated, for instance
by cloning all children before I/O is requested on any of them.

Notice that cloning an "extra" child and calling g_std_done() on
it directly opens another race, since the assumption is that
g_std_done() is called only in the g_up thread.

-----------------------------------------------------------------------
Statistics collection

Statistics collection can run at three levels controlled by the
"kern.geom.collectstats" sysctl.

At level zero, only the numbers of transactions started and completed
are counted, and this is only because GEOM internally uses the difference
between these two as a sanity check.

At level one we collect the full statistics.  Higher levels are
reserved for future use.
Statistics are collected independently
on both the provider and the consumer, because multiple consumers
can be active against the same provider at the same time.

The statistics collection falls in two parts:

The first and simpler part consists of g_io_request() timestamping
the struct bio when the request is first started and g_io_deliver()
updating the consumer's and provider's statistics based on fields in
the bio when it is completed.  There are no concurrency or locking
concerns in this part.  The statistics collected consist of the number
of requests, number of bytes, number of ENOMEM errors, number of
other errors and duration of the request for each of the three
major request types: BIO_READ, BIO_WRITE and BIO_DELETE.

The second part is trying to keep track of the "busy%".

If in g_io_request() we find that there are no outstanding requests
(based on the counters for scheduled and completed requests being
equal), we set a timestamp in the "wentbusy" field.  Since there
are no outstanding requests, and as long as there is only one thread
pushing the g_down queue, we cannot possibly conflict with
g_io_deliver() until we ship the current request down.
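
The busy-time bookkeeping (the g_io_request() side above together with
the g_io_deliver() side of the same scheme) can be modelled in a few
lines.  This is a sketch under assumptions: struct model_stat, its
nstart and nend fields and the model_*() functions are hypothetical
names, timestamps are plain integers, and the point at which the
"scheduled" counter is incremented is assumed rather than taken from
the kernel source.

```c
#include <stdint.h>

/* Hypothetical stand-in for one provider's or consumer's statistics. */
struct model_stat {
	uint64_t nstart;	/* requests scheduled */
	uint64_t nend;		/* requests completed */
	uint64_t wentbusy;	/* timestamp of the last busy-time update */
	uint64_t bt;		/* accumulated busy time */
};

/* Request side: restart the busy clock only when going idle -> busy. */
void
model_request(struct model_stat *s, uint64_t now)
{
	if (s->nstart == s->nend)
		s->wentbusy = now;
	s->nstart++;
}

/* Completion side: account busy time first, then bump the counter. */
void
model_deliver(struct model_stat *s, uint64_t now)
{
	s->bt += now - s->wentbusy;
	s->wentbusy = now;
	s->nend++;	/* last, so the request side cannot see "idle" early */
}

/* Two overlapping requests, an idle gap, then a third request. */
uint64_t
model_busy_demo(void)
{
	struct model_stat s = { 0, 0, 0, 0 };

	model_request(&s, 10);	/* idle -> busy at t=10 */
	model_request(&s, 12);	/* already busy, clock untouched */
	model_deliver(&s, 15);	/* bt = 5 */
	model_deliver(&s, 20);	/* bt = 10, idle again */
	model_request(&s, 30);	/* idle gap 20..30 is not counted */
	model_deliver(&s, 33);	/* bt = 13 */
	return (s.bt);
}
```

The demo accumulates 13 time units of busy time: 10 for the two
overlapping requests and 3 for the last one, with the idle gap excluded.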

In g_io_deliver() we calculate the delta-T from wentbusy and add this
to the "bt" field, and set wentbusy to the current timestamp.  We
take care to do this before we increment the "requests completed"
counter, since that prevents g_io_request() from touching the
"wentbusy" timestamp concurrently.

The statistics data is made available to userland through the use
of a special allocator (in geom_stats.c) which through a device
allows userland to mmap(2) the pages containing the statistics data.
In order to indicate to userland when the data in a statistics
structure might be inconsistent, g_io_deliver() atomically sets a
flag "updating" and resets it when the structure is again consistent.

-----------------------------------------------------------------------
maxsize, stripesize and stripeoffset

maxsize is the biggest request we are willing to handle.  If not
set there is no upper bound on the size of a request and the code
is responsible for chopping it up.  Only hardware methods should
set an upper bound in this field.  Geom_disk will inherit the upper
bound set by the device driver.

stripesize is the width of any natural request boundaries for the
device.  This would be the optimal width of a stripe on a raid unit.
The idea with this field is to hint to clustering type code to not
trivially overrun these boundaries.

stripeoffset is the amount of the first stripe which lies before the
device's beginning.

If we have a device with 64k stripes:
	[0...64k[
	[64k...128k[
	[128k...192k[
Then it will have stripesize = 64k and stripeoffset = 0.

If we put a MBR on this device, where slice#1 starts on sector#63,
then this slice will have: stripesize = 64k, stripeoffset = 63 * sectorsize.

If the clustering code wants to widen a request which writes to
sector#53 of the slice, it can calculate how many bytes till the end of
the stripe as:
	stripesize - (53 * sectorsize + stripeoffset) % stripesize.
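
As a worked instance of the calculation above, the following helper
evaluates the same expression for an arbitrary sector.  The function
name and its parameters are illustrative only, and the 512-byte sector
size used in the note below is an assumption, not something geom
prescribes.

```c
#include <stdint.h>

/*
 * Bytes from the start of the given sector of a provider to the end of
 * the stripe containing it.  Hypothetical helper, not a geom function.
 */
uint64_t
bytes_to_stripe_end(uint64_t sector, uint64_t sectorsize,
    uint64_t stripesize, uint64_t stripeoffset)
{
	return (stripesize -
	    (sector * sectorsize + stripeoffset) % stripesize);
}
```

With 512-byte sectors, bytes_to_stripe_end(53, 512, 65536, 63 * 512)
gives 6144 bytes, i.e. 12 sectors: sector#53 of the slice is sector#116
of the disk, and the first stripe ends at sector#128.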

-----------------------------------------------------------------------
#include file usage:

                |geom.h|geom_int.h|geom_ext.h|geom_ctl.h|libgeom.h|
----------------+------+----------+----------+----------+---------+
geom class      |      |          |          |          |         |
implementation  |  X   |          |          |          |         |
----------------+------+----------+----------+----------+---------+
geom kernel     |      |          |          |          |         |
infrastructure  |  X   |    X     |    X     |    X     |         |
----------------+------+----------+----------+----------+---------+
libgeom         |      |          |          |          |         |
implementation  |      |          |    X     |    X     |    X    |
----------------+------+----------+----------+----------+---------+
geom aware      |      |          |          |          |         |
application     |      |          |          |    X     |    X    |
----------------+------+----------+----------+----------+---------+

geom_slice.h is special in that it documents a "library" for implementing
a specific kind of class, and consequently does not appear in the above
matrix.

-----------------------------------------------------------------------
Removable media.

In general, the theory is that a drive creates the provider when it has
a media and destroys it when the media disappears.

In a more realistic world, we will allow a provider to be opened medialess
(set any sectorsize and a mediasize==0) in order to allow operations like
open/close tray etc.