Now that Alex has the basic LVM stuff in we need to add soft RAID-1
to it. I have some ideas on how it could be implemented.

This is not set in stone at all; this is just me rattling off my
RAID-1 implementation ideas. It isn't quite as complex as it sounds,
really! I swear it isn't! But if we could implement something like
this we would have the best soft RAID-1 implementation around.

Here are the basic problems which need to be solved:

* Allow partial downtime for pieces of the mirror such that
  when the mirror becomes whole again the entire drive does not
  have to be copied. Instead only the segments of the drive that
  are out of sync would be resynchronized.

  We want to avoid having to completely resynchronize the entire
  contents of a potentially multi-terabyte drive if one is
  taken offline temporarily and then brought back online.

* Allow mixed I/O errors on both drives making up the mirror
  without taking the entire mirror offline.

* Allow I/O read or write errors on one drive to degrade only
  the related segment and not the whole drive.

* Allow most writes to be asynchronous to the two drives making
  up the mirror up to the synchronization point. Avoid unnecessary
  writes to the on-media segment array even through a
  synchronization point.

* Detect mirrors that are out of sync due to a system crash
  occurring prior to a synchronization point (i.e. when the
  drives themselves are just fine). When this case occurs either
  copy is valid and one must be selected, but then the selected
  copy must be resynchronized to the other drive in the mirror
  to prevent the read data from 'changing' randomly from the point
  of view of whoever is reading it.

And my idea on implementation:

* Implement a segment descriptor array for each drive in the
  mirror, breaking the drive down into large pieces, for
  example 128MB per segment. The segment array would be stored
  on both disks making up the mirror. In addition, each disk will
  store the segment state for BOTH disks.

  Thus a 1TBx2 mirror would have 8192x4 segments (4 segment
  descriptors for each logical segment). The segment descriptor
  array would ideally be small enough to cache in memory. Being
  able to cache it in memory simplifies lookups.

  A segment descriptor would be, oh I don't know... probably
  16 bytes. Leave room for expansion :-)

  Why does each disk need to store a segment descriptor for both
  disks? So we can 'remember' the state of the dead disk on the
  live disk in order to resolve mismatches later on when the
  dead disk comes back to life.

* The state of the segment descriptor must be consulted when reading
  or writing. Some states are in-memory-only states while others
  can exist on-media or in-memory. The states are represented by
  a set of bit flags:

  MEDIA_UNSTABLE        0: The content is stable on-media and
                           fully synchronized.

                        1: The content is unstable on-media
                           (writes have been made and have not
                           been completely synchronized to both
                           drives).

  MEDIA_READ_DEGRADED   0: No I/O read error occurred on this segment
                        1: I/O read error(s) occurred on this segment

  MEDIA_WRITE_DEGRADED  0: No I/O write error occurred on this segment
                        1: I/O write error(s) occurred on this segment

  MEDIA_MASTER          0: Normal operation

                        1: Mastership operation for this segment
                           on this drive, which is set when the
                           other drive in the mirror has failed
                           and writes are made to the drive that
                           is still operational.

  UNINITIALIZED         0: The segment contains normal data.

                        1: The entire segment is empty and should
                           read all zeros regardless of the actual
                           content on the media.

                           (Use for newly initialized mirrors as
                           a way to avoid formatting the whole
                           drive or SSD?)

  OLD_UNSTABLE          Copy of the original MEDIA_UNSTABLE bit
                        initially read from the media. This bit is
                        only recopied after the related segment has
                        been fully synchronized.

  OLD_MASTER            Copy of the original MEDIA_MASTER bit
                        initially read from the media. This bit is
                        only recopied after the related segment has
                        been fully synchronized.

  We probably need room for a serial number or timestamp in the
  segment descriptor as well in order to resolve certain situations.

* Since updating a segment descriptor on-media is expensive
  (requiring at least one disk synchronization command and of
  course a nasty seek), segment descriptors on-media are updated
  synchronously only when going from a STABLE to an UNSTABLE state,
  meaning the segment is undergoing active writing.

  Changing a segment descriptor from unstable to stable can be
  delayed indefinitely (synchronized on a long timer, like
  30 or 60 seconds). All that happens if a crash occurs in the
  meantime is a little extra copying of segments on reboot.
  Theoretically, anyway.

OK, now what actions need to be taken to satisfy a read or write?
The actions taken will be based on the segment state for the segment
involved in the I/O. Any I/O which crosses a segment boundary would
be split into two or more I/Os and treated separately.

Remember there are four descriptors for each segment, two on each drive:

    DISK1 STATE stored on disk1
    DISK2 STATE stored on disk1

    DISK1 STATE stored on disk2
    DISK2 STATE stored on disk2

In order to simplify matters any inconsistencies between e.g. the DISK2
state as stored on disk1 and the DISK2 state as stored on disk2 would
be resolved immediately prior to initiation of the actual I/O. Otherwise
the combination of four states is just too complex.

So if both drives are operational this resolution must take place. If
only one drive is operational then the state stored in the segment
descriptors on that one operational drive is consulted to obtain the
state of both drives.

This is the hard part. Let's take the mismatched cases first. That is,
when the DISK2 STATE stored on DISK1 is different from the DISK2 STATE
stored on DISK2 (or vice versa... the DISK1 state stored on each drive):

* If one of the two conflicting states has the UNSTABLE or MASTER
  bits set then set the same bits in the other.

  Basically just OR some of the bits together and store to
  both copies. But not all of the bits.

* If doing a write operation and the segment is marked UNINITIALIZED
  the entire segment must be zero-filled and the bit cleared prior
  to the write operation. ???? (This needs more thought, maybe even
  a sub-bitmap. See later on in this email.)

OK, now that we have done that we can just consider two states, one for
DISK1 and one for DISK2, coupled with the I/O operation:

WHEN READING:

* If MASTER is NOT set on either drive the read may be
  sent to either drive.

* If MASTER is set on one of the drives the read must be sent
  only to that drive.

* If MASTER is set on both drives then we are screwed. This case
  can occur if one of the mirror drives goes down and a bunch of
  writes are made to the other, then the system is rebooted and
  the original mirror drive comes up but the other drive goes down.

  So this condition detects a conflict. We must return an I/O
  error for the READ, presumably. The only way to resolve this
  is for a manual intervention to explicitly select one or the
  other drive as the master.

* If READ_DEGRADED is set on one drive the read can be directed to
  the other. If READ_DEGRADED is set on both drives then either
  drive can be selected. If the read fails on any given drive
  it is of course redispatched to the other drive regardless.
  When READ_DEGRADED is set on one drive and only one drive is up
  we still issue the read to that drive, obviously, since we have
  no other choice.

WHEN WRITING:

* If MASTER is NOT set on either drive the write is directed to
  both drives.

* Otherwise a WRITE is directed only to the drive with MASTER set.

* If both drives are marked MASTER the write is directed to both
  drives. This is a conflict situation on read but writing will
  still work just fine. The MASTER bit is left alone.

* If an I/O error occurs on one of the drives the WRITE_DEGRADED
  bit is set for that drive and the other drive (where the write
  succeeded) is marked as MASTER.

  However, we can only do this if neither drive is already a MASTER.

  If a drive is already marked MASTER we cannot mark the other drive
  as MASTER. The failed write will cause an I/O error to be
  returned.

RESYNCHRONIZATION:

* A kernel thread is created to manage mirror synchronization.

* Synchronization of out-of-sync mirror segments can occur
  asynchronously, but must interlock against I/O operations
  that might conflict.

  The segment array on the drive(s) is used to determine what
  segments need to be resynchronized.

* Synchronization occurs when the segment for one drive is
  marked MASTER and the segment for the other drive is not.

* In a conflict situation (where both drives are marked MASTER
  for any given segment) a manual intervention is required to
  specify (e.g. through an ioctl) which of the two drives is
  the master. This overrides the MASTER bits for all segments
  and allows synchronization to occur for all conflicting
  segments (or possibly all segments, period, in the case where
  a new mirror drive is being deployed).

SEGMENT ARRAY ON-MEDIA AND HEADER:

* The mirroring code must reserve some of the sectors on the
  drives to hold a header and the segment array, making the
  resulting logical mirror a bit smaller than it otherwise
  would be.

* The header must contain a unique serial number (the uuid code
  can be used to generate it).

* When manual intervention is required to specify a master, a new
  unique serial number must be generated for that master to
  prevent 'old' mirror drives that were removed from the system
  from being improperly recognized as part of the new mirror
  when they aren't any more.

* Automatic detection of the mirror status is possible by using
  the serial number in the header.

* If the serial numbers in the headers for the two drives
  making up the mirror do not match (when both drives are up and
  both header read I/Os succeeded), manual intervention is required.

* Auto-detection of mirror segments a la GEOM, using on-disk
  headers, is discouraged. I think it is too dangerous and would
  much rather have the detection be based on the drive serial
  number rather than serial numbers stored on-media in headers.

  However, I guess this is a function of LVM? So I might not have
  any control over it.

THE UNINITIALIZED FLAG

When formatting a new mirror, or when a drive is torn out and a new
drive is added, the drive(s) in question must be formatted. To
avoid actually writing to all sectors of the drive, which would
take too long on multi-terabyte drives and create unnecessary
writes on things like SSDs, we instead use an UNINITIALIZED flag
state in the descriptor.

If the flag is set, any read I/O to the related segment is simply
zero-filled.

When writing we have to zero-fill the segment (write zeros to the
whole 128MB segment) and then clear the UNINITIALIZED flag before
allowing the write I/O to proceed.
We might want to use some of the bits in the descriptor as a
sub-bitmap. E.g. if we reserve 4 bytes in the 16-byte descriptor
to be an 'UNINITIALIZED' sub-bitmap we can break the 128MB
segment down into 4MB pieces and only zero-fill/write portions
of the 128MB segment instead of having to do the whole segment.

I don't know how well this idea would work in real life. Another
option is to just return random data for the uninitialized portions
of a new mirror, but that kinda breaks the whole abstraction and
could blow up certain types of filesystems, like ZFS, which
assume any read data is stable on-media.

-Matt