1    Now that Alex has the basic LVM support in, we need to add soft-raid-1
2    to it.  I have some ideas on how it could be implemented.
3
4    This is not set in stone at all, this is just me rattling off my
5    RAID-1 implementation ideas.  It isn't quite as complex as it sounds,
6    really!  I swear it isn't!  But if we could implement something like
7    this we would have the best soft-raid-1 implementation around.
8
9    Here are the basic problems which need to be solved:
10
11	* Allow partial downtimes for pieces of the mirror such that
12	  when the mirror becomes whole again the entire drive does not
13	  have to be copied.  Instead only the segments of the drive that
14	  are out of sync would be resynchronized.
15
16	  We want to avoid having to completely resynchronize the entire
17	  contents of a potentially multi-terabyte drive if one is
18	  taken offline temporarily and then brought back online.
19
20	* Allow mixed I/O errors on both drives making up the mirror
21	  without taking the entire mirror offline.
22
23	* Allow I/O read or write errors on one drive to degrade only
24	  the related segment and not the whole drive.
25
26	* Allow most writes to be asynchronous to the two drives making
27	  up the mirror up to the synchronization point.  Avoid unnecessary
28	  writes to the segment array on-media even through a synchronization
29	  point.
30
31	* Detect mirrors that are out of sync due to a system crash
32	  occurring prior to a synchronization point (i.e. when the
33	  drives themselves are just fine).  When this case occurs either
34	  copy is valid and one must be selected, but then the selected
35	  copy must be resynchronized to the other drive in the mirror
36	  to prevent the read data from 'changing' randomly from the point
37	  of view of whoever is reading it.
38
39    And my idea on implementation:
40
41	* Implement a segment descriptor array for each drive in the
42	  mirror, breaking the drive down into large pieces.  For
43	  example, 128MB per segment.  The segment array would be stored
44	  on both disks making up the mirror.  In addition, each disk will
45	  store the segment state for BOTH disks.
46
47	  Thus a 1TBx2 mirror would have 8192 logical segments and 8192x4
48	  segment descriptors (4 descriptors for each logical segment).  The
49	  segment descriptor array would ideally be small enough to cache
50	  in-memory.  Being able to cache it in-memory simplifies lookups.
51
52	  A segment descriptor would be, oh I don't know... probably
53	  16 bytes.  Leave room for expansion :-)
54
55	  Why does each disk need to store a segment descriptor for both
56	  disks?  So we can 'remember' the state of the dead disk on the
57	  live disk in order to resolve mismatches later on when the
58	  dead disk comes back to life.
59
60	* The state of the segment descriptor must be consulted when reading
61	  or writing.  Some states are in-memory-only states while others
62	  can exist on-media or in-memory.  The states are represented by
63	  a set of bit flags:
64
65	  MEDIA_UNSTABLE	0: The content is stable on-media and
66				   fully synchronized.
67
68				1: The content is unstable on-media
69				   (writes have been made and have not
70				    been completely synchronized to both
71				    drives).
72
73	  MEDIA_READ_DEGRADED	0: No I/O read error occurred on this segment
74				1: I/O read error(s) occurred on this segment
75
76	  MEDIA_WRITE_DEGRADED	0: No I/O write error occurred on this segment
77				1: I/O write error(s) occurred on this segment
78
79	  MEDIA_MASTER		0: Normal operation
80
81				1: Mastership operation for this segment
82				   on this drive, which is set when the
83				   other drive in the mirror has failed
84				   and writes are made to the drive that
85				   is still operational.
86
87	  UNINITIALIZED		0: The segment contains normal data.
88
89				1: The entire segment is empty and should
90				   read all zeros regardless of the actual
91				   content on the media.
92
93				   (Use for newly initialized mirrors as
94				   a way to avoid formatting the whole
95				   drive or SSD?).
96
97	  OLD_UNSTABLE		Copy of original MEDIA_UNSTABLE bit initially
98				read from the media.  This bit is only
99				recopied after the related segment has been
100				fully synchronized.
101
102	  OLD_MASTER		Copy of original MEDIA_MASTER bit initially
103				read from the media.  This bit is only
104				recopied after the related segment has been
105				fully synchronized.
106
107	  We probably need room for a serial number or timestamp in the
108	  segment descriptor as well in order to resolve certain situations.
109
110	* Since updating a segment descriptor on-media is expensive
111	  (requiring at least one disk synchronization command and of
112	  course a nasty seek), segment descriptors on-media are updated
113	  synchronously only when going from a STABLE to an UNSTABLE state,
114	  meaning the segment is undergoing active writing.
115
116	  Changing a segment descriptor from unstable to stable can be
117	  delayed indefinitely (synchronized on a long timer, like
118	  30 or 60 seconds).  If a crash occurs in the meantime, all that
119	  happens is a little extra copying of segments on reboot.
120	  Theoretically, anyway.
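
A rough sketch of what the descriptor and the lazy stable/unstable update
could look like in C.  Everything here is an assumption on top of the notes
above: the bit positions, the field layout of the 16 bytes, and the names
(seg_desc, seg_count, seg_mark_unstable) are made up for illustration.  I am
also assuming the OLD_* bits are in-memory only.

```c
#include <stdint.h>

#define SEG_SIZE (128ULL << 20)         /* 128MB per segment, as in the notes */

/* Flag names follow the list above; the bit values are assumed. */
enum seg_flags {
    SEG_MEDIA_UNSTABLE       = 0x0001,
    SEG_MEDIA_READ_DEGRADED  = 0x0002,
    SEG_MEDIA_WRITE_DEGRADED = 0x0004,
    SEG_MEDIA_MASTER         = 0x0008,
    SEG_UNINITIALIZED        = 0x0010,
    SEG_OLD_UNSTABLE         = 0x0020,  /* assumed in-memory only */
    SEG_OLD_MASTER           = 0x0040   /* assumed in-memory only */
};

/* Hypothetical 16-byte layout with room for a serial number or timestamp. */
struct seg_desc {
    uint32_t flags;
    uint32_t reserved;                  /* room for expansion */
    uint64_t serial;
};

_Static_assert(sizeof(struct seg_desc) == 16, "descriptor should be 16 bytes");

/* Number of segments needed to cover one drive. */
static uint64_t seg_count(uint64_t disk_bytes)
{
    return (disk_bytes + SEG_SIZE - 1) / SEG_SIZE;
}

/* Each disk stores the descriptors for BOTH disks, hence the factor of 2. */
static uint64_t seg_array_bytes_per_disk(uint64_t disk_bytes)
{
    return seg_count(disk_bytes) * 2 * sizeof(struct seg_desc);
}

/*
 * Called before a write is issued to a segment.  Returns 1 if the
 * descriptor just went from stable to unstable, in which case the caller
 * has to flush it to media synchronously before the write proceeds.
 * Returns 0 if it was already unstable and nothing needs to hit the media.
 */
static int seg_mark_unstable(struct seg_desc *d)
{
    if (d->flags & SEG_MEDIA_UNSTABLE)
        return 0;
    d->flags |= SEG_MEDIA_UNSTABLE;
    return 1;
}
```

With these assumptions a 1TB drive works out to 8192 segments and 256KB of
descriptors per disk, which is easily small enough to keep in memory.  The
unstable-to-stable transition is deliberately not shown, since it can be
left to a slow background timer.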
121
122    Ok, now what actions need to be taken to satisfy a read or write?
123    The actions taken will be based on the segment state for the segment
124    involved in the I/O.  Any I/O which crosses a segment boundary would
125    be split into two or more I/Os and treated separately.
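
Splitting an I/O at segment boundaries is simple enough to sketch.  This is
only an illustration; the io_piece structure and the split_io name are
hypothetical, and a real implementation would work on whatever buffer
structure the kernel hands us.

```c
#include <stddef.h>
#include <stdint.h>

#define SEG_SIZE (128ULL << 20)         /* 128MB per segment */

/* One piece of a split I/O, guaranteed to sit inside a single segment. */
struct io_piece {
    uint64_t offset;
    uint64_t length;
};

/*
 * Split the byte range [offset, offset + length) so that no piece crosses
 * a segment boundary.  Writes at most 'max' pieces into 'out' and returns
 * the number of pieces written.
 */
static size_t split_io(uint64_t offset, uint64_t length,
                       struct io_piece *out, size_t max)
{
    size_t n = 0;

    while (length > 0 && n < max) {
        /* Bytes left before the end of the current segment. */
        uint64_t room = SEG_SIZE - (offset % SEG_SIZE);
        uint64_t len = length < room ? length : room;

        out[n].offset = offset;
        out[n].length = len;
        n++;
        offset += len;
        length -= len;
    }
    return n;
}
```

Each piece can then be checked against the state of exactly one segment.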
126
127    Remember there are four descriptors for each segment, two on each drive:
128
129	DISK1 STATE stored on disk1
130	DISK2 STATE stored on disk1
131
132	DISK1 STATE stored on disk2
133	DISK2 STATE stored on disk2
134
135    In order to simplify matters any inconsistencies between e.g. the DISK2
136    state as stored on disk1 and the DISK2 state as stored on disk2 would
137    be resolved immediately prior to initiation of the actual I/O.  Otherwise
138    the combination of four states is just too complex.
139
140    So if both drives are operational this resolution must take place.  If
141    only one drive is operational then the state stored in the segment
142    descriptors on that one operational drive is consulted to obtain the
143    state of both drives.
144
145    This is the hard part.  Let's take the mismatched cases first.  That is,
146    when the DISK2 STATE stored on DISK1 is different from the DISK2 STATE
147    stored on DISK2 (or vice versa... disk1 state stored on each drive):
148
149	* If one of the two conflicting states has the UNSTABLE or MASTER
150	  bits set then set the same bits in the other.
151
152	  Basically just OR some of the bits together and store to
153	  both copies.  But not all of the bits.
154
155	* If doing a write operation and the segment is marked UNINITIALIZED
156	  the entire segment must be zero-filled and the bit cleared prior
157	  to the write operation. ????  (needs more thought, maybe even a
158	  sub-bitmap. See later on in this email).
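
The "OR some of the bits together" step might look something like the
sketch below.  Which bits get merged beyond UNSTABLE and MASTER is still
open, so the mask here covers only those two.  The bit values and the
seg_resolve name are assumptions for illustration.

```c
#include <stdint.h>

/* Assumed bit values; only the names come from the flag list. */
#define SEG_MEDIA_UNSTABLE  0x0001u
#define SEG_MEDIA_MASTER    0x0008u

/* Only these bits are propagated between the two stored copies. */
#define SEG_MERGE_MASK      (SEG_MEDIA_UNSTABLE | SEG_MEDIA_MASTER)

/*
 * Reconcile the two stored copies of ONE disk's segment state (e.g. the
 * DISK2 state as stored on disk1 and as stored on disk2).  Returns 1 if
 * either copy changed and therefore has to be written back to media.
 */
static int seg_resolve(uint32_t *copy_on_disk1, uint32_t *copy_on_disk2)
{
    uint32_t merged = (*copy_on_disk1 | *copy_on_disk2) & SEG_MERGE_MASK;
    uint32_t n1 = *copy_on_disk1 | merged;
    uint32_t n2 = *copy_on_disk2 | merged;
    int changed = (n1 != *copy_on_disk1) || (n2 != *copy_on_disk2);

    *copy_on_disk1 = n1;
    *copy_on_disk2 = n2;
    return changed;
}
```

Bits outside the mask are left as they were in each copy, so they can
still differ after this step.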
159
160    Ok, now we have done that we can just consider two states, one for
161    DISK1 and one for DISK2, coupled with the I/O operation:
162
163    WHEN READING:
164
165	* If MASTER is NOT set on either drive the read may be
166	  sent to either drive.
167
168	* If MASTER is set on one of the drives the read must be sent
169	  only to that drive.
170
171	* If MASTER is set on both drives then we are screwed.  This case
172	  can occur if one of the mirror drives goes down and a bunch of
173	  writes are made to the other, then the system is rebooted and the
174	  original mirror drive comes up but the other drive goes down.
175
176	  So this condition detects a conflict.  We must return an I/O
177	  error for the READ, presumably.  The only way to resolve this
178	  is for a manual intervention to explicitly select one or the
179	  other drive as the master.
180
181	* If READ_DEGRADED is set on one drive the read can be directed to
182	  the other.  If READ_DEGRADED is set on both drives then either
183	  drive can be selected.  If the read fails on any given drive
184	  it is of course redispatched to the other drive regardless.
185
186	  When READ_DEGRADED is set on one drive and only one drive is up
187	  we still issue the read to that drive, obviously, since we have
188	  no other choice.
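
The read rules above boil down to a small decision function.  This sketch
looks only at the two resolved states; it does not model a drive being
down or the retry on the other drive after a failed read.  Bit values and
names are assumptions.

```c
#include <stdint.h>

/* Assumed bit values; only the names come from the flag list. */
#define SEG_MEDIA_READ_DEGRADED 0x0002u
#define SEG_MEDIA_MASTER        0x0008u

enum read_target {
    READ_EITHER,        /* either drive may service the read */
    READ_DISK1,
    READ_DISK2,
    READ_CONFLICT       /* MASTER on both drives: return an I/O error */
};

/* s1 and s2 are the already-resolved states for disk1 and disk2. */
static enum read_target choose_read(uint32_t s1, uint32_t s2)
{
    int m1 = (s1 & SEG_MEDIA_MASTER) != 0;
    int m2 = (s2 & SEG_MEDIA_MASTER) != 0;
    int d1 = (s1 & SEG_MEDIA_READ_DEGRADED) != 0;
    int d2 = (s2 & SEG_MEDIA_READ_DEGRADED) != 0;

    if (m1 && m2)
        return READ_CONFLICT;
    if (m1)
        return READ_DISK1;
    if (m2)
        return READ_DISK2;

    /* No master: steer away from a read-degraded drive if we can. */
    if (d1 && !d2)
        return READ_DISK2;
    if (d2 && !d1)
        return READ_DISK1;
    return READ_EITHER;
}
```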
189
190    WHEN WRITING:
191
192	* If MASTER is NOT set on either drive the write is directed to
193	  both drives.
194
195	* If MASTER is set on only one drive the WRITE goes only to that drive.
196
197	* If both drives are marked MASTER the write is directed to both
198	  drives.  This is a conflict situation on read but writing will
199	  still work just fine.  The MASTER bit is left alone.
200
201	* If an I/O error occurs on one of the drives the WRITE_DEGRADED
202	  bit is set for that drive and the other drive (where the write
203	  succeeded) is marked as MASTER.
204
205	  However, we can only do this if neither drive is already a MASTER.
206
207	  If a drive is already marked MASTER we cannot mark the other drive
208	  as MASTER.  The failed write will cause an I/O error to be
209	  returned.
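
Here is one way the write rules could be coded up.  The notes leave it
open whether WRITE_DEGRADED is still recorded when a MASTER already exists;
this sketch takes the literal reading and leaves the flags alone in that
case.  Bit values and function names are assumptions.

```c
#include <stdint.h>

/* Assumed bit values; only the names come from the flag list. */
#define SEG_MEDIA_WRITE_DEGRADED 0x0004u
#define SEG_MEDIA_MASTER         0x0008u

#define WRITE_DISK1 0x1u
#define WRITE_DISK2 0x2u

/* Returns a mask of the drives the write should be sent to. */
static unsigned choose_write(uint32_t s1, uint32_t s2)
{
    int m1 = (s1 & SEG_MEDIA_MASTER) != 0;
    int m2 = (s2 & SEG_MEDIA_MASTER) != 0;

    if (m1 == m2)       /* MASTER on neither drive, or on both */
        return WRITE_DISK1 | WRITE_DISK2;
    return m1 ? WRITE_DISK1 : WRITE_DISK2;
}

/*
 * Handle a write error on one drive.  'failed' is the state of the drive
 * that returned the error, 'other' is the drive where the write worked.
 * Returns 0 if the segment was degraded and the other drive promoted to
 * MASTER, or -1 if a MASTER already exists and the I/O error has to be
 * returned to the caller.
 */
static int write_failed(uint32_t *failed, uint32_t *other)
{
    if ((*failed | *other) & SEG_MEDIA_MASTER)
        return -1;
    *failed |= SEG_MEDIA_WRITE_DEGRADED;
    *other |= SEG_MEDIA_MASTER;
    return 0;
}
```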
210
211    RESYNCHRONIZATION:
212
213	* A kernel thread is created to manage mirror synchronization.
214
215	* Synchronization of out-of-sync mirror segments can occur
216	  asynchronously, but must interlock against I/O operations
217	  that might conflict.
218
219	  The segment array on the drive(s) is used to determine what
220	  segments need to be resynchronized.
221
222	* Synchronization occurs when the segment for one drive is
223	  marked MASTER and the segment for the other drive is not.
224
225	* In a conflict situation (where both drives are marked MASTER
226	  for any given segment) a manual intervention is required to
227	  specify (e.g. through an ioctl) which of the two drives is
228	  the master.  This overrides the MASTER bits for all segments
229	  and allows synchronization to occur for all conflicting
230	  segments (or possibly all segments, period, in the case where
231	  a new mirror drive is being deployed).
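
The per-segment decision the resync thread has to make could be sketched
like this.  The crash case (UNSTABLE set, no MASTER anywhere) comes from
the problem list at the top: either copy is valid, so one has to be picked.
Picking disk1 here is an arbitrary choice on my part, as are the bit
values and names.

```c
#include <stdint.h>

/* Assumed bit values; only the names come from the flag list. */
#define SEG_MEDIA_UNSTABLE 0x0001u
#define SEG_MEDIA_MASTER   0x0008u

enum resync_action {
    RESYNC_NONE,            /* segment is in sync */
    RESYNC_1_TO_2,          /* copy the segment from disk1 to disk2 */
    RESYNC_2_TO_1,          /* copy the segment from disk2 to disk1 */
    RESYNC_NEEDS_OPERATOR   /* MASTER on both: wait for manual selection */
};

/* s1 and s2 are the already-resolved states for disk1 and disk2. */
static enum resync_action resync_decide(uint32_t s1, uint32_t s2)
{
    int m1 = (s1 & SEG_MEDIA_MASTER) != 0;
    int m2 = (s2 & SEG_MEDIA_MASTER) != 0;

    if (m1 && m2)
        return RESYNC_NEEDS_OPERATOR;
    if (m1)
        return RESYNC_1_TO_2;
    if (m2)
        return RESYNC_2_TO_1;

    /* Crash before a sync point: either copy is valid, pick disk1. */
    if ((s1 | s2) & SEG_MEDIA_UNSTABLE)
        return RESYNC_1_TO_2;
    return RESYNC_NONE;
}

/*
 * Apply the operator's choice to one conflicting segment by clearing
 * MASTER on the drive that lost.  master_disk is 1 or 2.
 */
static void resync_force_master(int master_disk, uint32_t *s1, uint32_t *s2)
{
    if (master_disk == 1) {
        *s1 |= SEG_MEDIA_MASTER;
        *s2 &= ~SEG_MEDIA_MASTER;
    } else {
        *s2 |= SEG_MEDIA_MASTER;
        *s1 &= ~SEG_MEDIA_MASTER;
    }
}
```

The interlock against in-flight I/O on the segment being copied is not
shown.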
232
233    Segment array on-media and header.
234
235	* The mirroring code must reserve some of the sectors on the
236	  drives to hold a header and the segment array, making the
237	  resulting logical mirror a bit smaller than it otherwise would
238	  be.
239
240	* The header must contain a unique serial number (the uuid code
241	  can be used to generate it).
242
243	* When manual intervention is required to specify a master a new
244	  unique serial number must be generated for that master to
245	  prevent 'old' mirror drives that were removed from the system
246	  from being improperly recognized as being part of the new mirror
247	  when they aren't any more.
248
249	* Automatic detection of the mirror status is possible by using
250	  the serial number in the header.
251
252	* If the serial numbers for the header(s) for the two drives
253	  making up the mirror do not match (when both drives are up and
254	  both header read I/Os succeeded), manual intervention is required.
255
256	* Auto-detection of mirror segments a la GEOM, using on-disk headers,
257	  is discouraged.  I think it is too dangerous and would much rather
258	  the detection be based on the drive serial number rather than on
259	  serial numbers stored on-media in headers.
260
261	  However, I guess this is a function of LVM?  So I might not have
262	  any control over it.
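
The header serial number check from the list above could be as simple as
the sketch below.  The header layout, the 16-byte serial (sized like a
uuid), and the status names are assumptions.  The notes only cover the
case where both header reads succeed, so treating a single readable header
as "degraded" is a guess on my part.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-media header; only the serial number is shown. */
struct mirror_header {
    uint8_t serial[16];
};

enum mirror_status {
    MIRROR_OK,              /* both headers read, serial numbers match */
    MIRROR_DEGRADED,        /* only one header could be read (assumed) */
    MIRROR_NEEDS_OPERATOR,  /* both headers read, serial numbers differ */
    MIRROR_DEAD             /* neither header could be read */
};

/* Pass NULL for a drive whose header read failed. */
static enum mirror_status mirror_check(const struct mirror_header *h1,
                                       const struct mirror_header *h2)
{
    if (h1 == NULL && h2 == NULL)
        return MIRROR_DEAD;
    if (h1 == NULL || h2 == NULL)
        return MIRROR_DEGRADED;
    if (memcmp(h1->serial, h2->serial, sizeof(h1->serial)) != 0)
        return MIRROR_NEEDS_OPERATOR;
    return MIRROR_OK;
}
```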
263
264    The UNINITIALIZED FLAG
265
266	When formatting a new mirror or when a drive is torn out and a new
267	drive is added the drive(s) in question must be formatted.  To
268	avoid actually writing to all sectors of the drive, which would
269	take too long on multi-terabyte drives and create unnecessary
270	writes on things like SSDs, we instead use an UNINITIALIZED flag
271	state in the descriptor.
272
273	If set, any read I/O to the related segment is simply zero-filled.
274
275	When writing we have to zero-fill the segment (write zeros to the
276	whole 128MB segment) and then clear the UNINITIALIZED flag before
277	allowing the write I/O to proceed.
278
279	We might want to use some of the bits in the descriptor as a
280	sub-bitmap.  e.g. if we reserve 4 bytes in the 16-byte descriptor
281	to be an 'UNINITIALIZED' sub-bitmap we can break the 128MB
282	segment down into 4MB pieces and only zero-fill/write portions
283	of the 128MB segment instead of having to do the whole segment.
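
To make the sub-bitmap idea concrete: 4 bytes gives 32 bits, and 128MB / 32
is 4MB per bit.  The sketch below works out which 4MB pieces a write
touches and which of those still need zero-filling.  The names are
hypothetical, and a real version would clear the bits only after the
zero-fill has actually reached the media.

```c
#include <stdint.h>

#define SEG_SIZE   (128u << 20)         /* 128MB segment */
#define PIECE_SIZE (SEG_SIZE / 32)      /* 4MB per bit in a 32-bit bitmap */

/*
 * Mask of the 4MB pieces touched by a write of 'len' bytes at byte offset
 * 'off' within the segment.  Assumes off + len <= SEG_SIZE.
 */
static uint32_t pieces_touched(uint32_t off, uint32_t len)
{
    uint32_t mask = 0;
    uint32_t first, last, i;

    if (len == 0)
        return 0;
    first = off / PIECE_SIZE;
    last = (off + len - 1) / PIECE_SIZE;
    for (i = first; i <= last; i++)
        mask |= 1u << i;
    return mask;
}

/*
 * Before a write: returns the mask of pieces that are still marked
 * uninitialized and must be zero-filled first, and clears those bits in
 * the bitmap.  A set bit means the piece is uninitialized.
 */
static uint32_t uninit_prepare_write(uint32_t *bitmap, uint32_t off,
                                     uint32_t len)
{
    uint32_t need = pieces_touched(off, len) & *bitmap;

    *bitmap &= ~need;
    return need;
}

/* For reads: nonzero if the piece holding 'off' should read as zeros. */
static int uninit_reads_zero(uint32_t bitmap, uint32_t off)
{
    return (bitmap >> (off / PIECE_SIZE)) & 1u;
}
```

A small write then costs at most 8MB of zero-filling (two pieces) instead
of the whole 128MB segment.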
284
285	I don't know how well this idea would work in real life.  Another
286	option is to just return random data for the uninitialized portions
287	of a new mirror but that kinda breaks the whole abstraction and
288	could blow up certain types of filesystems, like ZFS, which
289	assume any read data is stable on-media.
290
291
292						-Matt
293
294
295