.\"     $NetBSD: raidctl.8,v 1.29 2002/02/08 01:30:45 ross Exp $
.\"
.\" Copyright (c) 1998 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to The NetBSD Foundation
.\" by Greg Oster
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"        This product includes software developed by the NetBSD
.\"        Foundation, Inc. and its contributors.
.\" 4. Neither the name of The NetBSD Foundation nor the names of its
.\"    contributors may be used to endorse or promote products derived
.\"    from this software without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.\"
.\" Copyright (c) 1995 Carnegie-Mellon University.
.\" All rights reserved.
.\"
.\" Author: Mark Holland
.\"
.\" Permission to use, copy, modify and distribute this software and
.\" its documentation is hereby granted, provided that both the copyright
.\" notice and this permission notice appear in all copies of the
.\" software, derivative works or modified versions, and any portions
.\" thereof, and that both notices appear in supporting documentation.
.\"
.\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
.\" CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
.\" FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
.\"
.\" Carnegie Mellon requests users of this software to return to
.\"
.\"  Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
.\"  School of Computer Science
.\"  Carnegie Mellon University
.\"  Pittsburgh PA 15213-3890
.\"
.\" any improvements or extensions that they make and grant Carnegie the
.\" rights to redistribute these changes.
.\"
.Dd July 10, 2001
.Dt RAIDCTL 8
.Os
.Sh NAME
.Nm raidctl
.Nd configuration utility for the RAIDframe disk driver
.Sh SYNOPSIS
.Nm ""
.Op Fl v
.Fl a Ar component Ar dev
.Nm ""
.Op Fl v
.Fl A Op yes | no | root
.Ar dev
.Nm ""
.Op Fl v
.Fl B Ar dev
.Nm ""
.Op Fl v
.Fl c Ar config_file Ar dev
.Nm ""
.Op Fl v
.Fl C Ar config_file Ar dev
.Nm ""
.Op Fl v
.Fl f Ar component Ar dev
.Nm ""
.Op Fl v
.Fl F Ar component Ar dev
.Nm ""
.Op Fl v
.Fl g Ar component Ar dev
.Nm ""
.Op Fl v
.Fl G Ar dev
.Nm ""
.Op Fl v
.Fl i Ar dev
.Nm ""
.Op Fl v
.Fl I Ar serial_number Ar dev
.Nm ""
.Op Fl v
.Fl p Ar dev
.Nm ""
.Op Fl v
.Fl P Ar dev
.Nm ""
.Op Fl v
.Fl r Ar component Ar dev
.Nm ""
.Op Fl v
.Fl R Ar component Ar dev
.Nm ""
.Op Fl v
.Fl s Ar dev
.Nm ""
.Op Fl v
.Fl S Ar dev
.Nm ""
.Op Fl v
.Fl u Ar dev
.Sh DESCRIPTION
.Nm ""
is the user-land control program for
.Xr raid 4 ,
the RAIDframe disk device.
.Nm ""
is primarily used to dynamically configure and unconfigure RAIDframe disk
devices.  For more information about the RAIDframe disk device, see
.Xr raid 4 .
.Pp
This document assumes the reader has at least rudimentary knowledge of
RAID and RAID concepts.
.Pp
The command-line options for
.Nm
are as follows:
.Bl -tag -width indent
.It Fl a Ar component Ar dev
Add
.Ar component
as a hot spare for the device
.Ar dev .
.It Fl A Ic yes Ar dev
Make the RAID set auto-configurable.  The RAID set will be
automatically configured at boot
.Ar before
the root file system is
mounted.  Note that all components of the set must be of type RAID in the
disklabel.
.It Fl A Ic no Ar dev
Turn off auto-configuration for the RAID set.
.It Fl A Ic root Ar dev
Make the RAID set auto-configurable, and also mark the set as being
eligible to be the root partition.  A RAID set configured this way
will
.Ar override
the use of the boot disk as the root device.  All components of the
set must be of type RAID in the disklabel.  Note that the kernel being
booted must currently reside on a non-RAID set.
.It Fl B Ar dev
Initiate a copyback of reconstructed data from a spare disk to
its original disk.  This is performed after a component has failed,
and the failed drive has been reconstructed onto a spare drive.
.It Fl c Ar config_file Ar dev
Configure the RAIDframe device
.Ar dev
according to the configuration given in
.Ar config_file .
A description of the contents of
.Ar config_file
is given later.
.It Fl C Ar config_file Ar dev
As for
.Fl c ,
but forces the configuration to take place.  This is required the
first time a RAID set is configured.
.It Fl f Ar component Ar dev
This marks the specified
.Ar component
as having failed, but does not initiate a reconstruction of that
component.
.It Fl F Ar component Ar dev
Fails the specified
.Ar component
of the device, and immediately begins a reconstruction of the failed
disk onto an available hot spare.  This is one of the mechanisms used to start
the reconstruction process if a component has a hardware failure.
.It Fl g Ar component Ar dev
Get the component label for the specified component.
.It Fl G Ar dev
Generate the configuration of the RAIDframe device in a format suitable for
use with
.Nm
.Fl c
or
.Fl C ,
as shown in the example following this list.
.It Fl i Ar dev
Initialize the RAID device.  In particular, re-write the parity on
the selected device.  This
.Ar MUST
be done for
.Ar all
RAID sets before the RAID device is labeled and before
file systems are created on the RAID device.
.It Fl I Ar serial_number Ar dev
Initialize the component labels on each component of the device.
.Ar serial_number
is used as one of the keys in determining whether a
particular set of components belongs to the same RAID set.  While not
strictly enforced, different serial numbers should be used for
different RAID sets.  This step
.Ar MUST
be performed when a new RAID set is created.
.It Fl p Ar dev
Check the status of the parity on the RAID set.  Displays a status
message, and returns successfully if the parity is up-to-date.
.It Fl P Ar dev
Check the status of the parity on the RAID set, and initialize
(re-write) the parity if the parity is not known to be up-to-date.
This is normally used after a system crash (and before a
.Xr fsck 8 )
to ensure the integrity of the parity.
.It Fl r Ar component Ar dev
Remove the spare disk specified by
.Ar component
from the set of available spare components.
.It Fl R Ar component Ar dev
Fails the specified
.Ar component ,
if necessary, and immediately begins a reconstruction back to
.Ar component .
This is useful for reconstructing back onto a component after
it has been replaced following a failure.
.It Fl s Ar dev
Display the status of the RAIDframe device for each of the components
and spares.
.It Fl S Ar dev
Check the status of parity re-writing, component reconstruction, and
component copyback.  The output indicates the amount of progress
achieved in each of these areas.
.It Fl u Ar dev
Unconfigure the RAIDframe device.
.It Fl v
Be more verbose.  For operations such as reconstructions, parity
re-writing, and copybacks, provide a progress indicator.
.El
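.Pp
For example, the configuration of an already-running set could be
captured for later use with
.Fl c
by saving the output of
.Fl G
to a file (a sketch; the output file name here is arbitrary):
.Bd -literal -offset indent
raidctl -G raid0 \*[Gt] raid0.conf
.Ed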
.Pp
The device used by
.Nm
is specified by
.Ar dev .
.Ar dev
may be either the full name of the device, e.g. /dev/rraid0d
for the i386 architecture and /dev/rraid0c
for all others, or simply raid0 (for /dev/rraid0d).
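.Pp
For example, on an i386 system the following two commands are
equivalent ways of requesting the status of the same set:
.Bd -literal -offset indent
raidctl -s raid0
raidctl -s /dev/rraid0d
.Ed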
.Ss Configuration file
The format of the configuration file is complex, and
only an abbreviated treatment is given here.  In the configuration
files, a
.Sq #
indicates the beginning of a comment.
.Pp
There are 4 required sections of a configuration file, and 2
optional sections.  Each section begins with a
.Sq START ,
followed by
the section name, and the configuration parameters associated with that
section.  The first section is the
.Sq array
section, and it specifies
the number of rows, columns, and spare disks in the RAID set.  For
example:
.Bd -literal -offset indent
START array
1 3 0
.Ed
.Pp
indicates an array with 1 row, 3 columns, and 0 spare disks.  Note
that although multi-dimensional arrays may be specified, they are
.Ar NOT
supported in the driver.
.Pp
The second section, the
.Sq disks
section, specifies the actual
components of the device.  For example:
.Bd -literal -offset indent
START disks
/dev/sd0e
/dev/sd1e
/dev/sd2e
.Ed
.Pp
specifies the three component disks to be used in the RAID device.  If
any of the specified drives cannot be found when the RAID device is
configured, then they will be marked as
.Sq failed ,
and the system will
operate in degraded mode.  Note that it is
.Ar imperative
that the order of the components in the configuration file does not
change between configurations of a RAID device.  Changing the order
of the components will result in data loss if the set is configured
with the
.Fl C
option.  In normal circumstances, the RAID set will not configure if
only
.Fl c
is specified, and the components are out-of-order.
.Pp
The next section, which is the
.Sq spare
section, is optional, and, if
present, specifies the devices to be used as
.Sq hot spares
-- devices
which are on-line, but are not actively used by the RAID driver unless
one of the main components fails.  A simple
.Sq spare
section might be:
.Bd -literal -offset indent
START spare
/dev/sd3e
.Ed
.Pp
for a configuration with a single spare component.  If no spare drives
are to be used in the configuration, then the
.Sq spare
section may be omitted.
.Pp
The next section is the
.Sq layout
section.  This section describes the
general layout parameters for the RAID device, and provides such
information as sectors per stripe unit, stripe units per parity unit,
stripe units per reconstruction unit, and the parity configuration to
use.  This section might look like:
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
.Ed
.Pp
The sectors per stripe unit specifies, in blocks, the interleave
factor; i.e. the number of contiguous sectors to be written to each
component for a single stripe.  Appropriate selection of this value
(32 in this example) is the subject of much research in RAID
architectures.  The stripe units per parity unit and
stripe units per reconstruction unit are normally each set to 1.
While certain values above 1 are permitted, a discussion of valid
values and the consequences of using anything other than 1 are outside
the scope of this document.  The last value in this section (5 in this
example) indicates the parity configuration desired.  Valid entries
include:
.Bl -tag -width inde
.It 0
RAID level 0.  No parity, only simple striping.
.It 1
RAID level 1.  Mirroring.  The parity is the mirror.
.It 4
RAID level 4.  Striping across components, with parity stored on the
last component.
.It 5
RAID level 5.  Striping across components, parity distributed across
all components.
.El
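.Pp
As a concrete illustration (assuming the usual 512-byte sectors), the
value of 32 in the sample
.Sq layout
section above means that 16 kilobytes are written to each component
per stripe unit; in a three-component RAID level 5 set, each full
stripe then carries two data stripe units, or 32 kilobytes of data.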
.Pp
There are other valid entries here, including those for Even-Odd
parity, RAID level 5 with rotated sparing, Chained declustering,
and Interleaved declustering, but as of this writing the code for
those parity operations has not been tested with
.Nx .
.Pp
The next required section is the
.Sq queue
section.  This is most often
specified as:
.Bd -literal -offset indent
START queue
fifo 100
.Ed
.Pp
where the queuing method is specified as fifo (first-in, first-out),
and the size of the per-component queue is limited to 100 requests.
Other queuing methods may also be specified, but a discussion of them
is beyond the scope of this document.
.Pp
The final section, the
.Sq debug
section, is optional.  For more details
on this the reader is referred to the RAIDframe documentation
discussed in the
.Sx HISTORY
section.
.Pp
See
.Sx EXAMPLES
for a more complete configuration file example.
.Sh FILES
.Bl -tag -width /dev/XXrXraidX -compact
.It Pa /dev/{,r}raid*
.Cm raid
device special files.
.El
.Sh EXAMPLES
It is highly recommended that, before using the RAID driver for real
file systems, the system administrator(s) become quite familiar
with the use of
.Nm "" ,
and that they understand how the component reconstruction process
works.  The examples in this section will focus on configuring a
number of different RAID sets of varying degrees of redundancy.
By working through these examples, administrators should be able to
develop a good feel for how to configure a RAID set, and how to
initiate reconstruction of failed components.
.Pp
In the following examples
.Sq raid0
will be used to denote the RAID device.  Depending on the
architecture,
.Sq /dev/rraid0c
or
.Sq /dev/rraid0d
may be used in place of
.Sq raid0 .
.Ss Initialization and Configuration
The initial step in configuring a RAID set is to identify the components
that will be used in the RAID set.  All components should be the same
size.  Each component should have a disklabel type of
.Dv FS_RAID ,
and a typical disklabel entry for a RAID component
might look like:
.Bd -literal -offset indent
f:  1800000  200495     RAID              # (Cyl.  405*- 4041*)
.Ed
.Pp
While
.Dv FS_BSDFFS
will also work as the component type, the type
.Dv FS_RAID
is preferred for RAIDframe use, as it is required for features such as
auto-configuration.  As part of the initial configuration of each RAID
set, each component will be given a
.Sq component label .
A
.Sq component label
contains important information about the component, including a
user-specified serial number, the row and column of that component in
the RAID set, the redundancy level of the RAID set, a 'modification
counter', and whether the parity information (if any) on that
component is known to be correct.  Component labels are an integral
part of the RAID set, since they are used to ensure that components
are configured in the correct order, and used to keep track of other
vital information about the RAID set.  Component labels are also
required for the auto-detection and auto-configuration of RAID sets at
boot time.  For a component label to be considered valid, that
particular component label must be in agreement with the other
component labels in the set.  For example, the serial number,
.Sq modification counter ,
number of rows and number of columns must all
be in agreement.  If any of these are different, then the component is
not considered to be part of the set.  See
.Xr raid 4
for more information about component labels.
.Pp
Once the components have been identified, and the disks have
appropriate labels,
.Nm ""
is then used to configure the
.Xr raid 4
device.  To configure the device, a configuration
file which looks something like:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 3 1

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e

START spare
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
32 1 1 5

START queue
fifo 100
.Ed
.Pp
is created in a file.  The above configuration file specifies a RAID 5
set consisting of the components /dev/sd1e, /dev/sd2e, and /dev/sd3e,
with /dev/sd4e available as a
.Sq hot spare
in case one of
the three main drives should fail.  A RAID 0 set would be specified in
a similar way:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/sd10e
/dev/sd11e
/dev/sd12e
/dev/sd13e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
64 1 1 0

START queue
fifo 100
.Ed
.Pp
In this case, devices /dev/sd10e, /dev/sd11e, /dev/sd12e, and /dev/sd13e
are the components that make up this RAID set.  Note that there are no
hot spares for a RAID 0 set, since there is no way to recover data if
any of the components fail.
.Pp
For a RAID 1 (mirror) set, the following configuration might be used:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd20e
/dev/sd21e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
In this case, /dev/sd20e and /dev/sd21e are the two components of the
mirror set.  While no hot spares have been specified in this
configuration, they easily could be, just as they were specified in
the RAID 5 case above.  Note as well that RAID 1 sets are currently
limited to only 2 components.  At present, n-way mirroring is not
possible.
.Pp
The first time a RAID set is configured, the
.Fl C
option must be used:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
where
.Sq raid0.conf
is the name of the RAID configuration file.  The
.Fl C
option forces the configuration to succeed, even if any of the component
labels are incorrect.  The
.Fl C
option should not be used lightly in
situations other than initial configurations, since if
the system is refusing to configure a RAID set, there is probably a
very good reason for it.  After the initial configuration is done (and
appropriate component labels are added with the
.Fl I
option) then raid0 can be configured normally with:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
When the RAID set is configured for the first time, it is
necessary to initialize the component labels, and to initialize the
parity on the RAID set.  Initializing the component labels is done with:
.Bd -literal -offset indent
raidctl -I 112341 raid0
.Ed
.Pp
where
.Sq 112341
is a user-specified serial number for the RAID set.  This
initialization step is
.Ar required
for all RAID sets.  As well, using different
serial numbers between RAID sets is
.Ar strongly encouraged ,
as using the same serial number for all RAID sets will only serve to
decrease the usefulness of the component label checking.
.Pp
Initializing the RAID set is done via the
.Fl i
option.  This initialization
.Ar MUST
be done for
.Ar all
RAID sets, since among other things it verifies that the parity (if
any) on the RAID set is correct.  Since this initialization may be
quite time-consuming, the
.Fl v
option may also be used in conjunction with
.Fl i :
.Bd -literal -offset indent
raidctl -iv raid0
.Ed
.Pp
This will give more verbose output on the
status of the initialization:
.Bd -literal -offset indent
Initiating re-write of parity
Parity Re-write status:
 10% |****                                   | ETA:    06:03 /
.Ed
.Pp
The output provides a
.Sq Percent Complete
in both a numeric and graphical format, as well as an estimated time
to completion of the operation.
.Pp
Since it is the parity that provides the
.Sq redundancy
part of RAID, it is critical that the parity is correct
as much as possible.  If the parity is not correct, then there is no
guarantee that data will not be lost if a component fails.
.Pp
Once the parity is known to be correct,
it is then safe to perform
.Xr disklabel 8 ,
.Xr newfs 8 ,
or
.Xr fsck 8
on the device or its file systems, and then to mount the file systems
for use.
.Pp
Under certain circumstances (e.g. the additional component has not
arrived, or data is being migrated off of a disk destined to become a
component) it may be desirable to configure a RAID 1 set with only
a single component.  This can be achieved by configuring the set with
a physically existing component (as either the first or second
component) and with a
.Sq fake
component.  In the following:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd6e
/dev/sd0e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
/dev/sd0e is the real component, and will be the second disk of a RAID 1
set.  The component /dev/sd6e, which must exist but has no physical
device associated with it, is simply used as a placeholder.
Configuration (using
.Fl C
and
.Fl I Ar 12345
as above) proceeds normally, but initialization of the RAID set will
have to wait until all physical components are present.  After
configuration, this set can be used normally, but will be operating
in degraded mode.  Once a second physical component is obtained, it
can be hot-added, the existing data mirrored, and normal operation
resumed.
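.Pp
As a sketch of that last step (assuming the new disk appears as
/dev/sd5e, a hypothetical device, and following the hot spare and
reconstruction procedures described below), the sequence might be:
.Bd -literal -offset indent
raidctl -a /dev/sd5e raid0
raidctl -F /dev/sd6e raid0
.Ed
.Pp
where /dev/sd6e (or whatever component name
.Fl s
reports for the missing component) is the placeholder to be
reconstructed onto the newly added spare.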
.Ss Maintenance of the RAID set
After the parity has been initialized for the first time, the command:
.Bd -literal -offset indent
raidctl -p raid0
.Ed
.Pp
can be used to check the current status of the parity.  To check the
parity and rebuild it if necessary (for example, after an unclean
shutdown) the command:
.Bd -literal -offset indent
raidctl -P raid0
.Ed
.Pp
is used.  Note that re-writing the parity can be done while
other operations on the RAID set are taking place (e.g. while doing a
.Xr fsck 8
on a file system on the RAID set).  However: for maximum effectiveness
of the RAID set, the parity should be known to be correct before any
data on the set is modified.
.Pp
To see how the RAID set is doing, the following command can be used to
show the RAID set's status:
.Bd -literal -offset indent
raidctl -s raid0
.Ed
.Pp
The output will look something like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
Component label for /dev/sd1e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd2e:
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd3e:
   Row: 0 Column: 2 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that all is well with the RAID set.  Of importance here
are the component lines which read
.Sq optimal ,
and the
.Sq Parity status
line which indicates that the parity is up-to-date.  Note that if
there are file systems open on the RAID set, the individual components
will not be
.Sq clean
but the set as a whole can still be clean.
753.Pp
754To check the component label of /dev/sd1e, the following is used:
755.Bd -literal -offset indent
756raidctl -g /dev/sd1e raid0
757.Ed
758.Pp
759The output of this command will look something like:
760.Bd -literal -offset indent
761Component label for /dev/sd1e:
762   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
763   Version: 2 Serial Number: 13432 Mod Counter: 65
764   Clean: No Status: 0
765   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
766   RAID Level: 5  blocksize: 512 numBlocks: 1799936
767   Autoconfig: No
768   Last configured as: raid0
769.Ed
770.Ss Dealing with Component Failures
771If for some reason
772(perhaps to test reconstruction) it is necessary to pretend a drive
773has failed, the following will perform that function:
774.Bd -literal -offset indent
775raidctl -f /dev/sd2e raid0
776.Ed
777.Pp
778The system will then be performing all operations in degraded mode,
779where missing data is re-computed from existing data and the parity.
780In this case, obtaining the status of raid0 will return (in part):
781.Bd -literal -offset indent
782Components:
783           /dev/sd1e: optimal
784           /dev/sd2e: failed
785           /dev/sd3e: optimal
786Spares:
787           /dev/sd4e: spare
788.Ed
789.Pp
790Note that with the use of
791.Fl f
792a reconstruction has not been started.  To both fail the disk and
793start a reconstruction, the
794.Fl F
795option must be used:
796.Bd -literal -offset indent
797raidctl -F /dev/sd2e raid0
798.Ed
799.Pp
800The
801.Fl f
802option may be used first, and then the
803.Fl F
804option used later, on the same disk, if desired.
805Immediately after the reconstruction is started, the status will report:
806.Bd -literal -offset indent
807Components:
808           /dev/sd1e: optimal
809           /dev/sd2e: reconstructing
810           /dev/sd3e: optimal
811Spares:
812           /dev/sd4e: used_spare
813[...]
814Parity status: clean
815Reconstruction is 10% complete.
816Parity Re-write is 100% complete.
817Copyback is 100% complete.
818.Ed
.Pp
This indicates that a reconstruction is in progress.  To find out how
the reconstruction is progressing the
.Fl S
option may be used.  This will indicate the progress in terms of the
percentage of the reconstruction that is completed.
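For this example set, that check would be:
.Bd -literal -offset indent
raidctl -S raid0
.Ed
.Pp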
When the
reconstruction is finished the
.Fl s
option will show:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: spared
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
At this point there are at least two options.  First, if /dev/sd2e is
known to be good (i.e. the failure was either caused by
.Fl f
or
.Fl F ,
or the failed disk was replaced), then a copyback of the data can
be initiated with the
.Fl B
option.  In this example, this would copy the entire contents of
/dev/sd4e to /dev/sd2e.
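The copyback is started with:
.Bd -literal -offset indent
raidctl -B raid0
.Ed
.Pp
Once the copyback procedure is complete, the
status of the device would be (in part):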
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
and the system is back to normal operation.
.Pp
The second option after the reconstruction is to simply use /dev/sd4e
in place of /dev/sd2e in the configuration file.  For example, the
configuration file (in part) might now look like:
.Bd -literal -offset indent
START array
1 3 0

START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
.Ed
.Pp
This can be done as /dev/sd4e is completely interchangeable with
/dev/sd2e at this point.  Note that extreme care must be taken when
changing the order of the drives in a configuration.  This is one of
the few instances where the devices and/or their orderings can be
changed without loss of data!  In general, the ordering of components
in a configuration file should
.Ar never
be changed.
.Pp
If a component fails and there are no hot spares
available on-line, the status of the RAID set might (in part) look like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In this case there are a number of options.  The first option is to add a hot
spare using:
.Bd -literal -offset indent
raidctl -a /dev/sd4e raid0
.Ed
.Pp
After the hot add, the status would then be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
Reconstruction could then take place using
.Fl F
as described above.
.Pp
A second option is to rebuild directly onto /dev/sd2e.  Once the disk
containing /dev/sd2e has been replaced, one can simply use:
.Bd -literal -offset indent
raidctl -R /dev/sd2e raid0
.Ed
.Pp
to rebuild the /dev/sd2e component.  As the rebuilding is in progress,
the status will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: reconstructing
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
and when completed, will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In circumstances where a particular component is completely
unavailable after a reboot, a special component name will be used to
indicate the missing component.  For example:
.Bd -literal -offset indent
Components:
           /dev/sd2e: optimal
          component1: failed
No spares.
.Ed
.Pp
indicates that the second component of this RAID set was not detected
at all by the auto-configuration code.  The name
.Sq component1
can be used anywhere a normal component name would be used.  For
example, to add a hot spare to the above set, and rebuild to that hot
spare, the following could be done:
.Bd -literal -offset indent
raidctl -a /dev/sd3e raid0
raidctl -F component1 raid0
.Ed
.Pp
at which point the data missing from
.Sq component1
would be reconstructed onto /dev/sd3e.
.Pp
When more than one component is marked as
.Sq failed
due to a non-component hardware failure (e.g. loss of power to two
components, adapter problems, termination problems, or cabling issues) it
is quite possible to recover the data on the RAID set.  The first
thing to be aware of is that the first disk to fail will almost certainly
be out-of-sync with the remainder of the array.  If any IO was
performed between the time the first component is considered
.Sq failed
and when the second component is considered
.Sq failed ,
then the first component to fail will
.Ar not
contain correct data, and should be ignored.  When the second
component is marked as failed, however, the RAID device will
(currently) panic the system.  At this point the data on the RAID set
(not including the first failed component) is still self-consistent,
and will be in no worse state of repair than had the power gone out in
the middle of a write to a file system on a non-RAID device.
The problem, however, is that the component labels may now have 3
different 'modification counters' (one value on the first component
that failed, one value on the second component that failed, and a
third value on the remaining components).  In such a situation, the
RAID set will not autoconfigure, and can only be forcibly re-configured
with the
.Fl C
option.  To recover the RAID set, one must first remedy whatever physical
problem caused the multiple-component failure.  After that is done,
the RAID set can be restored by forcibly configuring the RAID set
.Ar without
the component that failed first.  For example, if /dev/sd1e and
/dev/sd2e fail (in that order) in a RAID set of the following
configuration:
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
then the following configuration (say "recover_raid0.conf")
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd6e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
(where /dev/sd6e has no physical device) can be used with
.Bd -literal -offset indent
raidctl -C recover_raid0.conf raid0
.Ed
.Pp
to force the configuration of raid0.  A
.Bd -literal -offset indent
raidctl -I 12345 raid0
.Ed
.Pp
will be required in order to synchronize the component labels.
At this point the file systems on the RAID set can then be checked and
corrected.  To complete the reconstruction of the RAID set,
/dev/sd1e is simply hot-added back into the array, and reconstructed
as described earlier.
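.Pp
In outline (a sketch following the earlier hot spare example, where
the name used to fail the missing first component is whatever
.Fl s
reports for it), that final step would look like:
.Bd -literal -offset indent
raidctl -a /dev/sd1e raid0
raidctl -F component0 raid0
.Ed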
.Ss RAID on RAID
RAID sets can be layered to create more complex and much larger RAID
sets.  A RAID 0 set, for example, could be constructed from four RAID
5 sets.  The following configuration file shows such a setup:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/raid1e
/dev/raid2e
/dev/raid3e
/dev/raid4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0

START queue
fifo 100
.Ed
.Pp
A similar configuration file might be used for a RAID 0 set
constructed from components on RAID 1 sets.  In such a configuration,
the mirroring provides a high degree of redundancy, while the striping
provides additional speed benefits.
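.Pp
A sketch of such a configuration, assuming raid1 and raid2 are
existing RAID 1 sets, might be:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/raid1e
/dev/raid2e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0

START queue
fifo 100
.Ed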
.Ss Auto-configuration and Root on RAID
RAID sets can also be auto-configured at boot.  To make a set
auto-configurable, simply prepare the RAID set as above, and then do
a:
.Bd -literal -offset indent
raidctl -A yes raid0
.Ed
.Pp
to turn on auto-configuration for that set.  To turn off
auto-configuration, use:
.Bd -literal -offset indent
raidctl -A no raid0
.Ed
.Pp
RAID sets which are auto-configurable will be configured before the
root file system is mounted.  These RAID sets are thus available for
use as a root file system, or for any other file system.  A primary
advantage of using the auto-configuration is that RAID components
become more independent of the disks they reside on.  For example,
SCSI ID's can change, but auto-configured sets will always be
configured correctly, even if the SCSI ID's of the component disks
have become scrambled.
.Pp
Having a system's root file system
.Pq Pa /
on a RAID set is also allowed,
with the
.Sq a
partition of such a RAID set being used for
.Pa / .
To use raid0a as the root file system, simply use:
.Bd -literal -offset indent
raidctl -A root raid0
.Ed
.Pp
To return raid0 to being just an auto-configuring set, simply use the
.Fl A Ar yes
arguments.
.Pp
Note that kernels can only be directly read from RAID 1 components on
alpha and pmax architectures.  On those architectures, the
.Dv FS_RAID
file system is recognized by the bootblocks, and will properly load the
kernel directly from a RAID 1 component.  For other architectures, or
to support the root file system on other RAID sets, some other
mechanism must be used to get a kernel booting.  For example, a small
partition containing only the secondary boot-blocks and an alternate
kernel (or two) could be used.  Once a kernel is booting however, and
an auto-configuring RAID set is found that is eligible to be root,
then that RAID set will be auto-configured and used as the root
device.  If two or more RAID sets claim to be root devices, then the
user will be prompted to select the root device.  At this time, RAID
0, 1, 4, and 5 sets are all supported as root devices.
.Pp
A typical RAID 1 setup with root on RAID might be as follows:
.Bl -enum
.It
wd0a - a small partition, which contains a complete, bootable, basic
.Nx
installation.
.It
wd1a - also contains a complete, bootable, basic
.Nx
installation.
.It
wd0e and wd1e - a RAID 1 set, raid0, used for the root file system.
.It
wd0f and wd1f - a RAID 1 set, raid1, which will be used only for
swap space.
.It
wd0g and wd1g - a RAID 1 set, raid2, used for
.Pa /usr ,
.Pa /home ,
or other data, if desired.
.It
wd0h and wd1h - a RAID 1 set, raid3, if desired.
.El
.Pp
RAID sets raid0, raid1, and raid2 are all marked as
auto-configurable.  raid0 is marked as being a root file system.
When new kernels are installed, the kernel is not only copied to
.Pa / ,
but also to wd0a and wd1a.  The kernel on wd0a is required, since that
is the kernel the system boots from.  The kernel on wd1a is also
required, since that will be the kernel used should wd0 fail.  The
important point here is to have redundant copies of the kernel
available, in the event that one of the drives fails.
.Pp
There is no requirement that the root file system be on the same disk
as the kernel.  For example, obtaining the kernel from wd0a, and using
sd0e and sd1e for raid0 and the root file system, is fine.  It
.Ar is
critical, however, that there be multiple kernels available, in the
event of media failure.
.Pp
Multi-layered RAID devices (such as a RAID 0 set made
up of RAID 1 sets) are
.Ar not
supported as root devices or auto-configurable devices at this point.
(Multi-layered RAID devices
.Ar are
supported in general, however, as mentioned earlier.)  Note that in
order to enable component auto-detection and auto-configuration of
RAID devices, the line:
.Bd -literal -offset indent
options    RAID_AUTOCONFIG
.Ed
.Pp
must be in the kernel configuration file.  See
.Xr raid 4
for more details.
.Ss Unconfiguration
The final operation performed by
.Nm
is to unconfigure a
.Xr raid 4
device.  This is accomplished via a simple:
.Bd -literal -offset indent
raidctl -u raid0
.Ed
.Pp
at which point the device is ready to be reconfigured.
.Ss Performance Tuning
Selection of the various parameter values which result in the best
performance can be quite tricky, and often requires a bit of
trial-and-error to get those values most appropriate for a given system.
A whole range of factors come into play, including:
.Bl -enum
.It
Types of components (e.g. SCSI vs. IDE) and their bandwidth
.It
Types of controller cards and their bandwidth
.It
Distribution of components among controllers
.It
IO bandwidth
.It
File system access patterns
.It
CPU speed
.El
.Pp
As with most performance tuning, benchmarking under real-life loads
may be the only way to measure expected performance.  Understanding
some of the underlying technology is also useful in tuning.  The goal
of this section is to provide pointers to those parameters which may
make significant differences in performance.
.Pp
For a RAID 1 set, a SectPerSU value of 64 or 128 is typically
sufficient.  Since data in a RAID 1 set is arranged in a linear
fashion on each component, selecting an appropriate stripe size is
somewhat less critical than it is for a RAID 5 set.  However: a stripe
size that is too small will cause large IO's to be broken up into a
number of smaller ones, hurting performance.  At the same time, a
large stripe size may cause problems with concurrent accesses to
stripes, which may also affect performance.  Thus values in the range
of 32 to 128 are often the most effective.
.Pp
Tuning RAID 5 sets is trickier.  In the best case, IO is presented to
the RAID set one stripe at a time.  Since the entire stripe is
available at the beginning of the IO, the parity of that stripe can
be calculated before the stripe is written, and then the stripe data
and parity can be written in parallel.  When the amount of data being
written is less than a full stripe worth, the
.Sq small write
problem occurs.  Since a
.Sq small write
means only a portion of the stripe on the components is going to
change, the data (and parity) on the components must be updated
slightly differently.  First, the
.Sq old parity
and
.Sq old data
must be read from the components.  Then the new parity is constructed,
using the new data to be written, and the old data and old parity.
Finally, the new data and new parity are written.  All this extra data
shuffling results in a serious loss of performance, and is typically 2
to 4 times slower than a full stripe write (or read).  To combat this
problem in the real world, it may be useful to ensure that stripe
sizes are small enough that a
.Sq large IO
from the system will use exactly one large stripe write.  As is seen
later, there are some file system dependencies which may come into play
here as well.
.Pp
Since the size of a
.Sq large IO
is often (currently) only 32K or 64K, on a 5-drive RAID 5 set it may
be desirable to select a SectPerSU value of 16 blocks (8K) or 32
blocks (16K).  Since there are 4 data stripe units per stripe, the maximum
data per stripe is 64 blocks (32K) or 128 blocks (64K).  Again,
empirical measurement will provide the best indicators of which
values will yield better performance.
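.Pp
For such a five-component RAID 5 set, the corresponding
.Sq layout
section might read (a sketch, using the smaller of the two values):
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
16 1 1 5
.Ed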
.Pp
The parameters used for the file system are also critical to good
performance.  For
.Xr newfs 8 ,
for example, increasing the block size to 32K or 64K may improve
performance dramatically.  As well, changing the cylinders-per-group
parameter from 16 to 32 or higher is often not only necessary for
larger file systems, but may also have positive performance
implications.
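.Pp
As a sketch (the exact values, as noted above, are file system and
workload dependent), such a file system might be created with
something like:
.Bd -literal -offset indent
newfs -b 65536 -f 8192 -c 32 /dev/rraid0e
.Ed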
.Ss Summary
Despite the length of this man-page, configuring a RAID set is a
relatively straight-forward process.  All that needs to be done is to
follow these steps:
.Bl -enum
.It
Use
.Xr disklabel 8
to create the components (of type RAID).
.It
Construct a RAID configuration file: e.g.
.Sq raid0.conf
.It
Configure the RAID set with:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
.It
Initialize the component labels with:
.Bd -literal -offset indent
raidctl -I 123456 raid0
.Ed
.Pp
.It
Initialize other important parts of the set with:
.Bd -literal -offset indent
raidctl -i raid0
.Ed
.Pp
.It
Get the default label for the RAID set:
.Bd -literal -offset indent
disklabel raid0 \*[Gt] /tmp/label
.Ed
.Pp
.It
Edit the label:
.Bd -literal -offset indent
vi /tmp/label
.Ed
.Pp
.It
Put the new label on the RAID set:
.Bd -literal -offset indent
disklabel -R -r raid0 /tmp/label
.Ed
.Pp
.It
Create the file system:
.Bd -literal -offset indent
newfs /dev/rraid0e
.Ed
.Pp
.It
Mount the file system:
.Bd -literal -offset indent
mount /dev/raid0e /mnt
.Ed
.Pp
.It
Use:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
to re-configure the RAID set the next time it is needed, or put
raid0.conf into /etc where it will automatically be started by
the /etc/rc scripts.
.El
.Sh SEE ALSO
.Xr ccd 4 ,
.Xr raid 4 ,
.Xr rc 8
.Sh HISTORY
RAIDframe is a framework for rapid prototyping of RAID structures
developed by the folks at the Parallel Data Laboratory at Carnegie
Mellon University (CMU).
A more complete description of the internals and functionality of
RAIDframe is found in the paper "RAIDframe: A Rapid Prototyping Tool
for RAID Systems", by William V. Courtright II, Garth Gibson, Mark
Holland, LeAnn Neal Reilly, and Jim Zelenka, and published by the
Parallel Data Laboratory of Carnegie Mellon University.
.Pp
The
.Nm
command first appeared as a program in CMU's RAIDframe v1.1 distribution.  This
version of
.Nm
is a complete re-write, and first appeared in
.Nx 1.4 .
.Sh COPYRIGHT
.Bd -literal
The RAIDframe Copyright is as follows:

Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.

Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.

CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.

Carnegie Mellon requests users of this software to return to

 Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
 School of Computer Science
 Carnegie Mellon University
 Pittsburgh PA 15213-3890

any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
.Ed
.Sh WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some
data loss due to component failure.  However, the loss of two
components of a RAID 4 or 5 system, or the loss of a single component
of a RAID 0 system, will result in the entire file system being lost.
RAID is
.Ar NOT
a substitute for good backup practices.
.Pp
Recomputation of parity
.Ar MUST
be performed whenever there is a chance that it may have been
compromised.  This includes after system crashes, or before a RAID
device has been used for the first time.  Failure to keep parity
correct will be catastrophic should a component ever fail -- it is
better to use RAID 0 and get the additional space and speed, than it
is to use parity, but not keep the parity correct.  At least with RAID
0 there is no perception of increased data security.
.Sh BUGS
Hot-spare removal is currently not available.
1419