1.\" $NetBSD: raidctl.8,v 1.29 2002/02/08 01:30:45 ross Exp $ 2.\" 3.\" Copyright (c) 1998 The NetBSD Foundation, Inc. 4.\" All rights reserved. 5.\" 6.\" This code is derived from software contributed to The NetBSD Foundation 7.\" by Greg Oster 8.\" 9.\" Redistribution and use in source and binary forms, with or without 10.\" modification, are permitted provided that the following conditions 11.\" are met: 12.\" 1. Redistributions of source code must retain the above copyright 13.\" notice, this list of conditions and the following disclaimer. 14.\" 2. Redistributions in binary form must reproduce the above copyright 15.\" notice, this list of conditions and the following disclaimer in the 16.\" documentation and/or other materials provided with the distribution. 17.\" 3. All advertising materials mentioning features or use of this software 18.\" must display the following acknowledgement: 19.\" This product includes software developed by the NetBSD 20.\" Foundation, Inc. and its contributors. 21.\" 4. Neither the name of The NetBSD Foundation nor the names of its 22.\" contributors may be used to endorse or promote products derived 23.\" from this software without specific prior written permission. 24.\" 25.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS 26.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 27.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 28.\" PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS 29.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 30.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 31.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 32.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 33.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 34.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 35.\" POSSIBILITY OF SUCH DAMAGE. 36.\" 37.\" 38.\" Copyright (c) 1995 Carnegie-Mellon University. 39.\" All rights reserved. 40.\" 41.\" Author: Mark Holland 42.\" 43.\" Permission to use, copy, modify and distribute this software and 44.\" its documentation is hereby granted, provided that both the copyright 45.\" notice and this permission notice appear in all copies of the 46.\" software, derivative works or modified versions, and any portions 47.\" thereof, and that both notices appear in supporting documentation. 48.\" 49.\" CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS" 50.\" CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND 51.\" FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE. 52.\" 53.\" Carnegie Mellon requests users of this software to return to 54.\" 55.\" Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU 56.\" School of Computer Science 57.\" Carnegie Mellon University 58.\" Pittsburgh PA 15213-3890 59.\" 60.\" any improvements or extensions that they make and grant Carnegie the 61.\" rights to redistribute these changes. 
62.\" 63.Dd July 10, 2001 64.Dt RAIDCTL 8 65.Os 66.Sh NAME 67.Nm raidctl 68.Nd configuration utility for the RAIDframe disk driver 69.Sh SYNOPSIS 70.Nm "" 71.Op Fl v 72.Fl a Ar component Ar dev 73.Nm "" 74.Op Fl v 75.Fl A Op yes | no | root 76.Ar dev 77.Nm "" 78.Op Fl v 79.Fl B Ar dev 80.Nm "" 81.Op Fl v 82.Fl c Ar config_file Ar dev 83.Nm "" 84.Op Fl v 85.Fl C Ar config_file Ar dev 86.Nm "" 87.Op Fl v 88.Fl f Ar component Ar dev 89.Nm "" 90.Op Fl v 91.Fl F Ar component Ar dev 92.Nm "" 93.Op Fl v 94.Fl g Ar component Ar dev 95.Nm "" 96.Op Fl v 97.Fl G Ar dev 98.Nm "" 99.Op Fl v 100.Fl i Ar dev 101.Nm "" 102.Op Fl v 103.Fl I Ar serial_number Ar dev 104.Nm "" 105.Op Fl v 106.Fl p Ar dev 107.Nm "" 108.Op Fl v 109.Fl P Ar dev 110.Nm "" 111.Op Fl v 112.Fl r Ar component Ar dev 113.Nm "" 114.Op Fl v 115.Fl R Ar component Ar dev 116.Nm "" 117.Op Fl v 118.Fl s Ar dev 119.Nm "" 120.Op Fl v 121.Fl S Ar dev 122.Nm "" 123.Op Fl v 124.Fl u Ar dev 125.Sh DESCRIPTION 126.Nm "" 127is the user-land control program for 128.Xr raid 4 , 129the RAIDframe disk device. 130.Nm "" 131is primarily used to dynamically configure and unconfigure RAIDframe disk 132devices. For more information about the RAIDframe disk device, see 133.Xr raid 4 . 134.Pp 135This document assumes the reader has at least rudimentary knowledge of 136RAID and RAID concepts. 137.Pp 138The command-line options for 139.Nm 140are as follows: 141.Bl -tag -width indent 142.It Fl a Ar component Ar dev 143Add 144.Ar component 145as a hot spare for the device 146.Ar dev . 147.It Fl A Ic yes Ar dev 148Make the RAID set auto-configurable. The RAID set will be 149automatically configured at boot 150.Ar before 151the root file system is 152mounted. Note that all components of the set must be of type RAID in the 153disklabel. 154.It Fl A Ic no Ar dev 155Turn off auto-configuration for the RAID set. 156.It Fl A Ic root Ar dev 157Make the RAID set auto-configurable, and also mark the set as being 158eligible to be the root partition. A RAID set configured this way 159will 160.Ar override 161the use of the boot disk as the root device. All components of the 162set must be of type RAID in the disklabel. Note that the kernel being 163booted must currently reside on a non-RAID set. 164.It Fl B Ar dev 165Initiate a copyback of reconstructed data from a spare disk to 166its original disk. This is performed after a component has failed, 167and the failed drive has been reconstructed onto a spare drive. 168.It Fl c Ar config_file Ar dev 169Configure the RAIDframe device 170.Ar dev 171according to the configuration given in 172.Ar config_file . 173A description of the contents of 174.Ar config_file 175is given later. 176.It Fl C Ar config_file Ar dev 177As for 178.Ar -c , 179but forces the configuration to take place. This is required the 180first time a RAID set is configured. 181.It Fl f Ar component Ar dev 182This marks the specified 183.Ar component 184as having failed, but does not initiate a reconstruction of that 185component. 186.It Fl F Ar component Ar dev 187Fails the specified 188.Ar component 189of the device, and immediately begin a reconstruction of the failed 190disk onto an available hot spare. This is one of the mechanisms used to start 191the reconstruction process if a component does have a hardware failure. 192.It Fl g Ar component Ar dev 193Get the component label for the specified component. 194.It Fl G Ar dev 195Generate the configuration of the RAIDframe device in a format suitable for 196use with 197.Nm 198.Fl c 199or 200.Fl C . 
.It Fl i Ar dev
Initialize the RAID device.
In particular, (re-)write the parity on the selected device.
This
.Ar MUST
be done for
.Ar all
RAID sets before the RAID device is labeled and before
file systems are created on the RAID device.
.It Fl I Ar serial_number Ar dev
Initialize the component labels on each component of the device.
.Ar serial_number
is used as one of the keys in determining whether a
particular set of components belong to the same RAID set.
While not strictly enforced, different serial numbers should be used
for different RAID sets.
This step
.Ar MUST
be performed when a new RAID set is created.
.It Fl p Ar dev
Check the status of the parity on the RAID set.
Displays a status message, and returns successfully if the parity is
up-to-date.
.It Fl P Ar dev
Check the status of the parity on the RAID set, and initialize
(re-write) the parity if the parity is not known to be up-to-date.
This is normally used after a system crash (and before a
.Xr fsck 8 )
to ensure the integrity of the parity.
.It Fl r Ar component Ar dev
Remove the spare disk specified by
.Ar component
from the set of available spare components.
.It Fl R Ar component Ar dev
Fail the specified
.Ar component ,
if necessary, and immediately begin a reconstruction back to
.Ar component .
This is useful for reconstructing back onto a component after
it has been replaced following a failure.
.It Fl s Ar dev
Display the status of the RAIDframe device for each of the components
and spares.
.It Fl S Ar dev
Check the status of parity re-writing, component reconstruction, and
component copyback.
The output indicates the amount of progress achieved in each of these
areas.
.It Fl u Ar dev
Unconfigure the RAIDframe device.
.It Fl v
Be more verbose.
For operations such as reconstructions, parity re-writing, and
copybacks, provide a progress indicator.
.El
.Pp
The device used by
.Nm
is specified by
.Ar dev .
.Ar dev
may be either the full name of the device, e.g. /dev/rraid0d
for the i386 architecture or /dev/rraid0c for all others, or simply
raid0 (for /dev/rraid0d).
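.Pp
For example, on an i386 machine the following two commands refer to
the same device:
.Bd -literal -offset indent
raidctl -s raid0
raidctl -s /dev/rraid0d
.Ed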
.Ss Configuration file
The format of the configuration file is complex, and
only an abbreviated treatment is given here.
In the configuration files, a
.Sq #
indicates the beginning of a comment.
.Pp
There are 4 required sections of a configuration file, and 2
optional sections.
Each section begins with a
.Sq START ,
followed by the section name, and the configuration parameters
associated with that section.
The first section is the
.Sq array
section, and it specifies the number of rows, columns, and spare disks
in the RAID set.
For example:
.Bd -literal -offset indent
START array
1 3 0
.Ed
.Pp
indicates an array with 1 row, 3 columns, and 0 spare disks.
Note that although multi-dimensional arrays may be specified, they are
.Ar NOT
supported in the driver.
.Pp
The second section, the
.Sq disks
section, specifies the actual components of the device.
For example:
.Bd -literal -offset indent
START disks
/dev/sd0e
/dev/sd1e
/dev/sd2e
.Ed
.Pp
specifies the three component disks to be used in the RAID device.
If any of the specified drives cannot be found when the RAID device is
configured, then they will be marked as
.Sq failed ,
and the system will operate in degraded mode.
Note that it is
.Ar imperative
that the order of the components in the configuration file does not
change between configurations of a RAID device.
Changing the order of the components will result in data loss if the
set is configured with the
.Fl C
option.
In normal circumstances, the RAID set will not configure if only
.Fl c
is specified, and the components are out-of-order.
.Pp
The next section, which is the
.Sq spare
section, is optional, and, if present, specifies the devices to be
used as
.Sq hot spares
-- devices which are on-line, but are not actively used by the RAID
driver unless one of the main components fails.
A simple
.Sq spare
section might be:
.Bd -literal -offset indent
START spare
/dev/sd3e
.Ed
.Pp
for a configuration with a single spare component.
If no spare drives are to be used in the configuration, then the
.Sq spare
section may be omitted.
.Pp
The next section is the
.Sq layout
section.
This section describes the general layout parameters for the RAID
device, and provides such information as sectors per stripe unit,
stripe units per parity unit, stripe units per reconstruction unit,
and the parity configuration to use.
This section might look like:
.Bd -literal -offset indent
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
.Ed
.Pp
The sectors per stripe unit specifies, in blocks, the interleave
factor; i.e. the number of contiguous sectors to be written to each
component for a single stripe.
Appropriate selection of this value (32 in this example) is the
subject of much research in RAID architectures.
The stripe units per parity unit and stripe units per reconstruction
unit are normally each set to 1.
While certain values above 1 are permitted, a discussion of valid
values and the consequences of using anything other than 1 are outside
the scope of this document.
The last value in this section (5 in this example) indicates the
parity configuration desired.
Valid entries include:
.Bl -tag -width inde
.It 0
RAID level 0.
No parity, only simple striping.
.It 1
RAID level 1.
Mirroring.
The parity is the mirror.
.It 4
RAID level 4.
Striping across components, with parity stored on the last component.
.It 5
RAID level 5.
Striping across components, parity distributed across all components.
.El
.Pp
There are other valid entries here, including those for Even-Odd
parity, RAID level 5 with rotated sparing, Chained declustering,
and Interleaved declustering, but as of this writing the code for
those parity operations has not been tested with
.Nx .
.Pp
The next required section is the
.Sq queue
section.
This is most often specified as:
.Bd -literal -offset indent
START queue
fifo 100
.Ed
.Pp
where the queuing method is specified as fifo (first-in, first-out),
and the size of the per-component queue is limited to 100 requests.
Other queuing methods may also be specified, but a discussion of them
is beyond the scope of this document.
.Pp
The final section, the
.Sq debug
section, is optional.
For more details on this the reader is referred to the RAIDframe
documentation discussed in the
.Sx HISTORY
section.
.Pp
See
.Sx EXAMPLES
for a more complete configuration file example.
.Sh FILES
.Bl -tag -width /dev/XXrXraidX -compact
.It Pa /dev/{,r}raid*
.Cm raid
device special files.
.El
.Sh EXAMPLES
It is highly recommended that, before using the RAID driver for real
file systems, the system administrator(s) become quite familiar with
the use of
.Nm "" ,
and that they understand how the component reconstruction process
works.
The examples in this section will focus on configuring a number of
different RAID sets of varying degrees of redundancy.
By working through these examples, administrators should be able to
develop a good feel for how to configure a RAID set, and how to
initiate reconstruction of failed components.
.Pp
In the following examples
.Sq raid0
will be used to denote the RAID device.
Depending on the architecture,
.Sq /dev/rraid0c
or
.Sq /dev/rraid0d
may be used in place of
.Sq raid0 .
.Ss Initialization and Configuration
The initial step in configuring a RAID set is to identify the
components that will be used in the RAID set.
All components should be the same size.
Each component should have a disklabel type of
.Dv FS_RAID ,
and a typical disklabel entry for a RAID component might look like:
.Bd -literal -offset indent
f:  1800000  200495     RAID              # (Cyl.  405*- 4041*)
.Ed
.Pp
While
.Dv FS_BSDFFS
will also work as the component type, the type
.Dv FS_RAID
is preferred for RAIDframe use, as it is required for features such as
auto-configuration.
As part of the initial configuration of each RAID set, each component
will be given a
.Sq component label .
A
.Sq component label
contains important information about the component, including a
user-specified serial number, the row and column of that component in
the RAID set, the redundancy level of the RAID set, a
'modification counter', and whether the parity information (if any) on
that component is known to be correct.
Component labels are an integral part of the RAID set, since they are
used to ensure that components are configured in the correct order,
and used to keep track of other vital information about the RAID set.
Component labels are also required for the auto-detection and
auto-configuration of RAID sets at boot time.
For a component label to be considered valid, that particular
component label must be in agreement with the other component labels
in the set.
For example, the serial number,
.Sq modification counter ,
number of rows and number of columns must all be in agreement.
If any of these are different, then the component is not considered to
be part of the set.
See
.Xr raid 4
for more information about component labels.
.Pp
Once the components have been identified, and the disks have
appropriate labels,
.Nm ""
is then used to configure the
.Xr raid 4
device.
To configure the device, a configuration file which looks something
like:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 3 1

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e

START spare
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
32 1 1 5

START queue
fifo 100
.Ed
.Pp
is created in a file.
The above configuration file specifies a RAID 5 set consisting of the
components /dev/sd1e, /dev/sd2e, and /dev/sd3e, with /dev/sd4e
available as a
.Sq hot spare
in case one of the three main drives should fail.
A RAID 0 set would be specified in a similar way:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/sd10e
/dev/sd11e
/dev/sd12e
/dev/sd13e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
64 1 1 0

START queue
fifo 100
.Ed
.Pp
In this case, devices /dev/sd10e, /dev/sd11e, /dev/sd12e, and
/dev/sd13e are the components that make up this RAID set.
Note that there are no hot spares for a RAID 0 set, since there is no
way to recover data if any of the components fail.
.Pp
For a RAID 1 (mirror) set, the following configuration might be used:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd20e
/dev/sd21e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
In this case, /dev/sd20e and /dev/sd21e are the two components of the
mirror set.
While no hot spares have been specified in this configuration, they
easily could be, just as they were specified in the RAID 5 case above.
Note as well that RAID 1 sets are currently limited to only 2
components.
At present, n-way mirroring is not possible.
.Pp
The first time a RAID set is configured, the
.Fl C
option must be used:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
where
.Sq raid0.conf
is the name of the RAID configuration file.
The
.Fl C
option forces the configuration to succeed, even if any of the
component labels are incorrect.
This option should not be used lightly in situations other than
initial configurations: if the system is refusing to configure a RAID
set, there is probably a very good reason for it.
After the initial configuration is done (and appropriate component
labels are added with the
.Fl I
option) then raid0 can be configured normally with:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
When the RAID set is configured for the first time, it is necessary to
initialize the component labels, and to initialize the parity on the
RAID set.
Initializing the component labels is done with:
.Bd -literal -offset indent
raidctl -I 112341 raid0
.Ed
.Pp
where
.Sq 112341
is a user-specified serial number for the RAID set.
This initialization step is
.Ar required
for all RAID sets.
As well, using different serial numbers between RAID sets is
.Ar strongly encouraged ,
as using the same serial number for all RAID sets will only serve to
decrease the usefulness of the component label checking.
.Pp
Initializing the RAID set is done via the
.Fl i
option.
This initialization
.Ar MUST
be done for
.Ar all
RAID sets, since among other things it verifies that the parity (if
any) on the RAID set is correct.
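For example:
.Bd -literal -offset indent
raidctl -i raid0
.Ed
.Pp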
Since this initialization may be quite time-consuming, the
.Fl v
option may also be used in conjunction with
.Fl i :
.Bd -literal -offset indent
raidctl -iv raid0
.Ed
.Pp
This will give more verbose output on the status of the
initialization:
.Bd -literal -offset indent
Initiating re-write of parity
Parity Re-write status:
 10% |****                                   | ETA:    06:03 /
.Ed
.Pp
The output provides a
.Sq Percent Complete
in both a numeric and graphical format, as well as an estimated time
to completion of the operation.
.Pp
Since it is the parity that provides the
.Sq redundancy
part of RAID, it is critical that the parity be kept correct.
If the parity is not correct, then there is no guarantee that data
will not be lost if a component fails.
.Pp
Once the parity is known to be correct, it is then safe to perform
.Xr disklabel 8 ,
.Xr newfs 8 ,
or
.Xr fsck 8
on the device or its file systems, and then to mount the file systems
for use.
.Pp
Under certain circumstances (e.g. the additional component has not
arrived, or data is being migrated off of a disk destined to become a
component) it may be desirable to configure a RAID 1 set with only
a single component.
This can be achieved by configuring the set with a physically existing
component (as either the first or second component) and with a
.Sq fake
component.
In the following:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/sd6e
/dev/sd0e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1

START queue
fifo 100
.Ed
.Pp
/dev/sd0e is the real component, and will be the second disk of a
RAID 1 set.
The component /dev/sd6e, which must exist but have no physical device
associated with it, is simply used as a placeholder.
Configuration (using
.Fl C
and
.Fl I Ar 12345
as above) proceeds normally, but initialization of the RAID set will
have to wait until all physical components are present.
After configuration, this set can be used normally, but will be
operating in degraded mode.
Once a second physical component is obtained, it can be hot-added, the
existing data mirrored, and normal operation resumed.
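.Pp
A sketch of those final steps, assuming the new disk appears as
/dev/sd1e (a hypothetical name), might be:
.Bd -literal -offset indent
raidctl -a /dev/sd1e raid0
raidctl -F /dev/sd6e raid0
.Ed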
.Ss Maintenance of the RAID set
After the parity has been initialized for the first time, the command:
.Bd -literal -offset indent
raidctl -p raid0
.Ed
.Pp
can be used to check the current status of the parity.
To check the parity and rebuild it if necessary (for example, after an
unclean shutdown) the command:
.Bd -literal -offset indent
raidctl -P raid0
.Ed
.Pp
is used.
Note that re-writing the parity can be done while other operations on
the RAID set are taking place (e.g. while doing a
.Xr fsck 8
on a file system on the RAID set).
However, for maximum effectiveness of the RAID set, the parity should
be known to be correct before any data on the set is modified.
.Pp
To see how the RAID set is doing, the following command can be used to
show the RAID set's status:
.Bd -literal -offset indent
raidctl -s raid0
.Ed
.Pp
The output will look something like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
Component label for /dev/sd1e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd2e:
   Row: 0 Column: 1 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Component label for /dev/sd3e:
   Row: 0 Column: 2 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that all is well with the RAID set.
Of importance here are the component lines which read
.Sq optimal ,
and the
.Sq Parity status
line which indicates that the parity is up-to-date.
Note that if there are file systems open on the RAID set, the
individual components will not be
.Sq clean
but the set as a whole can still be clean.
.Pp
To check the component label of /dev/sd1e, the following is used:
.Bd -literal -offset indent
raidctl -g /dev/sd1e raid0
.Ed
.Pp
The output of this command will look something like:
.Bd -literal -offset indent
Component label for /dev/sd1e:
   Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
   Version: 2 Serial Number: 13432 Mod Counter: 65
   Clean: No Status: 0
   sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
   RAID Level: 5  blocksize: 512 numBlocks: 1799936
   Autoconfig: No
   Last configured as: raid0
.Ed
.Ss Dealing with Component Failures
If for some reason (perhaps to test reconstruction) it is necessary to
pretend a drive has failed, the following will perform that function:
.Bd -literal -offset indent
raidctl -f /dev/sd2e raid0
.Ed
.Pp
The system will then be performing all operations in degraded mode,
where missing data is re-computed from existing data and the parity.
In this case, obtaining the status of raid0 will return (in part):
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
Note that with the use of
.Fl f
a reconstruction has not been started.
To both fail the disk and start a reconstruction, the
.Fl F
option must be used:
.Bd -literal -offset indent
raidctl -F /dev/sd2e raid0
.Ed
.Pp
The
.Fl f
option may be used first, and then the
.Fl F
option used later, on the same disk, if desired.
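.Pp
For example:
.Bd -literal -offset indent
raidctl -f /dev/sd2e raid0
\&...
raidctl -F /dev/sd2e raid0
.Ed
.Pp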
Immediately after the reconstruction is started, the status will
report:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: reconstructing
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 10% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
This indicates that a reconstruction is in progress.
To find out how the reconstruction is progressing the
.Fl S
option may be used.
This will indicate the progress in terms of the percentage of the
reconstruction that is completed.
When the reconstruction is finished the
.Fl s
option will show:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: spared
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
.Ed
.Pp
At this point there are at least two options.
First, if /dev/sd2e is known to be good (i.e. the failure was either
caused by
.Fl f
or
.Fl F ,
or the failed disk was replaced), then a copyback of the data can be
initiated with the
.Fl B
option.
In this example, this would copy the entire contents of /dev/sd4e to
/dev/sd2e.
Once the copyback procedure is complete, the status of the device
would be (in part):
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
and the system is back to normal operation.
.Pp
The second option after the reconstruction is to simply use /dev/sd4e
in place of /dev/sd2e in the configuration file.
For example, the configuration file (in part) might now look like:
.Bd -literal -offset indent
START array
1 3 0

START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
.Ed
.Pp
This can be done as /dev/sd4e is completely interchangeable with
/dev/sd2e at this point.
Note that extreme care must be taken when changing the order of the
drives in a configuration.
This is one of the few instances where the devices and/or their
orderings can be changed without loss of data!
In general, the ordering of components in a configuration file should
.Ar never
be changed.
.Pp
If a component fails and there are no hot spares available on-line,
the status of the RAID set might (in part) look like:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In this case there are a number of options.
The first option is to add a hot spare using:
.Bd -literal -offset indent
raidctl -a /dev/sd4e raid0
.Ed
.Pp
After the hot add, the status would then be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: failed
           /dev/sd3e: optimal
Spares:
           /dev/sd4e: spare
.Ed
.Pp
Reconstruction could then take place using
.Fl F
as described above.
.Pp
A second option is to rebuild directly onto /dev/sd2e.
Once the disk containing /dev/sd2e has been replaced, one can simply
use:
.Bd -literal -offset indent
raidctl -R /dev/sd2e raid0
.Ed
.Pp
to rebuild the /dev/sd2e component.
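The progress of this rebuild can be watched with:
.Bd -literal -offset indent
raidctl -S raid0
.Ed
.Pp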
As the rebuilding is in progress, the status will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: reconstructing
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
and when completed, will be:
.Bd -literal -offset indent
Components:
           /dev/sd1e: optimal
           /dev/sd2e: optimal
           /dev/sd3e: optimal
No spares.
.Ed
.Pp
In circumstances where a particular component is completely
unavailable after a reboot, a special component name will be used to
indicate the missing component.
For example:
.Bd -literal -offset indent
Components:
           /dev/sd2e: optimal
          component1: failed
No spares.
.Ed
.Pp
indicates that the second component of this RAID set was not detected
at all by the auto-configuration code.
The name
.Sq component1
can be used anywhere a normal component name would be used.
For example, to add a hot spare to the above set, and rebuild to that
hot spare, the following could be done:
.Bd -literal -offset indent
raidctl -a /dev/sd3e raid0
raidctl -F component1 raid0
.Ed
.Pp
at which point the data missing from
.Sq component1
would be reconstructed onto /dev/sd3e.
.Pp
When more than one component is marked as
.Sq failed
due to a non-component hardware failure (e.g. loss of power to two
components, adapter problems, termination problems, or cabling issues)
it is quite possible to recover the data on the RAID set.
The first thing to be aware of is that the first disk to fail will
almost certainly be out-of-sync with the remainder of the array.
If any IO was performed between the time the first component is
considered
.Sq failed
and when the second component is considered
.Sq failed ,
then the first component to fail will
.Ar not
contain correct data, and should be ignored.
When the second component is marked as failed, however, the RAID
device will (currently) panic the system.
At this point the data on the RAID set (not including the first failed
component) is still self-consistent, and will be in no worse state of
repair than had the power gone out in the middle of a write to a file
system on a non-RAID device.
The problem, however, is that the component labels may now have 3
different 'modification counters' (one value on the first component
that failed, one value on the second component that failed, and a
third value on the remaining components).
In such a situation, the RAID set will not autoconfigure, and can only
be forcibly re-configured with the
.Fl C
option.
To recover the RAID set, one must first remedy whatever physical
problem caused the multiple-component failure.
After that is done, the RAID set can be restored by forcibly
configuring the RAID set
.Ar without
the component that failed first.
For example, if /dev/sd1e and /dev/sd2e fail (in that order) in a RAID
set of the following configuration:
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
then the following configuration (say "recover_raid0.conf")
.Bd -literal -offset indent
START array
1 4 0

START disks
/dev/sd6e
/dev/sd2e
/dev/sd3e
/dev/sd4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5

START queue
fifo 100
.Ed
.Pp
(where /dev/sd6e has no physical device) can be used with
.Bd -literal -offset indent
raidctl -C recover_raid0.conf raid0
.Ed
.Pp
to force the configuration of raid0.
A
.Bd -literal -offset indent
raidctl -I 12345 raid0
.Ed
.Pp
will be required in order to synchronize the component labels.
At this point the file systems on the RAID set can then be checked and
corrected.
To complete the reconstruction of the RAID set, /dev/sd1e is simply
hot-added back into the array, and reconstructed as described earlier.
.Ss RAID on RAID
RAID sets can be layered to create more complex and much larger RAID
sets.
A RAID 0 set, for example, could be constructed from four RAID 5 sets.
The following configuration file shows such a setup:
.Bd -literal -offset indent
START array
# numRow numCol numSpare
1 4 0

START disks
/dev/raid1e
/dev/raid2e
/dev/raid3e
/dev/raid4e

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 0

START queue
fifo 100
.Ed
.Pp
A similar configuration file might be used for a RAID 0 set
constructed from components on RAID 1 sets.
In such a configuration, the mirroring provides a high degree of
redundancy, while the striping provides additional speed benefits.
.Ss Auto-configuration and Root on RAID
RAID sets can also be auto-configured at boot.
To make a set auto-configurable, simply prepare the RAID set as above,
and then do a:
.Bd -literal -offset indent
raidctl -A yes raid0
.Ed
.Pp
to turn on auto-configuration for that set.
To turn off auto-configuration, use:
.Bd -literal -offset indent
raidctl -A no raid0
.Ed
.Pp
RAID sets which are auto-configurable will be configured before the
root file system is mounted.
These RAID sets are thus available for use as a root file system, or
for any other file system.
A primary advantage of using the auto-configuration is that RAID
components become more independent of the disks they reside on.
For example, SCSI IDs can change, but auto-configured sets will always
be configured correctly, even if the SCSI IDs of the component disks
have become scrambled.
.Pp
Having a system's root file system
.Pq Pa /
on a RAID set is also allowed, with the
.Sq a
partition of such a RAID set being used for
.Pa / .
To use raid0a as the root file system, simply use:
.Bd -literal -offset indent
raidctl -A root raid0
.Ed
.Pp
To return raid0a to being just an auto-configuring set, simply use the
.Fl A Ar yes
arguments.
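.Pp
The result can be confirmed by examining the component labels, e.g.:
.Bd -literal -offset indent
raidctl -s raid0
.Ed
.Pp
where the
.Sq Autoconfig:
field of each component label should now read
.Sq Yes .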
.Pp
Note that kernels can only be directly read from RAID 1 components on
alpha and pmax architectures.
On those architectures, the
.Dv FS_RAID
file system is recognized by the bootblocks, and will properly load
the kernel directly from a RAID 1 component.
For other architectures, or to support the root file system on other
RAID sets, some other mechanism must be used to get a kernel booting.
For example, a small partition containing only the secondary
boot-blocks and an alternate kernel (or two) could be used.
Once a kernel is booting, however, and an auto-configuring RAID set is
found that is eligible to be root, then that RAID set will be
auto-configured and used as the root device.
If two or more RAID sets claim to be root devices, then the user will
be prompted to select the root device.
At this time, RAID 0, 1, 4, and 5 sets are all supported as root
devices.
.Pp
A typical RAID 1 setup with root on RAID might be as follows:
.Bl -enum
.It
wd0a - a small partition, which contains a complete, bootable, basic
.Nx
installation.
.It
wd1a - also contains a complete, bootable, basic
.Nx
installation.
.It
wd0e and wd1e - a RAID 1 set, raid0, used for the root file system.
.It
wd0f and wd1f - a RAID 1 set, raid1, which will be used only for
swap space.
.It
wd0g and wd1g - a RAID 1 set, raid2, used for
.Pa /usr ,
.Pa /home ,
or other data, if desired.
.It
wd0h and wd1h - a RAID 1 set, raid3, if desired.
.El
.Pp
RAID sets raid0, raid1, and raid2 are all marked as auto-configurable.
raid0 is marked as being a root file system.
When new kernels are installed, the kernel is not only copied to
.Pa / ,
but also to wd0a and wd1a.
The kernel on wd0a is required, since that is the kernel the system
boots from.
The kernel on wd1a is also required, since that will be the kernel
used should wd0 fail.
The important point here is to have redundant copies of the kernel
available, in the event that one of the drives fails.
.Pp
There is no requirement that the root file system be on the same disk
as the kernel.
For example, obtaining the kernel from wd0a, and using sd0e and sd1e
for raid0, and the root file system, is fine.
It
.Ar is
critical, however, that there be multiple kernels available, in the
event of media failure.
.Pp
Multi-layered RAID devices (such as a RAID 0 set made up of RAID 1
sets) are
.Ar not
supported as root devices or auto-configurable devices at this point.
(Multi-layered RAID devices
.Ar are
supported in general, however, as mentioned earlier.)
Note that in order to enable component auto-detection and
auto-configuration of RAID devices, the line:
.Bd -literal -offset indent
options RAID_AUTOCONFIG
.Ed
.Pp
must be in the kernel configuration file.
See
.Xr raid 4
for more details.
.Ss Unconfiguration
The final operation performed by
.Nm
is to unconfigure a
.Xr raid 4
device.
This is accomplished via a simple:
.Bd -literal -offset indent
raidctl -u raid0
.Ed
.Pp
at which point the device is ready to be reconfigured.
.Ss Performance Tuning
Selection of the various parameter values which result in the best
performance can be quite tricky, and often requires a bit of
trial-and-error to get those values most appropriate for a given
system.
A whole range of factors come into play, including:
.Bl -enum
.It
Types of components (e.g. SCSI vs. IDE) and their bandwidth
.It
Types of controller cards and their bandwidth
.It
Distribution of components among controllers
.It
IO bandwidth
.It
file system access patterns
.It
CPU speed
.El
.Pp
As with most performance tuning, benchmarking under real-life loads
may be the only way to measure expected performance.
Understanding some of the underlying technology is also useful in
tuning.
The goal of this section is to provide pointers to those parameters
which may make significant differences in performance.
.Pp
For a RAID 1 set, a SectPerSU value of 64 or 128 is typically
sufficient.
Since data in a RAID 1 set is arranged in a linear fashion on each
component, selecting an appropriate stripe size is somewhat less
critical than it is for a RAID 5 set.
However, a stripe size that is too small will cause large IOs to be
broken up into a number of smaller ones, hurting performance.
At the same time, a large stripe size may cause problems with
concurrent accesses to stripes, which may also affect performance.
Thus values in the range of 32 to 128 are often the most effective.
.Pp
Tuning RAID 5 sets is trickier.
In the best case, IO is presented to the RAID set one stripe at a
time.
Since the entire stripe is available at the beginning of the IO, the
parity of that stripe can be calculated before the stripe is written,
and then the stripe data and parity can be written in parallel.
When the amount of data being written is less than a full stripe
worth, the
.Sq small write
problem occurs.
Since a
.Sq small write
means only a portion of the stripe on the components is going to
change, the data (and parity) on the components must be updated
slightly differently.
First, the
.Sq old parity
and
.Sq old data
must be read from the components.
Then the new parity is constructed, using the new data to be written,
and the old data and old parity.
Finally, the new data and new parity are written.
All this extra data shuffling results in a serious loss of
performance, and is typically 2 to 4 times slower than a full stripe
write (or read).
To combat this problem in the real world, it may be useful to ensure
that stripe sizes are small enough that a
.Sq large IO
from the system will use exactly one large stripe write.
As is seen later, there are some file system dependencies which may
come into play here as well.
.Pp
Since the size of a
.Sq large IO
is often (currently) only 32K or 64K, on a 5-drive RAID 5 set it may
be desirable to select a SectPerSU value of 16 blocks (8K) or 32
blocks (16K).
Since there are then 4 data stripe units per stripe, the maximum data
per stripe is 64 blocks (32K) or 128 blocks (64K).
Again, empirical measurement will provide the best indicators of which
values will yield better performance.
.Pp
The parameters used for the file system are also critical to good
performance.
For
.Xr newfs 8 ,
for example, increasing the block size to 32K or 64K may improve
performance dramatically.
As well, changing the cylinders-per-group parameter from 16 to 32 or
higher is often not only necessary for larger file systems, but may
also have positive performance implications.
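For example, a file system with a 64K block size and 32
cylinders-per-group might be created with something like:
.Bd -literal -offset indent
newfs -b 65536 -c 32 /dev/rraid0e
.Ed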
.Ss Summary
Despite the length of this man-page, configuring a RAID set is a
relatively straightforward process.
All that needs to be done is the following steps:
.Bl -enum
.It
Use
.Xr disklabel 8
to create the components (of type RAID).
.It
Construct a RAID configuration file: e.g.
.Sq raid0.conf
.It
Configure the RAID set with:
.Bd -literal -offset indent
raidctl -C raid0.conf raid0
.Ed
.Pp
.It
Initialize the component labels with:
.Bd -literal -offset indent
raidctl -I 123456 raid0
.Ed
.Pp
.It
Initialize other important parts of the set with:
.Bd -literal -offset indent
raidctl -i raid0
.Ed
.Pp
.It
Get the default label for the RAID set:
.Bd -literal -offset indent
disklabel raid0 \*[Gt] /tmp/label
.Ed
.Pp
.It
Edit the label:
.Bd -literal -offset indent
vi /tmp/label
.Ed
.Pp
.It
Put the new label on the RAID set:
.Bd -literal -offset indent
disklabel -R -r raid0 /tmp/label
.Ed
.Pp
.It
Create the file system:
.Bd -literal -offset indent
newfs /dev/rraid0e
.Ed
.Pp
.It
Mount the file system:
.Bd -literal -offset indent
mount /dev/raid0e /mnt
.Ed
.Pp
.It
Use:
.Bd -literal -offset indent
raidctl -c raid0.conf raid0
.Ed
.Pp
to re-configure the RAID set the next time it is needed, or put
raid0.conf into /etc where the set will automatically be configured by
the /etc/rc scripts.
.El
.Sh SEE ALSO
.Xr ccd 4 ,
.Xr raid 4 ,
.Xr rc 8
.Sh HISTORY
RAIDframe is a framework for rapid prototyping of RAID structures
developed by the folks at the Parallel Data Laboratory at Carnegie
Mellon University (CMU).
A more complete description of the internals and functionality of
RAIDframe is found in the paper "RAIDframe: A Rapid Prototyping Tool
for RAID Systems", by William V. Courtright II, Garth Gibson, Mark
Holland, LeAnn Neal Reilly, and Jim Zelenka, and published by the
Parallel Data Laboratory of Carnegie Mellon University.
.Pp
The
.Nm
command first appeared as a program in CMU's RAIDframe v1.1
distribution.
This version of
.Nm
is a complete re-write, and first appeared in
.Nx 1.4 .
.Sh COPYRIGHT
.Bd -literal
The RAIDframe Copyright is as follows:

Copyright (c) 1994-1996 Carnegie-Mellon University.
All rights reserved.

Permission to use, copy, modify and distribute this software and
its documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.

CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
CONDITION.  CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.

Carnegie Mellon requests users of this software to return to

 Software Distribution Coordinator  or  Software.Distribution@CS.CMU.EDU
 School of Computer Science
 Carnegie Mellon University
 Pittsburgh PA 15213-3890

any improvements or extensions that they make and grant Carnegie the
rights to redistribute these changes.
.Ed
.Sh WARNINGS
Certain RAID levels (1, 4, 5, 6, and others) can protect against some
data loss due to component failure.
However, the loss of two components of a RAID 4 or 5 system, or the
loss of a single component of a RAID 0 system, will result in the
entire file system being lost.
RAID is
.Ar NOT
a substitute for good backup practices.
.Pp
Recomputation of parity
.Ar MUST
be performed whenever there is a chance that it may have been
compromised.
This includes after system crashes, or before a RAID device has been
used for the first time.
Failure to keep parity correct will be catastrophic should a component
ever fail -- it is better to use RAID 0 and get the additional space
and speed, than it is to use parity, but not keep the parity correct.
At least with RAID 0 there is no perception of increased data
security.
.Sh BUGS
Hot-spare removal is currently not available.