RAID: Can it Fail? If it Does, is Data Recovery Possible?

Originally, as envisaged in 1987 by Patterson, Gibson and Katz from the University of California in Berkeley, the acronym RAID stood for a "Redundant Array of Inexpensive Disks". In short, a larger number of smaller, cheaper disks could be used in place of a single, much more expensive large hard disk, or even to create a disk that was larger than any currently available.

They went a stage further and postulated a variety of options that would not only result in getting a big disk for a lower cost, but could improve performance or increase reliability at the same time. The options for improved reliability were required partly because using multiple disks reduces the Mean-Time-Between-Failure (MTBF): divide the MTBF for a drive in the array by the number of drives and, theoretically, a RAID will fail more quickly than a single disk.

Today RAID is usually described as a "Redundant Array of Independent Disks"; technology has moved on and even the most costly disks are not particularly expensive.

Six levels of RAID are commonly described, some geared towards performance, others to improved fault tolerance, though the first of these (RAID 0, which was not among the levels in the original paper) does not have any redundancy or fault tolerance and so might not truly be considered RAID.

RAID 0 - Striped and not really "RAID"
RAID 0 provides capacity and speed but not redundancy. Data is striped across the drives with all of the benefits that gives, but if one drive fails the RAID is dead, just as if a single hard disk drive had failed.
This is good for transient storage where performance matters but the data is either non-critical or a copy is also kept elsewhere. Other RAID levels are more suited to critical systems where backups might not be up-to-the-minute, or where down-time is undesirable.

RAID 1 - Mirroring
RAID 1 is often used for the boot devices in servers or for critical data where reliability requirements are paramount. Usually 2 hard disk drives are used and any data written to one disk is also written to the other.
In the event of a failure of one drive the system can switch to single drive operation; the failed drive is then replaced and the data copied to the replacement drive to rebuild the mirror.

RAID 2
RAID 2 introduced error correction code generation to compensate for drives that did not have their own error detection. There are no such drives now, and have not been for a long time. RAID 2 is not really used anywhere.

RAID 3 - Dedicated Parity
RAID 3 uses striping, down to the byte level. This adds a hardware overhead for no apparent benefit. It also introduces "parity" or error correction data on a separate drive, so an additional hard disk is needed that gives greater security but no additional space.

RAID 4 - Dedicated Parity
RAID 4 stripes at the block level and, like RAID 3, stores parity information on a dedicated drive.

RAID 5 - The most common format
RAID 5 stripes at the block level but does not use a single dedicated drive for storing parity. Instead, parity is interspersed within the data, so after each run of data strips there is a strip of parity data, and the disk holding the parity changes for the next stripe.
This could mean, for example, that in a 3 disk RAID 5 there are data strips on disks 0 and 1 followed by a parity strip on disk 2. For the next stripe the data is on disks 0 and 2 with the parity on disk 1, then data on disks 1 and 2 with parity on disk 0.
RAID 5 is generally faster for smaller reads, so it is eminently suitable for server systems shared by large numbers of users creating smaller data files or accessing smaller amounts of data each time. For other applications, however, RAID 4 will outperform RAID 5 quite considerably.

Beyond RAID 5?
Advances on RAID 5 do exist, though in general these use RAID 5 techniques and enhance them, for example by mirroring two RAID 5 arrays, or by having 2 parity strips per stripe.

RAID data recovery
It might be imagined that with all of this fault tolerance data recovery would not be a requirement, but things will still go wrong.
With all RAID levels logical corruption (damage to the file system) has just as devastating an effect as with a single hard disk. You might have a robustly stored file system, but it is a robustly stored and corrupted file system.
With RAID 0 the failure of one disk is terminal for the RAID. If data cannot be recovered from the failed disk then a percentage of the data is lost for good and, since RAID 0 uses data striping, this could be like losing 1 MB of data out of every 4 MB; the chances of that leaving any major files intact are low. Of the smaller files, those smaller than the combined size of one strip from each working drive, some will fortunately be intact. For larger files (e.g. Exchange or SQL databases) there will be considerable data loss and structural damage, and low level work will be required to salvage any useful data from them.
For RAID levels where there is parity and the chance to recover from a single disk failure, the most common problems we see are:

Degraded running
A single disk fails and is ignored, or there is not a spare available and so one is ordered. Either way the RAID unit stays in operation but with a disk missing, so there is no longer any redundancy.
Usually the hard disks in a RAID are part of the same manufacturing batch and have been stored and run in the same environment; if the unit has been mis-handled then each disk in the RAID has been mis-handled. So there is quite a good chance that another drive will fail sometime soon, if not for any of the reasons just given then simply because bad things don't happen singly.

Multiple failure
Striped RAID with parity is fault tolerant if a single drive fails cleanly. If multiple drives fail then the RAID is lost, but the same is true if one drive fails and de-stabilises the SCSI bus. This can result in multiple drives appearing to fail; the RAID unit believes that they have failed, and so the RAID will not operate.

Configuration loss
When a RAID is configured, information is stored about the order of the disks, the size of a strip of data and so on. If there is a failure within the RAID controller and this information is lost then the RAID will not operate, and it is not always practicable to re-instate it.
Some RAID controllers will treat re-programming the RAID configuration as a rebuild request and re-write to each of the disks, destroying the data.

People making it worse
One of the worst sounds we hear with RAID problems is that of human panic, and frantic attempts to repair the problem. "We're just going to try one more thing" is often the sound that signals the end of the data, as a RAID is repaired with the disks in the wrong slots, or is rebuilt and set back to its original state.

What to do when a RAID fails
STOP
THINK
Make sure that anything you do is going to be non-destructive.
Get Advice
Do not let anyone push you into precipitous action. They might have a deadline and be applying pressure, but they will quickly forget their part in driving proceedings when the RAID is fatally damaged by a hurried repair attempt.

How can data be recovered from a RAID?
Much of RAID recovery is the same as for a single disk recovery: data must be secured and backed up to guarantee that the problem will not be exacerbated. For logical problems the difficult work is all in the analysis of the file system; that it is from a RAID makes no major difference once the RAID scheme has been identified and the correct access to it worked out.
For mirrored RAID, data can be "mixed and matched" from the good sectors of two drives to rebuild a good drive. With striped RAID schemes that use parity, data can be rebuilt at the stripe level rather than on a per drive basis, so if there are bad sectors throughout more than one drive these can be corrected individually.
With non-redundant RAID schemes each sector that cannot be read from a disk results in data loss from the RAID set. For redundant RAID schemes, however, there is much that can be done to rebuild when data is missing. Whilst a RAID controller will take a disk off-line when it fails and operate in degraded mode, rebuilding the data from the missing disk on demand, a data recovery process can be somewhat more sophisticated. With properly written recovery software the level of granularity can be one sector rather than one disk, so for each sector that fails the data can be rebuilt so long as the corresponding sectors can be recovered from the remainder of the disks. Even if the next failed sector is on a different drive in the set, so long as the same sector can be read from the other disks a complete rebuild can be made.
For levels of RAID that have greater redundancy, the number of failed sectors across a set of disks can be even greater without data loss.
Even as data recovery specialists we are, however, still bound by the rules of mathematics. If sector 99 is missing from both disks 0 and 4 in a RAID 5 set then rebuilding the missing data is not a possibility.
Once the RAID/disk issues have been resolved the data recovery process can continue just as it would for a single disk.
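The sector-level rebuild described above can be illustrated with a minimal Python sketch. It assumes single-parity RAID where the parity strip is the byte-wise XOR of the data strips, so any one missing sector in a stripe (data or parity) is the XOR of the sectors that can still be read. The names xor_blocks and rebuild_stripe are hypothetical, chosen for illustration, and are not taken from any particular recovery tool.

```python
from functools import reduce
from typing import List, Optional, Sequence

# One entry per disk for a given sector number; None marks an unreadable sector.
Sector = Optional[bytes]


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(stripe: Sequence[Sector]) -> List[bytes]:
    """Rebuild at most one unreadable sector in a single-parity stripe.

    The stripe holds the same sector number from every disk in the set,
    parity included. Raises ValueError when two or more are unreadable,
    because single parity cannot recover that case.
    """
    missing = [disk for disk, data in enumerate(stripe) if data is None]
    if len(missing) > 1:
        raise ValueError(f"sector unreadable on disks {missing}: cannot rebuild")
    repaired = list(stripe)
    if missing:
        survivors = [data for data in stripe if data is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    # Five-disk set: four data sectors plus their parity (illustrative values).
    data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb", b"\x0f\xf0"]
    full = data + [xor_blocks(data)]

    # A bad sector on disk 1 in one stripe and on disk 3 in another:
    # each stripe has only one hole, so both can be rebuilt.
    for bad_disk in (1, 3):
        damaged = list(full)
        damaged[bad_disk] = None
        assert rebuild_stripe(damaged) == full

    # The same sector missing from disks 0 and 4 cannot be rebuilt.
    damaged = list(full)
    damaged[0] = damaged[4] = None
    try:
        rebuild_stripe(damaged)
    except ValueError as err:
        print(err)
```

Because each stripe is treated independently, bad sectors scattered over several drives are all recoverable provided that no two of them share the same stripe, which is the limit the sector 99 example describes.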