[XCSSA] Re: [SATLUG] OT: SMART Drive Reset
xcssa@xcssa.org
Sat, 26 May 2007 01:44:37 -0500
On Friday 25 May 2007 17:42, scs@worldlinkisp.com wrote:
[...]
> The drive works fine, it just a PITA viewing the
> warning and having to hit F1 every time it boots.
There are two types of SMART failures..
1) The most commonly known type is the passive "threshold" (theoretical) error
that the BIOS tells you about once the drive's parameters (temp, error rate,
rpms, etc) pass certain manufacturer pre-programmed thresholds (sometimes
you can trust them.. sometimes they're worthless).. but it usually indicates
that there's some kind of problem.
2) The less common type is the real-world read/write test that you can manually
tell the drive to run and report back to you on. This test comes in a short and
a long variety. The short test (usually less than 3 minutes) is around 93%
accurate in detecting real (non-false-positive) errors.. and the long around
97% (according to the author of smartmontools).
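For the first (threshold) type, you can look at the same numbers the BIOS is
reacting to. A rough sketch, assuming the usual "smartctl -A" attribute table
where column 4 is VALUE and column 6 is THRESH; the function name
failing_attributes is made up for this example:

```shell
# Read "smartctl -A" output on stdin and print the name of every attribute
# whose normalized VALUE has dropped to or below its THRESH.
# Rows with THRESH 0 are skipped, since they can never trip.
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# Usage (as root):
#   smartctl -A /dev/hda | failing_attributes
```

No output means no attribute is at or below its threshold right now.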
Ignore the BIOS error crap for now. In Linux, to perform a real r/w test, do
this:
As root, turn on SMART on the drive (regardless of BIOS setting):
# smartctl -s on /dev/hda
for SATA drives, like this:
# smartctl -s on /dev/sda -d ata
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce
...
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
Now, issue the short test command and wait 5 minutes:
# smartctl -t short /dev/hda ; sleep 300
for PATA... and for SATA:
# smartctl -t short /dev/sda -d ata ; sleep 300
smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce
...
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in
off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line
mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sat May 26 01:17:05 2007
Use smartctl -X to abort test.
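If you'd rather not guess at the sleep time, you can poll the drive instead.
This is only a sketch: it assumes "smartctl -c" prints the phrase
"Self-test routine in progress" while a test is running, and the function
names and the 30 second interval are my own choices:

```shell
# Succeeds if the "smartctl -c" output on stdin says a self-test is running.
selftest_running() {
    grep -q 'Self-test routine in progress'
}

# Poll the drive until the self-test is no longer reported as in progress.
# $1 is the device; any remaining arguments (e.g. -d ata) go to smartctl.
wait_for_selftest() {
    dev=$1
    shift
    while smartctl -c "$dev" "$@" | selftest_running; do
        sleep 30
    done
}

# Usage (as root):
#   smartctl -t short /dev/sda -d ata
#   wait_for_selftest /dev/sda -d ata
```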
Now read the results (for my SATA):
# smartctl -a /dev/sda -d ata|grep -B1 -A1 ^#
Num Test_Description Status Remaining LifeTime(hours) LBA_of_error
# 1 Short offline Completed 00% 765 -
The Completed and 00% show that it made it through the whole test without
error. If there had been an error, you would have seen something like:
Num Test_Description Status Remaining LifeTime(hours) LBA_of_error
# 1 Short offline Write Error 75% 765 76799128
or something like that..
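If you want to script that check, something like this could work. It is a
sketch only: it assumes the log lines look like the two samples above (newest
test on the "# 1" line, remaining percentage third from the end, LBA last),
and check_selftest is just a name I picked:

```shell
# Read the self-test log on stdin and judge the newest entry (the "# 1" line).
# PASS needs all three: "Completed" in the status, 00% remaining, LBA "-".
# Exits 0 on PASS, 1 on FAIL or if no "# 1" line was found.
check_selftest() {
    awk '
        /^# *1 / {
            found = 1
            if ($0 ~ /Completed/ && $(NF-2) == "00%" && $NF == "-") {
                print "PASS"
            } else {
                print "FAIL LBA=" $NF
                bad = 1
            }
        }
        END { exit ((found && !bad) ? 0 : 1) }
    '
}

# Usage (as root):
#   smartctl -a /dev/sda -d ata | check_selftest
```

Fed the failing sample line above, it prints "FAIL LBA=76799128" and returns
non-zero, so you can hang it off cron and mail yourself on failure.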
The cool thing about this r/w test is that it is non-destructive to the OS
filesystem and can be run online with little to no performance degradation
and little to no danger to system stability. These r/w tests access reserved
blocks on the drive across the full drive surface.. so they can even detect
physical problems all across the platter surface. It's pretty sweet.
Anyway.. if you detect an error like this with a short or long test.. then get
your data off ASAP. If the drive is sick, I would NOT attempt a "cp -a" or
other file-level copy to get the data off.. as slewing the heads a lot can
drive the nails into the coffin on an already sick drive. Likewise.. don't
attempt a plain-jane ghosting of the data off either.. either of these could
kill it.
Either do a straight dd of the data/important partitions off onto another
device (on another channel or bus, such as hdc), or ddrescue (available on
the Helix live distro: http://www.e-fense.com/helix/) the data off. If
you're more familiar with or prefer Ghost, DO NOT use it as you regularly
would (this will kill a sick drive).. but instead invoke it with
the "Forensic mode" switch like this:
A:\ ghost.exe -ir
(image raw)
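For the straight dd route mentioned above, here is a minimal sketch. The
function name image_device, the 4096 byte block size and the paths in the
usage lines are only illustrative; conv=noerror,sync tells dd to keep going
past read errors and pad the unreadable blocks with zeros so offsets in the
image stay lined up:

```shell
# Sequentially copy a device or partition ($1) into an image file ($2).
# noerror: do not stop at read errors; sync: pad every short or failed
# block with zeros (this also rounds the image up to a multiple of bs).
image_device() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync
}

# Usage (as root, writing to a disk on another channel or bus):
#   image_device /dev/hda1 /mnt/backup/hda1.img
#
# If the ddrescue you have is GNU ddrescue (an assumption), the rough
# equivalent is:
#   ddrescue -n /dev/hda1 /mnt/backup/hda1.img /mnt/backup/hda1.map
```

The small block size keeps the zero-filled holes small, at the cost of a
slower copy.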
Anyway.. get the data off the drive to a safe place and then you can futz with
the sick drive. That is.. unless the data is not important to you.. :)
Tweeks