[XCSSA] Re: [SATLUG] OT: SMART Drive Reset

xcssa@xcssa.org xcssa@xcssa.org
Sat, 26 May 2007 01:44:37 -0500


On Friday 25 May 2007 17:42, scs@worldlinkisp.com wrote:
[...]
> The drive works fine, it's just a PITA viewing the
> warning and having to hit F1 every time it boots.

There are two types of SMART failures:
1) The most commonly known type is the passive, theoretical "threshold" error 
that the BIOS tells you about once the drive's parameters (temp, error rate, 
rpms, etc.) pass certain manufacturer pre-programmed thresholds (sometimes 
you can trust them.. sometimes they're worthless).. but it usually indicates 
that there's some kind of problem.
2) The less common type is the real-world read/write self-test that you can 
manually tell the drive to run and report back to you on.  This test comes in 
a short and a long variety.  The short test (usually less than 3 minutes) is 
around 93% accurate in detecting real (non-false-positive) errors.. and the 
long around 97% (according to the author of smartmontools).
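
If you want to see which attribute tripped the type 1 warning, here is a 
rough sketch (not from the original thread).  The function name failing_attrs 
is made up for illustration, and it assumes the usual "smartctl -A" column 
layout (ID#, name, flag, VALUE, WORST, THRESH, ...):

```shell
# failing_attrs: read `smartctl -A` output on stdin and print the name of
# every attribute whose normalized VALUE has dropped to or below THRESH.
# Assumed column layout: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH ...
# Rows with THRESH 000 are skipped, since they can never trip.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# Typical use, as root (the device name is only an example):
#   smartctl -H /dev/hda                  # the drive's overall verdict
#   smartctl -A /dev/hda | failing_attrs  # which attributes tripped it
```

An empty result just means no attribute is at or below its threshold right 
now; it says nothing about what the self-tests will find.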

Ignore the BIOS error crap for now.  In Linux, to perform a real r/w test, do 
this:

As root, turn on SMART on the drive (regardless of BIOS setting):
	# smartctl -s on /dev/hda

for SATA drives, like this:
	# smartctl -s on /dev/sda -d ata
	smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce
	...
	=== START OF ENABLE/DISABLE COMMANDS SECTION ===
	SMART Enabled.

Now, issue the short test command and wait 5 minutes:
	# smartctl -t short /dev/hda ; sleep 300

for PATA... and for SATA:
	# smartctl -t short /dev/sda -d ata ; sleep 300
	smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce 	
	...
	=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
	Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
	Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
	Testing has begun.
	Please wait 2 minutes for test to complete.
	Test will complete after Sat May 26 01:17:05 2007
	
	Use smartctl -X to abort test.
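
Instead of hard-coding the sleep, you could pull the wait time out of the 
"Please wait N minutes" line shown above.  This is only a sketch: the helper 
name wait_minutes is made up, and it assumes the output wording matches the 
5.36 sample here:

```shell
# wait_minutes: read `smartctl -t ...` output on stdin and print the number
# of minutes the drive asked for in its "Please wait N minutes" line.
# Prints nothing if no such line is found.
wait_minutes() {
    sed -n 's/^[[:space:]]*Please wait \([0-9][0-9]*\) minutes.*/\1/p'
}

# Typical use, as root (the device name is only an example):
#   mins=$(smartctl -t short /dev/sda -d ata | wait_minutes)
#   sleep $(( ${mins:-5} * 60 ))   # fall back to 5 minutes if nothing parsed
```

The same approach should work for "-t long", where the wait is much longer 
and guessing a fixed sleep is less practical.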

Now read the results (for my SATA):
	# smartctl -a /dev/sda -d ata|grep -B1 -A1 ^#
	Num  Test_Description    Status    Remaining  LifeTime(hours) LBA_of_error
	# 1  Short offline       Completed    00%       765           -

The Completed and 00% show that it made it through the whole test without 
error.  If there had been an error, you would have seen something like:
	Num  Test_Description    Status    Remaining  LifeTime(hours) LBA_of_error
	# 1  Short offline       Write Error  75%       765           76799128

or something like that..
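
If you'd rather script that check than eyeball it, something along these 
lines might do.  It's a sketch only: last_selftest_ok is a made-up name, and 
it assumes the newest log entry starts with "# 1" and has "-" as its last 
field when there is no error LBA, as in the samples above:

```shell
# last_selftest_ok: read the self-test log on stdin and succeed only if the
# newest entry ("# 1") says Completed, shows 00% remaining, and has no
# LBA_of_error (last field is "-").  Fails if no "# 1" entry is present.
last_selftest_ok() {
    awk '$1 == "#" && $2 == "1" {
             found = 1
             ok = (/Completed/ && /[[:space:]]00%[[:space:]]/ && $NF == "-")
         }
         END { if (found && ok) exit 0; exit 1 }'
}

# Typical use, as root (the device name is only an example):
#   smartctl -l selftest /dev/sda -d ata | last_selftest_ok \
#       && echo "last self-test passed" || echo "last self-test FAILED"
```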

The cool thing about this r/w test is that it is non-destructive to the OS 
filesystem and can be run online with little to no performance degradation 
and little to no danger to system stability.  These r/w tests access reserved 
blocks on the drive across the full drive surface.. so they can even detect 
physical problems all across the platter surface.  It's pretty sweet.

Anyway.. if you detect an error like this with a short or long test.. then get 
your data off ASAP.  If the drive is sick, I would NOT attempt a "cp -a" or 
other file-level copy to get the data off.. as slewing the heads a lot can 
drive the nails into the coffin on an already sick drive.  Likewise.. don't 
attempt a plain-jane ghosting of the data off.. either of these could kill it.

Either do a straight dd of the data/important partitions off onto another 
device (on another channel or bus, such as hdc), or ddrescue (available on 
the Helix live distro: http://www.e-fense.com/helix/) the data off.  If 
you're more familiar with or prefer Ghost, DO NOT use it as you regularly 
would (this will kill a sick drive).. but instead invoke it with 
the "Forensic mode" switch like this:
	A:\> ghost.exe -ir
	(image raw)
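
For the straight-dd route, a minimal sketch might look like the following. 
The function name image_raw, the 64k block size, and the paths are my own 
assumptions, and the ddrescue line assumes GNU ddrescue rather than the 
similarly named dd_rescue:

```shell
# image_raw: copy SRC to DST in one sequential pass.  conv=noerror keeps dd
# going after a read error, and sync pads each short or failed input block
# with zeros so later data stays at the right offset in the image.
image_raw() {
    dd if="$1" of="$2" bs=64k conv=noerror,sync
}

# Typical use, as root (names are only examples; write the image to a
# different disk than the sick one):
#   image_raw /dev/hda1 /mnt/rescue/hda1.img
# With GNU ddrescue, which also keeps a map file so an interrupted run can
# be resumed:
#   ddrescue /dev/hda1 /mnt/rescue/hda1.img /mnt/rescue/hda1.map
```

One tradeoff to keep in mind: with conv=noerror,sync a single bad sector 
costs you the whole block it sits in, so a smaller bs loses less data per 
error but makes the copy slower.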

Anyway.. get the data off the drive to a safe place and then you can futz with 
the sick drive.  That is.. unless the data is not important to you.. :)

Tweeks