SMART (Self-Monitoring, Analysis, and Reporting Technology) is built into most hard drives today. You can use it to check for and test against hard drive failure on running systems. Almost all Linux distributions include the smartmontools package. (I say almost because it's impossible to be familiar with all of them.) Here are some handy commands that take advantage of the reporting and testing features of smartmontools.

Please note that I am using the device /dev/sda in the following examples. This may or may not be the storage device in your system.

Print the overall health of a drive

smartctl -H /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

As you can see, my device currently has a passing grade. This is not a final verdict, however. It simply means that the drive has not failed any previous self-test and has not detected any problems since the last one ran. To tell the drive to dig a little deeper you can use smartctl to run some tests, so let's do that now.
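If you want to check the health result from a script rather than by eye, here is a minimal sketch. The function names are my own invention, and the pattern only matches the ATA wording shown above, so other device types may need a different string. The second helper relies on the smartctl man page, which documents bit 3 (value 8) of the exit status as "DISK FAILING".

```shell
# Hypothetical helper: reads `smartctl -H` output on stdin and
# succeeds only if the overall-health line says PASSED.
smart_health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Hypothetical helper: succeeds if bit 3 (value 8) is set in the
# smartctl exit status passed as the first argument.
smart_status_failing() {
    [ $(( $1 & 8 )) -ne 0 ]
}
```

You could then run something like: smartctl -H /dev/sda | smart_health_passed && echo "drive reports healthy"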

Short test

According to the documentation, this command can be given during normal system operation (unless run in captive mode).

smartctl --test=short /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sat Jul 21 12:27:19 2012

When looking at the response I got a little scared when I saw “off-line mode”, but that simply means that the test runs in the background while the machine keeps functioning normally. You will notice that this test takes around a minute to complete. After that you can use the overall health command from above to get a quick result; it is best to wait until the test has completed.
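To script the waiting, you can pull the estimated duration out of the "Please wait N minutes" line. This is only a sketch: the function name is made up, and it assumes the message is worded exactly as in the output above.

```shell
# Hypothetical helper: reads `smartctl --test=...` output on stdin and
# prints the number of minutes smartctl asks you to wait.
smart_test_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete.*/\1/p'
}
```

For example: mins=$(smartctl --test=short /dev/sda | smart_test_wait_minutes); sleep $(( mins * 60 )); smartctl -H /dev/sda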

Long Test

The long test can also be run on a live system and tests the device much more thoroughly, but it takes significantly longer to finish.

smartctl --test=long /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 86 minutes for test to complete.
Test will complete after Sat Jul 21 14:04:28 2012

86 minutes is a far cry from the short test’s 1 minute, but again this is a much more detailed test.

Getting it all

The next and last command outputs all the SMART information the drive can give. In the response below I have removed a lot of the output because there is a lot of information to go through. My main point is the command itself, plus something I will get to in just a second.

smartctl -a /dev/sda

Device Model:     ###########
Serial Number:    ########
Firmware Version: ####
User Capacity:    ###,###,###,### bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jul 21 13:24:32 2012 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Self-test log structure revision number 1
Num  Test_Description    Status                        Remaining LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 50%       11320            -
# 2  Short offline       Completed without error       00%       11319            -
# 3  Short offline       Completed without error       00%       8721             -
# 4  Short offline       Completed without error       00%       1                -

If you look you can see that the long test is in progress with 50% remaining. Running either the overall health command or the detailed command before a test is finished won’t hurt anything, but it also won’t tell you the result of the currently running test until after it’s finished.
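If you would rather poll the self-test log than read it, the following sketch extracts the Remaining percentage of a running test. The function name is hypothetical and the pattern assumes the log layout shown above. It prints nothing when no test is in progress.

```shell
# Hypothetical helper: reads the self-test log on stdin and prints the
# "Remaining" percentage of the first test that is still in progress.
smart_test_remaining() {
    sed -n 's/^# *[0-9][0-9]* .*Self-test routine in progress  *\([0-9][0-9]*\)%.*/\1/p' | head -n 1
}
```

You can feed it the full output of smartctl -a /dev/sda, or just the log from smartctl -l selftest /dev/sda.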

I don’t have a failing drive, at least according to SMART. That’s great news! The downside of this is that I don’t have output of a failing drive to put here, but a little google-fu can give you some examples of what you don’t want to see as well as what some of the detailed information means.

A few notes about the detailed output

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 114   099   006    Pre-fail Always      -       69591434
  3 Spin_Up_Time            0x0003 097   097   000    Pre-fail Always      -       0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always      -       28
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000f 081   060   030    Pre-fail Always      -       166638935
  9 Power_On_Hours          0x0032 088   088   000    Old_age  Always      -       11321
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always      -       14
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always      -       0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always      -       0
188 Command_Timeout         0x0032 100   097   000    Old_age  Always      -       4295032861
189 High_Fly_Writes         0x003a 098   098   000    Old_age  Always      -       2
190 Airflow_Temperature_Cel 0x0022 072   065   045    Old_age  Always      -       28 (Lifetime Min/Max 26/28)
194 Temperature_Celsius     0x0022 028   040   000    Old_age  Always      -       28 (0 22 0 0)
195 Hardware_ECC_Recovered  0x001a 033   028   000    Old_age  Always      -       69591434
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always      -       0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline     -       0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always      -       0

The Old_age in ‘TYPE’ does not mean the drive is old or past its life expectancy. It marks an attribute that tracks normal wear, so its value is expected to change over the life of the drive. Likewise, ‘Pre-fail’ does not mean that the drive is failing. It marks an attribute that the manufacturer treats as an early warning: only if its normalized VALUE drops to or below THRESH is a failure considered imminent.
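What matters in this table is how VALUE compares to THRESH for each attribute. As a rough sketch, the awk filter below lists any attribute whose VALUE is at or below its THRESH, along with its TYPE. The function name is my own, and the column positions are assumed to match the table above.

```shell
# Hypothetical helper: reads the attribute table on stdin and prints
# "NAME TYPE" for every attribute whose VALUE ($4) is <= THRESH ($6).
# Rows must start with a numeric ID, which skips the header line.
smart_attrs_at_threshold() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) { print $2, $7 }'
}
```

Feeding it smartctl -A /dev/sda (attributes only) keeps other parts of the report out of the way. No output means no attribute has reached its threshold.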

The Reallocated_Sector_Ct is a good value to keep an eye on. Each drive has some spare sectors to replace those that fail, and an occasional bad sector is typical on some drives. A large number here, however, might be indicative of problems to come.
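To keep an eye on a single attribute over time, you can pull out just its RAW_VALUE and log it. This is a small sketch with a made-up function name. It assumes the raw value starts in the tenth column, as in the table above.

```shell
# Hypothetical helper: reads the attribute table on stdin and prints the
# first word of RAW_VALUE ($10) for the attribute named in the first argument.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For example: smartctl -A /dev/sda | smart_attr_raw Reallocated_Sector_Ct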

A note about RAID controllers

You can also get the SMART status of drives behind some RAID controllers, for example 3ware cards, using

smartctl -H -d 3ware,P /dev/twa#

Where P is the drive port and # is the controller number (for example /dev/twa0).
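If a controller has several drives, you can generate one command per port. The sketch below only prints the commands so you can review them first. The function name, the device /dev/twa0, and the assumption that ports are numbered from 0 are all mine, so check them against your controller.

```shell
# Hypothetical helper: prints one `smartctl -H` command per port.
# $1 = controller device, $2 = number of ports (assumed numbered from 0).
threeware_health_cmds() {
    tw_dev=$1
    tw_ports=$2
    tw_p=0
    while [ "$tw_p" -lt "$tw_ports" ]; do
        printf 'smartctl -H -d 3ware,%d %s\n' "$tw_p" "$tw_dev"
        tw_p=$((tw_p + 1))
    done
}
```

Running threeware_health_cmds /dev/twa0 4 prints four commands. Once they look right, you can pipe the output to sh to execute them.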