SMART (Self-Monitoring, Analysis, and Reporting Technology) is built into most hard drives today. You can take advantage of it to check and test for hard drive failure on running systems. Almost all Linux distributions include the smartmontools package. (I say almost because it's impossible to be familiar with all of them.) Here are some handy commands that take advantage of the reporting and testing features of the Linux SMART tools.
Please note that I am using the device /dev/sda in the following examples; this may or may not be the storage device in your system.
Print the overall health of a drive
smartctl -H /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
As you can see, my device currently has a passing grade. This is not a final result, however; it simply means that the drive has not failed any previous tests or found any problems in the time it has run since. To tell the drive to dig a little deeper you can use smartctl to run some tests. Let's do that now.
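If you want to check the health result from a script rather than by eye, something like the following might do. It is only a sketch: the function name health_result is my own, and it assumes the ATA-style result line shown in the output above. The sample text stands in for real output so the snippet runs without a drive or root access.

```shell
# Print whatever follows "test result:" in `smartctl -H` output read on stdin.
# health_result is a made-up helper name, not part of smartmontools.
health_result() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Sample text copied from the output above. On a real system you would run:
#   smartctl -H /dev/sda | health_result
sample='=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED'

result=$(printf '%s\n' "$sample" | health_result)
if [ "$result" = "PASSED" ]; then
    echo "drive reports healthy"
else
    echo "drive needs attention: ${result:-no result found}"
fi
```

Anything other than PASSED, including no result line at all, is treated as a reason to look closer.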
According to the documentation, this command can be given during normal system operation (unless run in captive mode).
smartctl --test=short /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Sat Jul 21 12:27:19 2012
When looking at the response I got a little scared when I saw “off-line mode”, but that simply means that the test runs while the machine is functioning normally. You will notice that this test takes around a minute to complete, after which you can use the aforementioned overall health command to get a quick result. It is best to do this after the test has completed.
The long test can also be run on a live system and does much deeper testing of the device; however, it takes significantly longer to finish.
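Since smartctl tells you how long to wait, a script can read that number instead of guessing. Here is a rough sketch; wait_minutes is a name I made up, and the commented usage relies on smartctl's -l selftest option (which prints the self-test log) as the follow-up step, which is my choice rather than something shown above.

```shell
# Extract N from the "Please wait N minutes" line of the test start-up output.
# wait_minutes is a made-up helper name, not part of smartmontools.
wait_minutes() {
    sed -n 's/.*Please wait \([0-9][0-9]*\) minutes.*/\1/p'
}

# Sample text copied from the output above.
sample='Testing has begun.
Please wait 1 minutes for test to complete.'

minutes=$(printf '%s\n' "$sample" | wait_minutes)
echo "check the results in about $minutes minute(s)"

# On a real system the whole thing might look like this:
#   minutes=$(smartctl --test=short /dev/sda | wait_minutes)
#   sleep $((minutes * 60)) && smartctl -l selftest /dev/sda
```

The same function works for the long test, since the wording of the wait line is identical apart from the number.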
smartctl --test=long /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 86 minutes for test to complete.
Test will complete after Sat Jul 21 14:04:28 2012
86 minutes is a far cry from the short test’s 1 minute, but again this is a much more detailed test.
Getting it all
The next and last command outputs all the SMART information the drive can give. In the response below I have removed most of the output because there is a lot of information to go through. My main point is the command itself, plus something I will get to in just a second.
smartctl -a /dev/sda
Device Model:     ###########
Serial Number:    ########
Firmware Version: ####
User Capacity:    ###,###,###,### bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jul 21 13:24:32 2012 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 50%     11320         -
# 2  Short offline       Completed without error       00%     11319         -
# 3  Short offline       Completed without error       00%      8721         -
# 4  Short offline       Completed without error       00%         1         -
If you look you can see that the long test is in progress and about 50% complete. Running either the overall health command or the detailed command before a test is finished won’t hurt anything, but it also won’t tell you the result of the currently running test until after it’s finished.
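To find out from a script whether a test is still running, you could look for the "in progress" row in the self-test log. This is a sketch under my own assumptions: remaining_percent is a made-up name, and it expects the log wording shown in the output above.

```shell
# Print the percentage from the first "Self-test routine in progress" row,
# or nothing if no test is running.
# remaining_percent is a made-up helper name, not part of smartmontools.
remaining_percent() {
    sed -n 's/.*Self-test routine in progress *\([0-9][0-9]*\)%.*/\1/p' | head -n 1
}

# Sample rows copied from the self-test log above. On a real system:
#   smartctl -a /dev/sda | remaining_percent
sample='# 1  Extended offline    Self-test routine in progress 50%     11320         -
# 2  Short offline       Completed without error       00%     11319         -'

left=$(printf '%s\n' "$sample" | remaining_percent)
if [ -n "$left" ]; then
    echo "self-test still running, ${left}% remaining"
else
    echo "no self-test in progress"
fi
```

Note that the log column is headed Remaining, so the number counts down as the test proceeds.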
I don’t have a failing drive, at least according to SMART. That’s great news! The downside of this is that I don’t have output of a failing drive to put here, but a little google-fu can give you some examples of what you don’t want to see as well as what some of the detailed information means.
A few notes about the detailed output
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       69591434
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       28
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       166638935
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11321
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       14
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       4295032861
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   072   065   045    Old_age   Always       -       28 (Lifetime Min/Max 26/28)
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 22 0 0)
195 Hardware_ECC_Recovered  0x001a   033   028   000    Old_age   Always       -       69591434
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
Old_age in ‘TYPE’ does not mean the drive is old or past its life expectancy; it marks an attribute that tracks normal wear, where a VALUE at or below THRESH would suggest the drive is reaching the end of its design life. ‘Pre-fail’ does not mean that the drive is failing either; it marks an attribute where a VALUE at or below THRESH would indicate that failure is imminent.
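Comparing VALUE against THRESH by eye gets tedious, so here is a small awk sketch that lists any attribute whose VALUE is at or below its THRESH. The function name at_or_below_threshold is hypothetical, and the column positions are assumed to match the table above.

```shell
# List attribute name and TYPE for rows where VALUE ($4) <= THRESH ($6).
# Rows must start with a numeric ID, which skips the header line.
# at_or_below_threshold is a made-up helper name, not part of smartmontools.
at_or_below_threshold() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) { print $2, $7 }'
}

# Sample rows copied from the table above. On a real system:
#   smartctl -a /dev/sda | at_or_below_threshold
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0'

flagged=$(printf '%s\n' "$sample" | at_or_below_threshold)
echo "${flagged:-no attributes at or below threshold}"
```

A Pre-fail name in the output would be the one to worry about first; an Old_age name points to wear rather than imminent failure.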
Reallocated_Sector_Ct is a good value to keep an eye on. Each drive has a few spare sectors to replace those that fail, and many drives will have an occasional bad sector; however, a large number here might be indicative of problems to come.
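To keep an eye on that one number over time, you could pull the raw value out of the attribute table. A minimal sketch follows; reallocated_sectors is a made-up name, and the limit of 10 is an arbitrary alert level I picked for illustration, not a manufacturer figure.

```shell
# Print the RAW_VALUE column (field 10) of the Reallocated_Sector_Ct row.
# reallocated_sectors is a made-up helper name, not part of smartmontools.
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Sample row copied from the table above. On a real system:
#   smartctl -a /dev/sda | reallocated_sectors
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0'

limit=10   # hypothetical alert level chosen for this example
count=$(printf '%s\n' "$sample" | reallocated_sectors)
if [ "${count:-0}" -gt "$limit" ]; then
    echo "warning: $count reallocated sectors (limit $limit)"
else
    echo "reallocated sectors: ${count:-unknown} (limit $limit)"
fi
```

Logging the count from cron and watching for growth is probably more telling than any fixed limit.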
A note about RAID controllers
You can also get the SMART status of drives behind some RAID controllers using
smartctl -H -d 3ware,P /dev/twa#
where P is the drive port.
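If the controller has several ports you will probably want to check each one. The sketch below only prints the commands rather than running them, so it is safe to try anywhere. The device node /dev/twa0 and the port count of 4 are assumptions for illustration, and threeware_commands is a made-up name; ports are numbered from 0.

```shell
# Print one health-check command per port, for ports 0 .. (ports - 1).
# threeware_commands is a made-up helper name, not part of smartmontools.
threeware_commands() {
    device=$1
    ports=$2
    port=0
    while [ "$port" -lt "$ports" ]; do
        echo "smartctl -H -d 3ware,$port $device"
        port=$((port + 1))
    done
}

# Assumed controller node and port count; adjust both for your hardware.
threeware_commands /dev/twa0 4
```

Once the printed commands look right for your system, piping the output to sh as root would run them.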