Fixing SMART errors
Although this is focused on solving “Offline uncorrectable sectors” or “Pending Sector read” SMART errors, the same troubleshooting may be applicable to other issues.
Check for disk status
Although it may be a single error, we should not rule out a real problem with the disk, which may be unreliable or even close to complete failure. The best way to make sure we can ignore these errors is to monitor the disk for some days and check whether the error count grows or stays fixed at a single problem.
smartctl -a /dev/DISK
smartctl -A /dev/DISK
These commands show detailed information about the hardware (-a) as well as the table of critical health attributes and their status (-A).
In our case the attribute table is showing that there is 1 sector “Offline uncorrectable”:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   152   135   021    Pre-fail  Always       -       3358
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2611
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15640
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2610
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       683
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2611
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
240 Head_Flying_Hours       0x0032   080   080   000    Old_age   Always       -       14689
241 Total_LBAs_Written      0x0032   200   200   000    Old_age   Always       -       19017927647
242 Total_LBAs_Read         0x0032   200   200   000    Old_age   Always       -       10787143978
Normally, when a surface problem is detected, the firmware itself may decide to transparently reallocate that sector to a spare one so that the problematic sector is no longer used. For this reason it is recommended to check the Reallocated_Event_Count value to see whether the disk has reallocated many sectors (which would indicate that multiple errors were found in the past and that it may be better to replace the disk).
Current_Pending_Sector is just that: the number of locations the disk knows need to be reallocated but has not reallocated yet, because it has no valid copy of the data to move. Once you write to such a location, the disk automatically reallocates the area, writes the new data in the new place, and the Current_Pending_Sector count decreases.
In our example, Offline_Uncorrectable is 1 and Reallocated_Event_Count is 0, so after some days without new errors appearing we decide to 'clear' the error condition, that is, to write to the sector in order to force the drive to reallocate it.
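To follow these counters over several days, the relevant raw values can be pulled out of the attribute table and appended to a log. The following is a minimal sketch rather than a definitive tool: the function name smart_counters and the log path in the usage comment are made up for illustration, and it assumes the ten-column table layout shown above, where the raw value is the tenth field.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print
# "ATTRIBUTE_NAME RAW_VALUE" for the sector-related counters
# (IDs 5, 196, 197 and 198).
# Assumes the table layout shown above (raw value in column 10).
smart_counters() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $2, $10 }'
}

# Example use (as root, e.g. once a day from cron; the log path is an assumption):
#   { date '+%F %T'; smartctl -A /dev/sdb | smart_counters; } >> /var/log/smart-sdb.log
```

Comparing successive log entries shows whether the counts stay fixed or keep growing.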
Identify where exactly is the problem
Check LBA
From the output of smartctl we can get details about where the problematic sector was found:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     13566         -
# 2  Extended offline    Completed: read failure       90%     13554         34066483
# 3  Short offline       Completed without error       00%         0         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1
So the LBA of the sector where the problem was found is 34066483.
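The self-test log can also be printed on its own with smartctl -l selftest /dev/DISK. As a sketch (the function name first_error_lba is hypothetical), the LBA of a failed test can be picked out of that log by taking the last field of every entry that reports a read failure:

```shell
# Hypothetical helper: read a SMART self-test log on stdin and print the
# LBA_of_first_error of every entry whose status mentions "read failure".
first_error_lba() {
    awk '/^#/ && /read failure/ { print $NF }'
}

# Example use (as root):
#   smartctl -l selftest /dev/sdb | first_error_lba
```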
Check partition
Now we check which partition that sector falls in:
# fdisk -lu /dev/sdb
Disk /dev/sdb: 250.0 GB, 250000000000 bytes
255 heads, 63 sectors/track, 30394 cylinders, total 488281250 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x4dc5fddf
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    16779263     8388608   83  Linux
/dev/sdb2        16779264    50333695    16777216   83  Linux
/dev/sdb3        50333696   113248255    31457280   83  Linux
So sector 34066483 falls into sdb2, which spans sectors 16779264 to 50333695.
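The same lookup can be done mechanically. The sketch below (the function name partition_for_lba is hypothetical) reads fdisk -lu output in the layout shown above and prints the partition whose Start to End range contains the given sector. It assumes the units are sectors and skips the optional "*" in the Boot column.

```shell
# Hypothetical helper: partition_for_lba LBA < fdisk-output
# Prints the device whose [Start, End] sector range contains LBA.
partition_for_lba() {
    awk -v lba="$1" '
        /^\/dev\// {
            # Skip the optional "*" in the Boot column.
            i = ($2 == "*") ? 3 : 2
            first = $i
            last = $(i + 1)
            if (lba + 0 >= first + 0 && lba + 0 <= last + 0) print $1
        }'
}

# Example use (as root):
#   fdisk -lu /dev/sdb | partition_for_lba 34066483
```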
Check the data inside the partition
As the partition holds an ext3 filesystem, the sector may be (oops, … was) holding data. We need some math to find out; we start by checking the block size of the filesystem:
# tune2fs -l /dev/sdb2 | grep Block
Block count:              4194304
Block size:               4096
Blocks per group:         32768
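Now we can determine which filesystem block contains that LBA sector: subtract the partition's start sector from the LBA, multiply by the sector size in bytes, divide by the filesystem block size, and keep the integer part.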
In our case:
(int) ((34066483 - 16779264) * 512 / 4096) = (int) 2160902.375 = 2160902
(The fractional part arises because each filesystem block spans 8 sectors: 0.375 * 8 = 3, so our sector sits at offset 3, counting from 0, inside block 2160902.)
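The arithmetic can be double-checked with shell integer arithmetic, which truncates the division just like the (int) above. This is a small sketch: the function name lba_to_fs_block and its argument order are illustrative, and the default 512-byte sector size is taken from the fdisk output.

```shell
# Hypothetical helper: lba_to_fs_block LBA PART_START FS_BLOCK_SIZE [SECTOR_SIZE]
# Prints "<filesystem block> <sector offset inside that block>".
lba_to_fs_block() {
    lba=$1; start=$2; bsize=$3; ssize=${4:-512}
    per_block=$(( bsize / ssize ))    # sectors per filesystem block
    rel=$(( lba - start ))            # sector number relative to the partition start
    echo "$(( rel / per_block )) $(( rel % per_block ))"
}

lba_to_fs_block 34066483 16779264 4096    # prints: 2160902 3
```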
Check for data in the block
Use debugfs to find the inode that uses this block, and the file that owns that inode:
# debugfs
debugfs 1.42.5 (29-Jul-2012)
debugfs:  open /dev/sdb2
debugfs:  icheck 2160902
Block   Inode number
2160902 8
debugfs:  ncheck 8
Inode   Pathname
debugfs:
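We're lucky: ncheck finds no pathname for inode 8, so no file in the directory tree uses that block. (Inode 8 is the reserved journal inode on ext3, so the block belongs to the journal rather than to a regular file.) We therefore go ahead and overwrite it.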
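debugfs can also run a single request non-interactively through its -R option, which is handy in scripts. The sketch below is only a convenience wrapper under stated assumptions (the name block_owner is made up, and the parsing assumes the two-column output shown above). It asks for the inode that uses a block and then for the pathnames of that inode. debugfs opens the filesystem read-only by default, so nothing is modified.

```shell
# Hypothetical wrapper: block_owner DEVICE FS_BLOCK
# Prints the inode that uses the block (if any) and any pathname
# debugfs finds for that inode.
block_owner() {
    dev=$1; blk=$2
    # icheck prints a header line, then "<block> <inode>"; when no inode
    # uses the block, the second column is not a number.
    inode=$(debugfs -R "icheck $blk" "$dev" 2>/dev/null |
            awk 'NR > 1 && $2 ~ /^[0-9]+$/ { print $2 }')
    if [ -z "$inode" ]; then
        echo "block $blk: no inode found"
        return 0
    fi
    echo "block $blk: inode $inode"
    # ncheck prints a header line, then one "<inode> <pathname>" line per name.
    debugfs -R "ncheck $inode" "$dev" 2>/dev/null | awk 'NR > 1'
}

# Example use (as root):
#   block_owner /dev/sdb2 2160902
```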
Force reallocation
To force the disk to reallocate this bad block we'll write zeros to the bad block, and sync the disk:
dd if=/dev/zero of=/dev/sdb2 bs=4096 count=1 seek=2160902
sync
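Because a wrong seek value destroys data, it can be worth rehearsing the command on a scratch file before touching the real partition. The sketch below is such a rehearsal and not part of the original procedure: zero_block is a hypothetical wrapper around the same dd invocation, with conv=notrunc added so that a regular file is not truncated after the written block (on a block device that flag makes no difference).

```shell
# Hypothetical wrapper: zero_block TARGET BLOCK_SIZE BLOCK_NUMBER
# Overwrites exactly one block of TARGET with zeros, then flushes caches.
zero_block() {
    dd if=/dev/zero of="$1" bs="$2" count=1 seek="$3" conv=notrunc 2>/dev/null
    sync
}

# Rehearsal on a scratch file of three 4096-byte blocks filled with "x":
#   scratch=$(mktemp)
#   dd if=/dev/zero bs=4096 count=3 2>/dev/null | tr '\0' 'x' > "$scratch"
#   zero_block "$scratch" 4096 1
#   # Count the non-zero bytes left in block 1; 0 means it was zeroed.
#   dd if="$scratch" bs=4096 skip=1 count=1 2>/dev/null | tr -d '\0' | wc -c
```

Once the rehearsal behaves as expected, the same block size and block number can be used against the real partition.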