Fixing SMART errors

Although this article focuses on solving “Offline uncorrectable sectors” or “Pending Sector read” SMART errors, the same troubleshooting steps may apply to other issues.

Even if only a single error is reported, we should not rule out a real problem with the disk: it may be unreliable, unsafe, or even close to complete failure. The best way to decide whether the error can be ignored is to monitor the disk for a few days and check whether the error count keeps growing or stays fixed at a single problem.

smartctl -a /dev/DISK
smartctl -A /dev/DISK

The first command prints everything smartctl knows about the disk: detailed hardware information, the critical health parameters, and the overall status. The second prints only the attribute table.
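To follow the counters over several days it can help to log only their raw values. The following is a minimal sketch, assuming the attribute table layout shown in this article (attribute ID in the first column, raw value in the tenth). The helper name smart_raw and the log file path are made up for this example.

```shell
# smart_raw ID: read `smartctl -A` output on stdin and print the RAW_VALUE
# of the attribute with the given numeric ID (hypothetical helper name).
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# Possible daily use (needs root and a real disk, so it is left commented out;
# the log path is only an illustration):
#   for id in 5 196 197 198; do
#       printf '%s attr=%s raw=%s\n' "$(date +%F)" "$id" \
#           "$(smartctl -A /dev/DISK | smart_raw "$id")"
#   done >> /var/log/smart-counters.log
```

Comparing the logged values from one day to the next shows whether a counter is stable or still growing.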

In our case the attribute table shows 1 “Offline uncorrectable” sector:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   152   135   021    Pre-fail  Always       -       3358
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2611
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15640
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2610
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       683
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2611
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
240 Head_Flying_Hours       0x0032   080   080   000    Old_age   Always       -       14689
241 Total_LBAs_Written      0x0032   200   200   000    Old_age   Always       -       19017927647
242 Total_LBAs_Read         0x0032   200   200   000    Old_age   Always       -       10787143978

Normally, when a surface problem is detected, the firmware itself may decide to transparently 'relocate' the bad sector to a spare one. For this reason it is recommended to check the Reallocated_Event_Count value to see whether the disk has already relocated many sectors (which would indicate that multiple errors were found in the past, and that it may be better to replace the disk).

Current_Pending_Sector is just that: the number of locations the disk knows need to be reallocated but has not reallocated yet, because it has no good copy of the data to move. Once you write to such a location, the disk automatically reallocates it, writes the new data in the new place, and the pending sector count decreases.

In our example, Offline_Uncorrectable is 1 and Reallocated_Event_Count is 0, so after some days without new errors appearing we decided to 'clear' the error condition, that is, to write to the sector to force the drive to reallocate it.

From the output of smartctl we can get details about where the problematic sector was found:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     13566         -
# 2  Extended offline    Completed: read failure       90%     13554         34066483
# 3  Short offline       Completed without error       00%         0         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1

So the LBA of the sector where the problem was found is 34066483.

Now we check which partition that sector falls into:
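When this has to be repeated on several disks, the LBA can be pulled out of the self-test log automatically. This is a rough sketch, assuming the log layout shown above, where the LBA is the last column and the most recent test is listed first. The function name first_error_lba is invented for this example.

```shell
# first_error_lba: read `smartctl -l selftest` (or `smartctl -a`) output on
# stdin and print the LBA of the most recent self-test that ended with a
# read failure. Prints nothing if no such entry exists.
first_error_lba() {
    awk '/^# *[0-9]+/ && /read failure/ && $NF ~ /^[0-9]+$/ { print $NF; exit }'
}

# Possible use (needs root and a real disk):
#   smartctl -l selftest /dev/sdb | first_error_lba
```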

# fdisk -lu /dev/sdb
Disk /dev/sdb: 250.0 GB, 250000000000 bytes
255 heads, 63 sectors/track, 30394 cylinders, total 488281250 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x4dc5fddf

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    16779263     8388608   83  Linux
/dev/sdb2        16779264    50333695    16777216   83  Linux
/dev/sdb3        50333696   113248255    31457280   83  Linux

34066483 lies between 16779264 and 50333695, so the sector falls into sdb2.

As the partition holds an ext3 filesystem, the sector may be (oops, … was) containing data. We need some math to find out, starting with the block size of the filesystem:
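The range comparison can also be left to awk. The sketch below assumes the fdisk table layout shown above (device name first, then an optional boot flag, then the Start and End sectors). The function name partition_for_lba is hypothetical.

```shell
# partition_for_lba LBA: read `fdisk -lu DISK` output on stdin and print
# "device start" for the partition whose [Start, End] range contains LBA.
# Prints nothing if the LBA lies outside every listed partition.
partition_for_lba() {
    awk -v lba="$1" '
        /^\/dev\// {
            i = ($2 == "*") ? 3 : 2          # skip the optional boot flag
            if (lba + 0 >= $i + 0 && lba + 0 <= $(i + 1) + 0) {
                print $1, $i
                exit
            }
        }'
}

# Possible use (needs root and a real disk):
#   fdisk -lu /dev/sdb | partition_for_lba 34066483
```

Printing the start sector together with the device name is convenient, since the start sector is needed again for the block calculation.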

# tune2fs -l /dev/sdb2 | grep Block
Block count:              4194304
Block size:               4096
Blocks per group:         32768

Now we can determine which filesystem block contains that LBA sector:

fsblocknum = (int) (((LBA - partitionstart) * 512) / fsblock_size)

In our case:

(int) ((34066483 - 16779264) * 512 / 4096) = (int) 2160902.375 = 2160902

(The decimal part appears because each block spans 8 sectors: 0.375 * 8 = 3, so our sector sits at offset 3 inside block 2160902, that is, it is the fourth sector of the block.)

Use debugfs to find the inode that owns this block, and the file that the inode belongs to:
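The same arithmetic can be done with shell integer math, which avoids the fractional part and gives the position of the sector inside the block as a remainder (counted from 0). A small sketch using the values from this example; the variable names are chosen only for illustration.

```shell
# Values taken from the example in this article.
lba=34066483          # LBA_of_first_error from the self-test log
part_start=16779264   # Start sector of /dev/sdb2 from fdisk
sector_size=512       # bytes per sector, from fdisk
fs_block_size=4096    # "Block size" from tune2fs

sectors_per_block=$(( fs_block_size / sector_size ))
fs_block=$(( (lba - part_start) / sectors_per_block ))
sector_in_block=$(( (lba - part_start) % sectors_per_block ))

echo "filesystem block: $fs_block, sector offset inside block: $sector_in_block"
# prints: filesystem block: 2160902, sector offset inside block: 3
```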

# debugfs
debugfs 1.42.5 (29-Jul-2012)
debugfs:  open /dev/sdb2
debugfs:  icheck 2160902
Block   Inode number
2160902 8
debugfs:  ncheck 8
Inode   Pathname
debugfs:
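The same two lookups can be run without the interactive prompt by passing a single request to debugfs with -R. The following sketch assumes the two-column icheck output shown above. The helper name inode_from_icheck is made up for this example.

```shell
# inode_from_icheck: read `debugfs -R "icheck BLOCK" DEVICE` output on stdin
# and print the inode number that owns the block. Prints nothing when the
# second column is not a number (for example when the block is not in use).
inode_from_icheck() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $2; exit }'
}

# Possible use (read-only lookups, but they need root and a real device):
#   inode=$(debugfs -R "icheck 2160902" /dev/sdb2 2>/dev/null | inode_from_icheck)
#   [ -n "$inode" ] && debugfs -R "ncheck $inode" /dev/sdb2
```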

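We're lucky: icheck reports that the block belongs to inode 8, but ncheck finds no pathname for it, so no regular file uses that block and we can overwrite it. (On ext3, inode 8 is the reserved journal inode, so it is safer to do the next step with the filesystem unmounted.)

To force the disk to reallocate this bad block we'll write zeros to the bad block, and sync the disk:

dd if=/dev/zero of=/dev/sdb2 bs=4096 count=1 seek=2160902
sync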

http://www.gra2.com/article.php/20041015232512624

Lady 3Jane