Smartmontools

Smartmontools can talk to many hard disks to retrieve information and statistics about their condition, which notably makes it possible to spot a disk that is nearing the end of its life.

Basic commands

Enable SMART:

# smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.

or, in a form that is easier to remember:

# smartctl -s on -S on -o on /dev/hda

Check the values:

# smartctl -A /dev/hda
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   252   252   063    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   250   245   187    Pre-fail  Always       -       58790
  9 Power_On_Minutes        0x0032   251   251   000    Old_age   Always       -       714h+26m
 10 Spin_Retry_Count        0x002b   252   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   252   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       83
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       31
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       4749
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       6
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       2
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   252   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   252   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   186   183   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

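Above all, check the "WHEN_FAILED" column: "-" means the attribute has never reached its threshold, whereas a value such as FAILING_NOW (see the dying disk example further down) flags a problem.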
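As a rough sketch of that check, the shell function below filters `smartctl -A` output and keeps only the rows whose WHEN_FAILED column is not "-". The name failed_attributes is made up for this page, and the field positions assume the ten-column layout shown above.

```shell
# Hypothetical helper: list the attributes whose WHEN_FAILED column is not "-".
# Reads `smartctl -A` output on standard input and prints: ID NAME WHEN_FAILED.
# Assumes the ten-column table layout (WHEN_FAILED is the 9th field).
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9 }'
}

# Intended use (as root):
#   smartctl -A /dev/hda | failed_attributes
```

An empty result means no attribute is currently marked as failed.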

Show more information:

# smartctl -a /dev/hda

On my SATA (non-IDE) disk:

# smartctl -d ata ... /dev/sda

Run the self-tests:

# smartctl -t short /dev/hda
# smartctl -t long /dev/hda

Check the results:

# smartctl -l selftest /dev/hda
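To avoid reading the whole self-test log by eye, a filter along these lines could print only the entries that need attention. This is a sketch: selftest_problems is a hypothetical name, and it assumes that log entries start with "#" followed by the entry number and that a clean run is reported as "Completed without error".

```shell
# Hypothetical helper: print self-test log entries that did not finish cleanly.
# Reads `smartctl -l selftest` output on standard input.
# Assumption: entries look like "# 1  Short offline  Completed without error ...".
# Entries still "in progress" are skipped as well.
selftest_problems() {
    awk '/^# *[0-9]+/ && !/Completed without error/ && !/in progress/ { print }'
}

# Intended use (as root):
#   smartctl -l selftest /dev/hda | selftest_problems
```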

Bad blocks

Current hard disks apparently keep a reserve of unused blocks that can replace bad ones. The replacement happens when the bad block is written to (not before, so that there is still a chance to recover the data stored on it).

The Bad block HOWTO listed in the links section discusses this topic in more detail.

In the smartmontools output, the important items are:

Reallocated_Sector_Ct: sectors reallocated over the life of the disk
Current_Pending_Sector: sectors waiting to be reallocated as soon as possible
Reallocated_Event_Count: presumably the number of reallocation operations, as opposed to the number of sectors moved (I am not sure; to be completed)

Example on a 2-year-old disk (look at the RAW_VALUE column), after a full read/write badblocks pass (the bad blocks have all been reallocated):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
  5 Reallocated_Sector_Ct   0x0033   192   192   140    Pre-fail  Always       -       62
[...]
196 Reallocated_Event_Count 0x0032   182   182   000    Old_age   Always       -       18
197 Current_Pending_Sector  0x0012   200   199   000    Old_age   Always       -       0
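To pull just these three counters out of a full attribute listing, something like the following could be used. It is only a sketch: reallocation_summary is a hypothetical name, and it assumes the raw value is the single last field of each row, as in the tables on this page.

```shell
# Hypothetical helper: print NAME=RAW_VALUE for the reallocation-related
# attributes (IDs 5, 196 and 197). Reads `smartctl -A` output on stdin.
# Assumes the ten-column layout, with RAW_VALUE as the 10th field.
reallocation_summary() {
    awk '$1 ~ /^(5|196|197)$/ && NF >= 10 { print $2 "=" $10 }'
}

# Intended use (as root):
#   smartctl -A /dev/hda | reallocation_summary
```

Fed the three rows of the example above, it prints Reallocated_Sector_Ct=62, Reallocated_Event_Count=18 and Current_Pending_Sector=0, one per line.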

Other examples

A dying disk (lots of disk errors), state after a partial badblocks run:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   001   001   051    Pre-fail  Always   FAILING_NOW 2144
  3 Spin_Up_Time            0x0007   100   091   021    Pre-fail  Always       -       2366
  4 Start_Stop_Count        0x0032   099   099   040    Old_age   Always       -       1152
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1305
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1128
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   149   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   099   099   000    Old_age   Always       -       330
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   174   051    Pre-fail  Offline      -       0

smartd

Remember to enable the smartd daemon, which will warn you when a problem is detected (by email sent to root).

Under Debian, in /etc/default/smartmontools, uncomment:

#start_smartd=yes
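A quick way to confirm that the line really is uncommented might look like this. The function name smartd_enabled is hypothetical; the default path is the Debian file mentioned above, and another file can be passed as the first argument.

```shell
# Hypothetical helper: succeed (exit status 0) if start_smartd=yes is
# present and not commented out in the given defaults file.
smartd_enabled() {
    grep -q '^[[:space:]]*start_smartd=yes' "${1:-/etc/default/smartmontools}"
}

# Intended use:
#   smartd_enabled && echo "smartd is set to start"
```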

Example of a notification email:

This email was generated by the smartd daemon running on:

   host name: bob
  DNS domain: centre.local
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/hda, ATA error count increased from 10 to 11

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

In the syslog:

Nov  7 01:27:20 bob kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov  7 01:27:20 bob kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=10997362, high=0, low=10997362, sector=10997359
Nov  7 01:27:20 bob kernel: ide: failed opcode was: unknown
Nov  7 01:27:20 bob kernel: end_request: I/O error, dev hda, sector 10997359
Nov  7 01:28:20 bob smartd[3499]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 200 to 195
Nov  7 01:28:20 bob smartd[3499]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 99 to 97
Nov  7 01:28:20 bob smartd[3499]: Device: /dev/hda, ATA error count increased from 10 to 11
Nov  7 01:28:20 bob smartd[3499]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Nov  7 01:28:21 bob smartd[3499]: Warning via /usr/share/smartmontools/smartd-runner to root: successful

However, I also got the following, this time without any notification:

Nov  7 09:58:20 bob smartd[3499]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 193 to 185
Nov  7 09:58:20 bob smartd[3499]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 96 to 95
Nov  7 09:58:20 bob smartd[3499]: Device: /dev/hda, ATA error count increased from 11 to 13
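Since not every increase triggers an email, it can help to extract the error-count changes directly from the log. The sketch below assumes syslog lines shaped exactly like the ones quoted on this page; ata_error_increases is a made-up name, and /var/log/messages is the default log location named in the notification email.

```shell
# Hypothetical helper: list every ATA error count increase logged by smartd.
# Reads syslog lines on standard input and prints: DATE TIME DEVICE OLD -> NEW.
# Assumes the line format shown above ("... smartd[PID]: Device: DEV, ATA error
# count increased from N to M").
ata_error_increases() {
    sed -n 's/^\(.*\) [^ ]* smartd\[[0-9]*]: Device: \([^,]*\), ATA error count increased from \([0-9]*\) to \([0-9]*\)$/\1 \2 \3 -> \4/p'
}

# Intended use:
#   ata_error_increases < /var/log/messages
```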

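So I do not know exactly in which cases the email notification is sent. The email above does state that no additional messages will be sent about the same problem, which may explain why the second increase went unreported.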
