Monitoring the Health of an Intel DC S3500 SSD

After installing an Intel DC S3500 SSD in a server, I needed to connect it to the monitoring system.

To do that, I had to update the smartctl drive database, work out what the SMART attributes mean, and choose the ones to monitor.

To make sense of the attributes, you need the drive's specification. I found it at http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3500-spec.pdf

According to the specification, several attributes are suitable for monitoring:

05h Re-allocated Sector Count
Raw value: shows the number of retired blocks since leaving the factory (grown defect count). Normalized value: beginning at 100, shows the percent remaining of allowable grown defect count.
BBh Uncorrectable Error Count
The raw value shows the count of errors that could not be recovered using Error Correction Code (ECC). Normalized value: always 100.
C2h Temperature – Device Internal Temperature
Raw value: Reports internal temperature of the SSD in degrees Celsius. Temperature reading is the value direct from the printed circuit board (PCB) sensor without offset. Normalized value: 150 – device temperature in C degrees, 100 if device temperature less than 50.
C5h Pending Sector Count
Raw value: number of current unrecoverable read errors that will be re-allocated on next write. Normalized value: always 100.
F1h Total LBAs Written
Raw value: reports the total number of sectors written by the host system. The raw value is increased by 1 for every 65,536 sectors (32MB) written by the host. Normalized value: always 100.
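The conversions described in the specification excerpts can be sketched in shell. This is only an illustration: the function names host_bytes_written and temp_normalized are my own, and the 512-byte sector size is inferred from the "65,536 sectors (32MB)" wording in the spec.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the spec excerpts; not part of smartctl.

# Total bytes written by the host, from the raw value of attribute F1h.
# One raw unit = 65,536 sectors; a 512-byte sector is assumed (32 MiB per unit).
host_bytes_written() {
  echo $(( $1 * 65536 * 512 ))
}

# Normalized value of attribute C2h for a temperature in degrees Celsius:
# 100 if the temperature is below 50, otherwise 150 minus the temperature.
temp_normalized() {
  if [ "$1" -lt 50 ]; then
    echo 100
  else
    echo $(( 150 - $1 ))
  fi
}

host_bytes_written 302918   # example raw value -> 10164241432576
temp_normalized 26          # -> 100
temp_normalized 60          # -> 90
```

The arithmetic relies on the shell supporting 64-bit integers, which bash and dash do on common platforms.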

To make it easier to cross-reference the documentation, the attribute ID can be printed in hexadecimal:

[root@v03-t smartctl]# smartctl -A /dev/sdc | awk '/^ *[0-9]/{printf("0x%02X %s\n",$1,$0)}'
0x05   5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
0x09   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3177
0x0C  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
0xAA 170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
0xAB 171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
0xAC 172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
0xAE 174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       1
0xAF 175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       651 (19 9204)
0xB7 183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
0xB8 184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
0xBB 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
0xBE 190 Temperature_Case        0x0022   081   072   000    Old_age   Always       -       19 (Min/Max 13/28)
0xC0 192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       1
0xC2 194 Temperature_Internal    0x0022   100   100   000    Old_age   Always       -       26
0xC5 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
0xC7 199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
0xE1 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       302918
0xE2 226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       2252
0xE3 227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       51
0xE4 228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       190640
0xE8 232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
0xE9 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
0xEA 234 Thermal_Throttle        0x0032   100   100   000    Old_age   Always       -       0/0
0xF1 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       302918
0xF2 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       320022
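To pull out only the attributes chosen from the specification (5, 187, 194, 197 and 241), a small awk filter is enough. The sketch below is one possible way to do it; the function name select_attrs is made up for this example, and it expects plain `smartctl -A` output on stdin (without the hex prefix added above).

```shell
#!/bin/sh
# select_attrs: read `smartctl -A` output on stdin and print "ID NAME RAW"
# for the monitored attributes only. Header lines are skipped because their
# first field is not a number.
select_attrs() {
  awk '$1 ~ /^[0-9]+$/ && ($1==5 || $1==187 || $1==194 || $1==197 || $1==241) {
    print $1, $2, $10
  }'
}

# Typical use on a live system (the device name is an example):
#   smartctl -A /dev/sdc | select_attrs

# Demo on three lines taken from the listing above, hex prefix removed.
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0' \
  '  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3177' \
  '241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       302918' \
  | select_attrs
```

Only the first token of the raw value column is printed, so an attribute with a composite raw value would need extra handling.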

Based on this data, here is a script that reads the SMART attributes of an Intel DC S3500 series drive and reports the drive's overall state. The extra spaces in the printf argument line are there to line each argument up with its placeholder, so that the number and order of the arguments are easy to check.

#!/bin/bash
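smartctl -A /dev/sdc | awk -v prev=0 '
# prev is a baseline subtracted from the host-writes counter (0 here).
# Only attributes 225-228 (E1h-E4h) are used.
/^ *22[5-8] /{
  if ($1==225) {
    # Host writes in GB. The spec gives 65,536 sectors per raw unit,
    # so the 65535 factor understates the total by about 0.0015%.
    value[1]=($10-prev)*65535*512/1000000000
  } else if ($1==226) {
    value[2]=$10/1024      # media wear, percent
  } else if ($1==227) {
    value[3]=$10           # share of reads, percent
  } else if ($1==228) {
    value[4]=$10           # workload duration, minutes
    value[5]=$10/60/24     # the same duration in days
  }
}
END{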
  printf("The workload took %s minutes (%s days) to complete with %s%% reads and %s%% writes. A total of %sGB of data was written to the device, which increased the media wear in the drive by %s%%. At this point in time, this workload is causing a wear rate of %s%% for every %s minutes, or %s%%/hour.\n",
                            value[4],   value[5],                 value[3],      100-value[3],           value[1],                                                                              value[2],                                                            value[2],      value[4],      value[2]/value[4]*60);
}'

And here is its output:

The workload took 190640 minutes (132.389 days) to complete with 51% reads and 49% writes. A total of 10164.1GB of data was written to the device, which increased the media wear in the drive by 2.19922%. At this point in time, this workload is causing a wear rate of 2.19922% for every 190640 minutes, or 0.000692159%/hour.
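The figures in this output can be extrapolated into a rough lifetime estimate. The sketch below assumes the wear rate stays constant, which a real workload will not guarantee, and the function name days_to_full_wear is my own.

```shell
#!/bin/sh
# days_to_full_wear WEAR_PERCENT MINUTES
# Linear extrapolation: the number of days needed to reach 100% wear
# if WEAR_PERCENT was accumulated over MINUTES of workload.
days_to_full_wear() {
  awk -v wear="$1" -v minutes="$2" 'BEGIN {
    if (wear <= 0) { print "n/a"; exit }
    printf("%.0f\n", minutes * 100 / wear / 1440)
  }'
}

# Values from the output above: 2.19922% wear over 190640 minutes.
days_to_full_wear 2.19922 190640   # -> 6020 (about 16.5 years)
```

With zero measured wear the function prints n/a instead of dividing by zero.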

And a script like this one can be used with Zabbix:

#!/bin/bash

  if [[ -z "$1" ]] ; then
    echo -e "ZABBIX PARAM NEED [?]"
    exit
  fi
  if [[ ! "$1" =~ ^sd[a-z]+$ ]] ; then
    echo -e "INVALID ZABBIX PARAM[$1]"
    exit
  fi
  if [ ! -b "/dev/$1" ] ; then
    echo "No block device /dev/$1 found"
    exit
  fi
  RESULT=`/usr/sbin/smartctl -A "/dev/$1" | awk '/^ *226/{printf("%d\n", $10/1024)}'`
  if [ -z "${RESULT}" ] ; then
    echo -e "SMART Error"
    exit
  fi

  if [ "${RESULT}" -le 20 ] ; then
    echo "OK ${RESULT}"
  else
    echo "Wearout ${RESULT}%"
  fi

While the SSD's wear level is 20% or lower, the script returns the status "OK" followed by the value. Above 20%, the status is "Wearout" followed by the wear percentage.
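The threshold logic can be tried out without a drive by feeding it raw values directly. This is a minimal sketch: wear_status is a hypothetical helper that mirrors the comparison in the script, taking the raw value of attribute 226 (units of 1/1024 percent) as its argument.

```shell
#!/bin/sh
# wear_status RAW
# Prints the same status strings as the Zabbix script for a given raw value
# of attribute 226. Integer division truncates, as %d does in the awk call.
wear_status() {
  pct=$(( $1 / 1024 ))
  if [ "$pct" -le 20 ]; then
    echo "OK $pct"
  else
    echo "Wearout ${pct}%"
  fi
}

wear_status 2252    # raw value from the listing earlier -> OK 2
wear_status 21503   # 20.99% truncates to 20            -> OK 20
wear_status 21504   # exactly 21%                        -> Wearout 21%
```

Because the percentage is truncated, the status only changes once wear reaches a full 21%.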