After an Intel DC S3500 SSD was installed in the server, it had to be added to the monitoring system.
That meant updating the smartctl drive database, making sense of the drive's SMART parameters, and choosing which of them to monitor.
To make sense of the parameters you need the drive's specification. It is available at http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3500-spec.pdf
According to the specification, several attributes are suitable for monitoring:
05h Re-allocated Sector Count
Raw value: shows the number of retired blocks since leaving the factory (grown defect count). Normalized value: beginning at 100, shows the percent remaining of allowable grown defect count.
BBh Uncorrectable Error Count
The raw value shows the count of errors that could not be recovered using Error Correction Code (ECC). Normalized value: always 100.
C2h Temperature – Device Internal Temperature
Raw value: Reports internal temperature of the SSD in degrees Celsius. Temperature reading is the value direct from the printed circuit board (PCB) sensor without offset. Normalized value: 150 – device temperature in C degrees, 100 if device temperature less than 50.
C5h Pending Sector Count
Raw value: number of current unrecoverable read errors that will be re-allocated on next write. Normalized value: always 100.
F1h Total LBAs Written
Raw value: reports the total number of sectors written by the host system. The raw value is increased by 1 for every 65,536 sectors (32MB) written by the host. Normalized value: always 100.
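The raw-value definitions quoted above come down to simple arithmetic. Below is a minimal sketch for two of them, assuming 512-byte sectors (which is what the spec's "65,536 sectors (32MB)" equivalence implies). The function names are hypothetical and are not part of smartctl.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the raw-value definitions quoted above.

# F1h: one raw unit = 65,536 sectors; with 512-byte sectors (assumed) that is 32 MiB.
# Prints the number of bytes written for a given raw value.
lbas_written_bytes() {
    echo $(( $1 * 65536 * 512 ))
}

# C2h: the normalized value is 150 minus the temperature in degrees Celsius,
# or 100 when the temperature is below 50.
temp_normalized() {
    if [ "$1" -lt 50 ]; then
        echo 100
    else
        echo $(( 150 - $1 ))
    fi
}

lbas_written_bytes 1    # bytes in one raw unit
temp_normalized 26      # a cool drive
temp_normalized 60      # a hot drive
```

The normalized temperature only starts to fall once the drive passes 50 degrees, so for alerting the raw value is probably the more convenient one to watch.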
To make it easier to match the output against the documentation, the attribute ID can also be printed in hexadecimal:
[root@v03-t smartctl]# smartctl -A /dev/sdc | awk '/^ *[0-9]/{printf("0x%02X %s\n",$1,$0)}'
0x05 5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
0x09 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3177
0x0C 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3
0xAA 170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
0xAB 171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
0xAC 172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
0xAE 174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 1
0xAF 175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 651 (19 9204)
0xB7 183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
0xB8 184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
0xBB 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
0xBE 190 Temperature_Case 0x0022 081 072 000 Old_age Always - 19 (Min/Max 13/28)
0xC0 192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 1
0xC2 194 Temperature_Internal 0x0022 100 100 000 Old_age Always - 26
0xC5 197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
0xC7 199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
0xE1 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 302918
0xE2 226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 2252
0xE3 227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 51
0xE4 228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 190640
0xE8 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
0xE9 233 Media_Wearout_Indicator 0x0032 098 098 000 Old_age Always - 0
0xEA 234 Thermal_Throttle 0x0032 100 100 000 Old_age Always - 0/0
0xF1 241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 302918
0xF2 242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 320022
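The lookup can also go the other way, from an ID as the spec writes it (for example C5h) to the matching row. A minimal sketch with hypothetical function names; two rows copied from the output above stand in for a live `smartctl -A` call:

```shell
#!/bin/bash
# Convert an ID written as in the spec (e.g. "C2h") to decimal.
spec_id_to_dec() {
    printf '%d\n' "0x${1%h}"
}

# Print the row for a spec-style ID from `smartctl -A` output read on stdin.
attr_row() {
    awk -v id="$(spec_id_to_dec "$1")" '$1 == id'
}

# Sample rows taken from the listing above (assumption: a real run would pipe smartctl here).
sample='194 Temperature_Internal 0x0022 100 100 000 Old_age Always - 26
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0'

spec_id_to_dec C2h
printf '%s\n' "$sample" | attr_row C5h
```

On a live system the same function would be fed directly, for instance `smartctl -A /dev/sdc | attr_row C5h`.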
Based on this data, here is a script that reads the SMART attributes of an Intel DC S3500 series drive and reports its overall state. The extra spaces in the printf argument list are there to make it harder to get the number and order of the arguments wrong.
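#!/bin/bash
# Summarize the timed-workload SMART attributes (225-228) of an Intel DC S3500.
smartctl -A /dev/sdc | awk -v prev=0 '
/^ *22[5-8]/{
    if ($1==225) {
        # 1 raw unit = 65,536 sectors (per the spec), 512 bytes each.
        # Note: a multiplier of 65535 would understate the total by about 0.0015%.
        # prev is an optional earlier reading to subtract.
        value[1]=($10-prev)*65536*512/1000000000
    } else if ($1==226) {
        value[2]=$10/1024      # raw value is the wear percentage times 1024
    } else if ($1==227) {
        value[3]=$10           # share of reads, percent
    } else if ($1==228) {
        value[4]=$10           # workload duration, minutes
        value[5]=$10/60/24     # the same duration in days
    }
}
END{
    printf("The workload took %s minutes (%s days) to complete with %s%% reads and %s%% writes. A total of %sGB of data was written to the device, which increased the media wear in the drive by %s%%. At this point in time, this workload is causing a wear rate of %s%% for every %s minutes, or %s%%/hour.\n",
        value[4],  value[5],  value[3],  100-value[3],  value[1],  value[2],  value[2],  value[4],  value[2]/value[4]*60);
}'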
Here is its output:
The workload took 190640 minutes (132.389 days) to complete with 51% reads and 49% writes. A total of 10164.1GB of data was written to the device, which increased the media wear in the drive by 2.19922%. At this point in time, this workload is causing a wear rate of 2.19922% for every 190640 minutes, or 0.000692159%/hour.
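The duration and wear figures in this output can be re-derived by hand from two raw values in the attribute listing: 2252 (attribute 226, the wear percentage multiplied by 1024) and 190640 (attribute 228, minutes). A minimal sketch of that arithmetic; the function name and output format are hypothetical:

```shell
#!/bin/bash
# workload_summary WEAR_RAW MINUTES
#   WEAR_RAW: raw value of attribute 226 (wear percentage multiplied by 1024)
#   MINUTES:  raw value of attribute 228
workload_summary() {
    awk -v wear_raw="$1" -v minutes="$2" 'BEGIN {
        wear = wear_raw / 1024                      # percent of media wear
        printf "days=%.3f wear=%.5f%% rate=%.9f%%/hour\n",
               minutes / 60 / 24, wear, wear / minutes * 60
    }'
}

workload_summary 2252 190640
# prints: days=132.389 wear=2.19922% rate=0.000692159%/hour
```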
A script like the following can be used with Zabbix:
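#!/bin/bash
# Zabbix check: report SSD wear from SMART attribute 226 of /dev/<name>.

# The device name (for example sdc) must be passed as the first argument.
if [[ -z "$1" ]] ; then
    echo "ZABBIX PARAM NEED [?]"
    exit
fi
# Accept only names such as sda, sdb, ... so nothing unexpected reaches smartctl.
if [[ ! "$1" =~ ^sd[a-z]+$ ]] ; then
    echo "INVALID ZABBIX PARAM[$1]"
    exit
fi
if [ ! -b "/dev/$1" ] ; then
    echo "No block device /dev/$1 found"
    exit
fi
# The raw value of attribute 226 is the wear percentage multiplied by 1024.
RESULT=$(/usr/sbin/smartctl -A "/dev/$1" | awk '/^ *226/{printf("%d\n", $10/1024)}')
if [ -z "${RESULT}" ] ; then
    echo "SMART Error"
    exit    # stop here: the numeric comparison below would fail on an empty value
fi
if [ "${RESULT}" -le 20 ] ; then
    echo "OK ${RESULT}"
else
    echo "Wearout ${RESULT}%"
fi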
While SSD wear is at or below 20%, the script returns the status "OK" followed by the wear value. Above 20%, the status is "Wearout" followed by the wear percentage.
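The threshold logic can be isolated into a function that takes the raw value of attribute 226 directly, which makes it possible to check the boundary cases without a drive. This is a minimal sketch: the function name and the optional second argument (a configurable limit) are hypothetical additions, not part of the script above.

```shell
#!/bin/bash
# wear_status RAW [LIMIT]
#   RAW:   raw value of attribute 226 (wear percentage multiplied by 1024)
#   LIMIT: hypothetical optional threshold in percent, default 20
wear_status() {
    local limit="${2:-20}"
    local percent=$(( $1 / 1024 ))    # integer division, same truncation as awk's %d
    if [ "$percent" -le "$limit" ]; then
        echo "OK ${percent}"
    else
        echo "Wearout ${percent}%"
    fi
}

wear_status 2252     # the value from the listing: 2% wear
wear_status 30720    # 30% wear
```

On the Zabbix side, a script like this is usually wired in through an agent `UserParameter` entry that passes the device name as `$1`; the key and script path in such an entry are whatever you choose.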