Have you ever wanted to predict that a piece of hardware in your server was failing before it actually caused the server to crash?
Sure! We all do.
Over the past few months, I have been tracking the correlation between errors logged to the Machine Check Event Log (MCElog) and the hard crash of a server or application running on that server (mostly MySQL). So far, the correlation is about 90%. That is to say, about 9 times out of 10, there will be an error logged to the MCElog before the server actually crashes. It may take days or even weeks between the time of the logged error and the crash, but it will happen. We are now actively monitoring this log and replacing hardware (RAM and CPUs) which show errors before they actually fail which I thought was pretty cool, so I thought I would share how we are doing it.
On Debian, there is a package for the mcelog utility which will allow you to decode and display the kernel messages logged to /dev/mcelog Part of this package is a cron job which outputs the decoded contents of /dev/mcelog to /var/log/mcelog every 5 minutes:
*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
We modify this a little bit and add another cron job which rotates this log file on reboot:
@reboot root test -f /var/log/mcelog && mv /var/log/mcelog /var/log/mcelog.0
The reason we do this is because after a reboot, which is most likely a result of the hardware repair, we want to clear the active logfile (monitored by the nagios plugin below), so the alert will clear. In case, however, the reboot was not part of the hardware maintenance, we still want to have a record of the hardware errors so we move the log file to mcelog.0.
We then have a simple nagios plugin which monitors /var/log/mcelog for errors:
#!/bin/bash LOGFILE=/var/log/mcelog if [ ! -f "$LOGFILE" ] then echo "No logfile exists" exit 3 else ERRORS=$( grep -c "HARDWARE ERROR" /var/log/mcelog ) if [ $ERRORS -eq 0 ] then echo "OK: $ERRORS hardware errors found" exit 0 elif [ $ERRORS -gt 0 ] then echo "WARNING: $ERRORS hardware errors found" exit 1 fi fi
And thats pretty much it. In just a few weeks we have caught about a dozen hardware faults before they led to server crashes.
Disclaimer: This only works when running a X86_64 kernel and YMMV.
相关推荐
13-Predicting Tie Strength With Social Media
Predicting and Generating Wallpaper Texture with Semantic Properties
A novel Multi-Agent Ada-Boost algorithm for predicting protein structural class with the information of protein secondary structure.
Predicting Continuous Target Variables with Regression Analysis Working with Unlabeled Data - Clustering Analysis Implementing a Multilayer Artificial Neural Network from Scratch Parallelizing Neural ...
In the context of artificial intelligence marketing, there are a wide array of predictive analytic techniques available to achieve this purpose, each with its own unique advantages and disadvantages....
Deep generative model with domain adversarial training for predicting arterial blood pressure waveform from photoplethysmogram signal
Why does this matter for predicting growth and business cycles, or for predicting other economic phenomena? In the rest of this chapter, we discuss how Machine Learning methodologies are useful to ...
使用emd和svm进行预测使用svm进行预测使用emd和svm进行预测使用svm进行预测使用emd和svm进行预测使用svm进行预测(Predicting with emd and svm Predicting with svm Predicting with emd and svm Predicting with ...
PEDLA: predicting enhancers with a deep learning-based algorithmic framework.
Facebook V Predicting Check Ins数据集
Predicting school performance with the early screening inventory Psychology in the Schools Volume 21, January. 1984 PREDICTING SCHOOL PERFORMANCE WITH THE EARLY SCREENING INVENTORY SAMUEL J . ...
文章2:全文Identifying and Predicting Autism Spectrum Disorder Based on Multi-Site Structural MRI With Machine Learning.pdf
Predicting Malicious Behavior fuses the behavioral and computer sciences to enlighten anyone concerned with security and to aid professionals in keeping our world safer. Table of Contents Part I ...
采用基于三参数立方型状态方程的热力学模型预测凝析气压缩因子,李长俊,彭阳,气体压缩因子作为一种基础的热力学参数,通常用于天然气工程中的PVT相态特性分析。为了准确预测不同凝析气压缩因子,本文基于三参
DeepSignals- Predicting Intent of Drivers Through Visual Signals
DeepSignals- Predicting Intent of Drivers Through Visual Signals
At least they do if you are writing a book with the words ‘arti cial intelligence’ in the title. But what is arti cial intelligence? Critical as this eld is, it appears that there is no clear de ...
Different users, Different Opinions: Predicting Search Satisfaction with Mouse Movement Information
介绍的网络上在线文本的预测模型,今年的热点研究问题
The book presents both the traditional and classical methods as well as the most recent and cutting edge advances, providing the reader with a panorama of the challenges and solutions in predicting ...