Neural Networks to Predict Data Centre Failures

Article By : Rick Merritt

Hewlett-Packard Enterprise is using neural networks to predict failures on some of the 4 million hard disk drives that its InfoSight service monitors.

SAN JOSE, Calif. — Hewlett-Packard Enterprise is using neural networks to predict failures on some of the 4 million hard disk drives that its InfoSight service monitors. The project taught HPE that using neural networks takes time, specialized expertise, and some big iron.

The first challenge of the ongoing program was curating data sets on failures of select models of hard drives. The job took six to nine months, given that engineers had to wade through a pool of 2 petabytes of system data spanning two years. The flip side is that without enough data, you can't adequately train a neural network.
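The curation step comes down to turning raw telemetry and failure records into labeled training examples. The sketch below shows one common way to do that: mark each drive snapshot as "failing soon" if the drive failed within a lookahead window. The field names and the 30-day window are assumptions for illustration; HPE has not published its labeling scheme.

```python
# Minimal labeling sketch (hypothetical fields and window; HPE's actual
# scheme is not public): a snapshot is positive if the drive failed
# within LOOKAHEAD days of the snapshot date.
from datetime import date, timedelta

LOOKAHEAD = timedelta(days=30)

def label_snapshots(snapshots, failure_dates):
    """snapshots: list of (drive_id, snapshot_date, features);
    failure_dates: dict mapping drive_id -> failure date (absent if healthy)."""
    labeled = []
    for drive_id, snap_date, features in snapshots:
        failed_at = failure_dates.get(drive_id)
        label = int(failed_at is not None and
                    snap_date <= failed_at <= snap_date + LOOKAHEAD)
        labeled.append((features, label))
    return labeled

snapshots = [
    ("d1", date(2019, 1, 1), [0.1, 0.2]),  # fails Jan 20: inside the window
    ("d2", date(2019, 1, 1), [0.0, 0.1]),  # never fails
    ("d1", date(2018, 6, 1), [0.1, 0.1]),  # fails, but months later
]
failures = {"d1": date(2019, 1, 20)}
print(label_snapshots(snapshots, failures))
# -> [([0.1, 0.2], 1), ([0.0, 0.1], 0), ([0.1, 0.1], 0)]
```

Sweeping two years of logs this way is what makes the curation slow: every snapshot of every monitored drive has to be joined against the failure records before any training can start.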

HPE’s engineers used the data sets to train a standard neural-network model that they chose. But they found that the system worked much better if they created custom neural-net classifiers for the different models of drives that they were monitoring and wrote their own code implementing them.
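The per-model approach can be sketched as training one small classifier per drive model rather than a single shared one. Everything below is illustrative: the synthetic telemetry, the four invented features, and the use of scikit-learn's MLPClassifier are assumptions, not HPE's actual code or features.

```python
# Hypothetical sketch: one small neural-net classifier per drive model,
# trained on synthetic SMART-style telemetry. Features and data are
# invented; HPE's real pipeline is custom and not public.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_telemetry(n=2000):
    """Synthetic per-drive features, e.g. reallocated sectors, seek error
    rate, temperature, power-on hours (all invented)."""
    X = rng.normal(size=(n, 4))
    # Failing drives (label 1) skew toward high values of the first features.
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)
    return X, y

# Train a separate classifier for each drive model, as the article describes.
classifiers = {}
for drive_model in ["model_A", "model_B"]:
    X, y = make_telemetry()
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    classifiers[drive_model] = (clf, clf.score(X_te, y_te))

for name, (clf, acc) in classifiers.items():
    print(f"{name}: held-out accuracy {acc:.2f}")
```

Splitting by drive model lets each classifier learn the failure signatures specific to that hardware, which is one plausible reason HPE saw custom classifiers outperform a single generic model.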

The optimizations took another six to nine months and a team of three Ph.D.-level data scientists, skills that are hard, and sometimes expensive, to find in these days of an AI boom. But it paid off.

“As we move to a custom model — and one of my Ph.D.s is still working on it — we’ve seen ten- to hundredfold speedups over using standard open-source libraries,” said Christopher Cheng, a distinguished technologist at HPE who supervises the project.

Once the system is running, it's important to keep data sets and neural-net models up to date. You also have to watch out for false positives, Cheng said, a rate that HPE aims to cut to less than half the actual failure rate in the next six to nine months.
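The false-positive target can be made concrete with some arithmetic. The figures below are invented for illustration: if roughly 1% of monitored drives actually fail in a year, holding false positives under half the failure rate means wrongly flagging fewer than 0.5% of healthy drives.

```python
# Hypothetical numbers illustrating the false-positive target: with a ~1%
# failure rate, "false positives under half the failure rate" means
# flagging fewer than 0.5% of healthy drives. All figures are invented.
def false_positive_rate(true_labels, predicted):
    """Fraction of healthy drives (label 0) wrongly flagged as failing."""
    fp = sum(1 for t, p in zip(true_labels, predicted) if t == 0 and p == 1)
    healthy = sum(1 for t in true_labels if t == 0)
    return fp / healthy

# 10,000 monitored drives: 100 real failures (1%). The predictor flags
# 140 drives, 90 of them correctly, so 50 healthy drives are flagged.
truth = [1] * 100 + [0] * 9900
preds = [1] * 90 + [0] * 10 + [1] * 50 + [0] * 9850
fpr = false_positive_rate(truth, preds)
failure_rate = 100 / 10000
print(f"false-positive rate: {fpr:.4%}")        # ~0.51% of healthy drives
print(f"target: under {failure_rate / 2:.2%}")  # 0.50%
```

In this invented scenario the predictor just misses the target; each false positive costs a needless technician visit, which is why the rate matters as much as raw accuracy.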


HPE uses a bank of up to 20 two-socket Xeon servers, some with GPU accelerator cards, to train and run inference on its custom neural network. (Source: HPE)

Given that HPE is one of the world’s largest server makers, putting the hardware in place was relatively easy. Companies without extra racks of computers on hand and experts on staff may find this aspect more challenging.

Today, the work runs on a bank of up to 20 two-socket HPE Gen 9 servers, some using Nvidia GPU cards as accelerators. The hardware can train a new model in less than a day and run an inference job in less than an hour.

For HPE, the short-term goal was improving reliability of its customers’ systems. So far, results are significantly better than using traditional statistical methods. That may not be true for neural nets in all use cases.

Cheng declined to share any actual ROI numbers. And he noted that results depend, in part, on how quickly customers react to predictions by sending a technician into a data center to replace a drive.

The project is successful and strategic enough that HPE is continuing the work. It already has a team collecting data to predict memory-card failures, and a separate group will explore predicting blips in software performance. Engineers also aim to collapse the six to nine months that it took to create the first data sets.

Long term, the work is part of a broad vision of an internet of things in which all devices are monitored with a combination of neural networks and traditional techniques. “We were trying to prepare for the IoT and using machine learning to manage it,” Cheng said.

