Blog: Six Lessons from the CPU Meltdown

By Phil White

The chief technologist of a computer hardware and software company shares some basic principles for plugging the security gaps in the next Meltdown or Spectre.

In January, Intel surprised us all with the news that its Meltdown and Spectre CPU firmware patches were flawed. The instruction was to immediately stop distributing and installing the firmware patches.

Unfortunately, people who had already installed the patch could not uninstall it, leaving them with spontaneous reboots and unpredictable system behavior. Now even AMD has joined the debacle, facing class-action lawsuits over how it responded to the flaws.

Some OEMs were better prepared than others, with dedicated labs and processes to test patches before deploying them. Others, not so much. But there are a few things we can do to help protect our customers and ourselves.

1. Maintain extra CPU headroom: It's important to have enough CPU resources in place to handle workloads in all failure scenarios. It is now clear we also need to account for software mitigations for this new class of hardware flaws, since those mitigations can themselves significantly reduce performance.
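As a back-of-the-envelope sketch of what "extra headroom" means in practice, you can inflate each workload's CPU demand by an assumed mitigation overhead and check that the remaining nodes still absorb the load after a failure. The overhead figure and workload numbers below are illustrative assumptions, not measurements:

```python
# Illustrative capacity check: do workloads still fit after a node failure,
# once an assumed patch overhead is factored in? All numbers are hypothetical.

def required_capacity(workloads_pct, mitigation_overhead=0.20):
    """Total CPU demand after inflating each workload by the patch overhead."""
    return sum(w * (1 + mitigation_overhead) for w in workloads_pct)

def survives_node_failure(workloads_pct, nodes, node_capacity_pct=100,
                          mitigation_overhead=0.20):
    """True if the remaining nodes can absorb the patched demand when one
    node fails (classic N-1 capacity planning)."""
    demand = required_capacity(workloads_pct, mitigation_overhead)
    return demand <= (nodes - 1) * node_capacity_pct

workloads = [40, 35, 30, 25]  # per-workload CPU demand, as % of one node
print(survives_node_failure(workloads, nodes=3))  # 156% demand vs 200% -> True
print(survives_node_failure(workloads, nodes=2))  # 156% demand vs 100% -> False
```

The point of the sketch is the shape of the calculation: a cluster that comfortably met its N-1 budget before patching can quietly fall below it once a 10–30% mitigation penalty is applied.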

2. Be prepared to respond: One of the biggest frustrations of this incident was the apparent lack of processes in place to address flaws like Meltdown and Spectre. In this case, Intel was late releasing the microcode to fix the flaws, and even then let it out the door with bugs. OEMs need to define a path through their quality-assurance departments that ensures a software build or patch is ready to mitigate security vulnerabilities without causing downtime.

3. Be flexible and adaptable: If you don't have processes in place to address fixes quickly, at least have the flexibility to drop other work and shift gears to get the job done. Adjust priorities as needed to free up the resources to verify that patched systems run with a level of stability you are comfortable with. Have a team ready to support customers who are pushing the performance envelope.

4. Internal and external communications are key: When the patch flaw was revealed, my company briefed employees internally to help them understand the severity of the issue, how it affected system vulnerability, and what we were doing to address it. As a result, our teams were ready with answers when our customers called.

5. Automate testing and know the variables: Test automation speeds the process of applying microcode and OS patches as they come down from vendors. Make sure you can replicate as many customer scenarios as possible to provide an accurate assessment. You also need to know whether customers are running untrusted code, since Spectre-class attacks require an attacker to execute code on the target system.
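One piece of such automation is a regression gate: re-run each customer scenario's benchmark on the patched build and flag anything that drops more than an agreed tolerance below its pre-patch baseline. The sketch below is a minimal illustration of that idea; the scenario names and throughput numbers are made up:

```python
# Hypothetical regression gate for patched builds: compare each customer
# scenario's benchmark result against its pre-patch baseline.

def regressions(results, tolerance=0.10):
    """Return (scenario, fractional_drop) pairs whose patched throughput
    fell more than `tolerance` below the baseline."""
    failed = []
    for scenario, (baseline, patched) in results.items():
        drop = (baseline - patched) / baseline
        if drop > tolerance:
            failed.append((scenario, round(drop, 3)))
    return failed

# Illustrative benchmark numbers (ops/sec) before and after patching.
results = {
    "db-oltp":         (12000, 10200),  # 15% drop -> fails a 10% gate
    "web-frontend":    (45000, 43600),  # ~3% drop -> passes
    "batch-analytics": (800,   745),    # ~7% drop -> passes
}
print(regressions(results))  # [('db-oltp', 0.15)]
```

Wiring a check like this into the patch pipeline means an I/O-heavy scenario that takes a disproportionate mitigation hit is caught in the lab rather than reported by a customer.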

6. It takes a trusted village: If our processor and software manufacturers can’t be open and honest, we’re all going to have to look after each other. Think open source communities. If someone spots something odd during a testing process, they can inform others rather than wait for an official statement or patch release from the manufacturer.

For this type of community to work, we must trust each other, even our competitors. For our mutual security, we need an environment where data can be shared for testing, and we must honor each other's proprietary secrets. Until such an environment is in place, we need to rely on our internal processes, automation, and communication to ensure we're delivering quality products and timely fixes.

— Phil White is Chief Technical Officer of Scale Computing.
