Meltdown, Spectre, and why hardware can be correct yet insecure

The recent Meltdown and Spectre attacks have exposed, or at least emphasized, a fundamental problem with the conventional approach to computer security at the hardware level. Both of these attacks rely on side channels in conventional processor designs. By exploiting these side channels, an untrusted program can learn the contents of the operating system kernel’s memory or of another process’s memory, bypassing the standard hardware protection mechanisms based on virtual memory.

The initial response from Intel was apparently that Meltdown and Spectre did not reveal any bugs in their processor implementations — in other words, that their processors were implemented correctly. What’s perhaps more surprising is that they have a pretty good case for making that claim. According to the commonly used definitions of correctness, the processors of Intel, AMD, and other manufacturers — all vulnerable to Spectre — are indeed correct with respect to the way they are used in the attack.

Since these processors are clearly insecure, it probably sounds very odd to claim that they are correct. The way to untangle this paradox is to observe that the usual notion of correctness is broken, at least with respect to security. If we look at this usual notion of correctness more closely, we’ll see that it does not capture what it ought to. And that will lead us to a solution that is already beginning to be applied to build more secure hardware.

Correctness as refinement

What does it mean for a processor implementation to be correct? The conventional notion of correctness is defined with respect to a specification. The specification, which in this case is defined by architecture manuals from Intel and AMD, defines a set of ways that the hardware is allowed to behave when it executes the various possible sequences of instructions. A given processor implementation is considered to be correct as long as its behavior when executing any given sequence of instructions is one of the possible behaviors allowed by the specification.

You might wonder why the specification doesn’t just say exactly how the hardware behaves for every instruction, rather than giving the implementation some flexibility to pick one of the allowed behaviors. The reason is that the same specification is used for many years and many different generations of processors. If the specification said, for example, that reading from memory always took 200 cycles, then designers would never be allowed to use clever tricks (as indeed they have) to speed up those reads. If they did, software that relied on the old specification would run incorrectly on the new generation of processors.

So the conventional idea in computer science is that an implementation is correct if its behaviors are a subset of the allowed behaviors. Another way of saying this is that a correct implementation refines its specification. Unfortunately, this very notion of refinement allows the kind of insecurity seen in Spectre and Meltdown.

Refinement attacks

Suppose that a processor had a silly instruction called zap, whose specification is that it changes the contents of one of the processor’s registers to an arbitrary one-byte value (in the range 0–255). Given this specification, the Generation 1 processor implementation could always set the register’s value to 0. This would be a correct implementation. The Generation 2 implementation could zero out all the bits in the register except the low eight, leaving behind a one-byte value. This would also be correct.

The zap instruction has what we call a nondeterministic specification, because it allows multiple behaviors. Most processors are not this nondeterministic with respect to the values placed into registers, but processors are quite nondeterministic with respect to the time that instructions take to complete. This nondeterminism with respect to time is especially a feature of instructions that read or write to the computer’s memory, and of branch instructions. Both of these kinds of instructions are relatively expensive, and computer architects work very hard to make them run faster, which is (crucially) allowed by the nondeterministic specification.

It’s probably not obvious that zap could create an insecurity. But suppose Generation 3 implemented it in the following crazy way: the processor uses the contents of the register as a memory address, and then replaces the register’s contents with the byte stored at that memory address, ignoring all hardware memory protection. Now the seemingly harmless zap instruction can be used to directly read any memory in the system, including kernel memory. Of course, no one would intentionally implement zap this way, but the point is that this crazy, insecure implementation is perfectly correct: it refines the same specification that the Generation 1 and Generation 2 processors implemented.
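The three generations can be made concrete in a short sketch. Everything here is hypothetical (the zap instruction, the kernel memory contents); the point is only that all three implementations produce a one-byte value, so all three refine the specification, even though Generation 3 leaks memory.

```python
# Illustrative sketch of the hypothetical "zap" instruction from the text:
# the spec allows any one-byte result, and all three generations refine it,
# including the insecure Generation 3.

KERNEL_MEMORY = {0x10: 0x42, 0x20: 0x99}  # hypothetical protected memory

def zap_spec_allows(result):
    """The specification: zap may leave any one-byte value in the register."""
    return 0 <= result <= 255

def zap_gen1(reg):
    return 0                      # always zero: one allowed behavior

def zap_gen2(reg):
    return reg & 0xFF             # keep the low byte: also allowed

def zap_gen3(reg):
    # Insecure: use the register as an address and read protected memory,
    # ignoring all protection. The result is still one byte, so this
    # implementation *refines* the spec -- yet it leaks kernel memory.
    return KERNEL_MEMORY.get(reg, 0) & 0xFF

for impl in (zap_gen1, zap_gen2, zap_gen3):
    for reg in (0x10, 0x20, 0x1234):
        assert zap_spec_allows(impl(reg))  # every generation is "correct"

print(zap_gen3(0x10))  # 66: reading kernel memory with a "correct" zap
```

Refinement only asks whether each behavior is in the allowed set; it never asks where the behavior came from, which is exactly the gap Generation 3 exploits.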

While zap is a bit contrived, Meltdown and Spectre are also attacks that exploit seemingly harmless refinement of nondeterminism. Here, the specification is nondeterministic about the amount of time required to perform memory accesses and branches. Nothing in the processor specification says that the amount of time required to perform these operations doesn’t encode some information about the contents of kernel memory. The clever trick of Meltdown and Spectre attacks is to use the interaction between speculative execution and some internal state of the processor—memory caches and branch prediction tables—to create such an encoding. This encoding of information into time is what’s called a timing channel. Like the Generation 3 zap, these timing channels arise from a dangerous refinement of a nondeterministic specification—a refinement attack.

The attacks are described in more detail elsewhere, but the key idea is to leak information between different possible execution paths. In Meltdown, a speculative execution that reads kernel memory has effects on the state of the processor’s data cache before the processor manages to figure out that the read to kernel memory should not have happened in the first place. The processor then reruns the execution, but the data cache has already been affected. And changes to the data cache create a timing channel because they affect how long it takes to access memory.
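This pattern can be illustrated with a toy simulation. All names and numbers below are invented for illustration (real attacks use hardware caches, fault handling, and cycle counters): architectural state is rolled back after the fault, but the simulated cache is not, and timing the probe accesses recovers the secret byte.

```python
# A toy simulation of the Meltdown pattern: the "cache" is just a set of
# line numbers, and "time" is 1 for a hit, 100 for a miss.

CACHE = set()
LINE = 64                       # cache line size; each byte maps to its own line
KERNEL_SECRET = 0x5A            # the byte the attacker is not allowed to read
PROBE_BASE = 0x100000           # base address of the attacker's probe array

def access(addr):
    """Return the 'time' an access takes, and bring the line into the cache."""
    line = addr // LINE
    t = 1 if line in CACHE else 100
    CACHE.add(line)
    return t

def speculative_window():
    # Speculation: the read of kernel memory and the dependent probe access
    # happen before the permission check completes...
    access(PROBE_BASE + KERNEL_SECRET * LINE)   # probe[secret] is now cached
    # ...then the fault is raised and architectural state is rolled back,
    # but the cache (microarchitectural state) keeps its new contents.
    raise PermissionError("kernel read not permitted")

try:
    speculative_window()
except PermissionError:
    pass  # the architectural effect is squashed; the cache effect is not

# Recovery: time an access to every probe line; only probe[secret] is fast.
recovered = min(range(256), key=lambda b: access(PROBE_BASE + b * LINE))
print(hex(recovered))  # 0x5a
```

The simulation makes the two-phase structure explicit: a squashed speculative execution deposits information into shared state, and an ordinary, fully legal execution reads it back out through timing.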

Timing channels

A timing channel encodes information in time. How does the attacker get information out of a timing channel? Simply by writing a program that measures how long operations take. Timing measurements convert the information in the timing channel back into ordinary data values. And these ordinary data values correspond to the information stored in the memory of the kernel or of other programs. The effect of Meltdown on the data cache depends on the contents of kernel memory, so timing does too.

People have known for more than a decade that processors have timing channels, and they were already worrying about the way that timing channels can be exploited by one program executing in the cloud to steal information from other programs executing in the same datacenter machine. But Meltdown and Spectre show that timing channels are even more dangerous than previously thought: they can be used by a program to read parts of the computer’s memory that weren’t supposed to be accessible, subverting the most fundamental security mechanisms provided by the hardware.

Is there a fix?

If refinement of nondeterministic specifications allows insecure implementations, there seems to be something really broken about our notion of correctness. We need a more powerful way to talk about what it means for an implementation to be correct and secure. We need a more precise kind of specification that allows processor designers to use clever tricks to improve processor performance, but that doesn’t allow them to design in dangerous timing channels. And here there is actually some new science to be done, because there is no standard way to express such specifications.

Fortunately, we don’t have to start from scratch, because researchers have been thinking for decades about how to specify the leakage of information from computing systems and about how to program computing systems in such a way that they don’t leak information. Unfortunately, most of this research hasn’t affected the mainstream ways of building computing systems. The new attacks are showing that it probably should have.

Secure hardware description languages

Existing widely used processors all contain timing channels, and have for decades. Meltdown and Spectre have shown that these timing channels and their implications are extremely subtle. Do hardware designers have a hope of building hardware that doesn’t have timing channels or other side channels?

In recent work, we and others have shown that it is possible to build processors that provably do not have side channels. The key insight is that the hardware description languages (HDLs) that designers use today (such as Verilog or VHDL) can be extended with annotations about information security — for example, to specify which information in the processor is secret and which is public.

Our SecVerilog language extends Verilog with these kinds of annotations. If a processor is implemented in SecVerilog, the SecVerilog compiler uses these annotations to detect and report any side channels to the user at design time. Because enforcement happens almost entirely at design time, the overhead of SecVerilog appears to be very small. When we used our SecVerilog language to build a secure MIPS processor, its area, power, and cycle time were all within 1% of a baseline processor implemented using regular Verilog. In the process of designing this processor, we also discovered — and were forced by SecVerilog to eliminate — an unexpected timing channel arising inside the processor. Our experience with this simple processor suggests that the approach taken in SecVerilog could be applied to a more complex processor with full speculative execution, thereby eliminating the threat of Meltdown, Spectre, and other side-channel attacks. However, implementing a full x86 processor is for now a massive task beyond the reach of a couple of graduate students. And even with a secure HDL, actually closing the channels created by speculation will not be easy; my colleague Adrian Sampson has pointed out some of the challenges.

Noninterference and hyperproperties

We started by observing that the standard notions of correctness were inadequate to capture the security requirements of modern processors. Let’s loop back to that fundamental issue. How do we know that a secure HDL like SecVerilog actually makes things better?

The standard way to specify the lack of insecure information flows is using the concept of noninterference, a term coined by Goguen and Meseguer, although the same idea was introduced earlier by Ellis Cohen. Unlike correctness defined by refinement, where we could examine the possible executions of the system one by one, noninterference asks us to think about two hypothetical executions of the same computing system. The computing system takes in inputs and produces outputs. However, some of the inputs are secrets and some are public. Similarly, some of the outputs are secrets and some are public. Suppose we compare two executions that take in exactly the same public input P but different secret inputs S1 and S2, as depicted by the following figure:

[Figure: noninterference. Two executions take the same public input P but different secret inputs S1 and S2; their public outputs P′ are compared.]

Noninterference then requires that the public output P′ from both executions be exactly the same. Otherwise, an observer who saw different public outputs could infer that the secrets are different in the two executions, and thereby learn something about the secrets—information would leak!
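This definition is easy to state as a check over pairs of executions. Here is a minimal sketch with two toy systems, both invented for illustration: one that ignores its secret entirely, and one that leaks a bit of it.

```python
# Noninterference as an executable check: run the system with the same
# public input P but different secrets, and require identical public output.

def satisfies_noninterference(system, public, secrets):
    """Check the 2-hyperproperty over the given pairs of secret inputs."""
    outputs = {system(public, s) for s in secrets}
    return len(outputs) == 1      # same public output for every secret

def secure_system(public, secret):
    return public * 2             # the output ignores the secret entirely

def leaky_system(public, secret):
    return public * 2 + (secret % 2)  # the secret's low bit leaks out

assert satisfies_noninterference(secure_system, 7, [10, 11, 999])
assert not satisfies_noninterference(leaky_system, 7, [10, 11])
```

Note that no single execution of `leaky_system` looks wrong; only comparing two executions with different secrets exposes the leak, which is exactly why noninterference is relational.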

The key difference between noninterference and standard notions of correctness is that noninterference talks about the relationship between multiple executions of the system, and not just single executions. Noninterference is thus an example of a relational property, one that relates pairs of executions. More generally, it is a hyperproperty, a term introduced by Clarkson and Schneider, which is defined in terms of allowed sets of executions rather than in terms of allowed executions. Since noninterference only requires comparing pairs of executions, it is called a 2-hyperproperty.

While this may sound like a lot of scary mathematics for the poor hardware designer, the good news is that noninterference and related hyperproperties can be enforced by HDLs like SecVerilog. The programmer doesn’t have to think about noninterference directly; if hardware is developed using SecVerilog, the SecVerilog compiler automatically verifies that it satisfies (a generalization of) noninterference.
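To give a flavor of the style of check such a compiler performs, here is a drastically simplified label-propagation check in Python. Real SecVerilog operates on Verilog designs and is considerably richer (its labels can even depend on run-time values), so treat every name and rule here as illustrative.

```python
# A toy design-time information-flow check in the spirit of a
# security-typed HDL: every wire gets a label, and an assignment is
# legal only if no source is more secret than its destination.

PUBLIC, SECRET = 0, 1           # a two-point lattice of security labels

def check_design(labels, assignments):
    """Each assignment is a (dst, srcs) pair; return the list of
    destinations whose assignments would leak information."""
    violations = []
    for dst, srcs in assignments:
        if max((labels[s] for s in srcs), default=PUBLIC) > labels[dst]:
            violations.append(dst)
    return violations

labels = {"key": SECRET, "ciphertext": SECRET, "ready": PUBLIC}

# Flowing the key into another secret wire is fine; flowing it into a
# public "ready" signal would create a channel and is rejected.
assert check_design(labels, [("ciphertext", ["key"])]) == []
assert check_design(labels, [("ready", ["key"])]) == ["ready"]
```

The crucial property is that the check runs over the design itself, before any hardware exists, which is why the run-time overhead of this style of enforcement can be so small.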

Now, noninterference is a good starting point, but it is too rigid: real systems need more control over where and when information flows. For example, the kernel doesn’t want user processes to have the full power to read its memory, but a kernel often releases information to user processes that is computed using the contents of its memory. Such an information release violates noninterference.

In fact, how to specify the information security requirements of real systems is still a largely open, promising area of research. In one recent result, we’ve shown how to use a 4-hyperproperty we call nonmalleable information flow to capture the interactions between secret and untrusted information, including the intentional release of secret information (declassification) and intentional use of untrusted information (endorsement). Since it’s a 4-hyperproperty, it involves comparing sets of four executions. Makes a 2-hyperproperty like noninterference look simple! Fortunately, nonmalleable information flow can also be enforced at the language level.

Lessons for the future

The Meltdown/Spectre attacks are a flashing warning sign that our existing science of security and correctness is not quite up to the task. Standard notions of correctness are subject to refinement attacks, and standard ways of designing hardware lead to dangerous side channels. But I certainly would disagree with claims that these attacks mark “the end of langsec”; if anything, a particular flavor of language-based security seems like the solution. Much of the necessary intellectual capital to solve these problems exists already but has not seen enough use.

But I don’t want to suggest that the problem is solved, either. More research is needed in order for us even to be able to specify what it really means for a complex system like an Intel processor to be secure, let alone the software that runs on it. And more research and engineering are needed to develop a new generation of hardware description languages and software languages in which we can build computing systems with confidence that they are secure.


Credits. Our efforts on secure hardware have been a collaboration with my colleague Ed Suh at Cornell, and the real work was done by some amazing students: Danfeng Zhang (now teaching at Penn State), Yao Wang, Andrew Ferraiuolo, and Rui Xu. The work on hyperproperties for information flow was done with some other terrific Cornell students: Owen Arden (now teaching at UC Santa Cruz) and Ethan Cecchetti.



3 thoughts on “Meltdown, Spectre, and why hardware can be correct yet insecure”

  1. Thanks for a very useful post. In the post you mention timing channels and automatically enforcing noninterference with SecVerilog. However, noninterference in general says nothing about timing channels; it only says that there is no flow of data from high-security values to low-security values.

    Besides noninterference, I think we need a way to reason about the execution time of instructions (resource usage in general) and require that every pair of executions that differ only in secret values (e.g., branching on a secret value) consume the same amount of time (cycles). We can do that with a resource-bound analysis at the language level.

    1. Although some statements of noninterference focus on low-security values, noninterference as originally defined by Goguen and Meseguer is actually more general; it talks about the influence on low-security _output events_. As long as we build it into our formal model that time is considered to be an output of the system, noninterference captures information leaking via time. Similarly, we can capture resource usage channels by considering the amount of resources consumed to be an observable output.
