Andrew Myers

Security, programming languages, and computer systems

A pet peeve about hash tables

The hash table is a wonderful data structure. Unfortunately, no one wraps it in the right abstraction. Typically, hash table implementations do some hashing internally, which is insufficient unless you’re hashing pointers, and wastes time if you’re already providing a good hash function.

But how do you know if your hash function is good enough? Hash-table implementations make it hard to diagnose when a cheap hash function is causing clustering, because set and map abstractions hide the bucket array from the user. This leads to a modest proposal: all set and map abstractions wrapped around a hash table should provide a method for measuring clustering. When I teach hash tables, I suggest one reasonable way to measure clustering.
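The post doesn’t spell out the metric, but one plausible statistic (an assumption on my part, not necessarily the one taught in the course) compares the sum of squared bucket sizes against what a uniformly random hash function would produce. A minimal sketch:

```python
def clustering_estimate(bucket_sizes):
    """Estimate clustering from the sizes of a hash table's buckets.

    Normalized so that a uniformly random hash function yields ~1.0 in
    expectation; values well above 1.0 indicate keys clumping into few
    buckets, values near 0 indicate better-than-random spreading, and
    the worst case (all keys in one bucket) yields m, the bucket count.
    This is one plausible statistic, not necessarily the one from the post.
    """
    m = len(bucket_sizes)          # number of buckets
    n = sum(bucket_sizes)          # number of stored keys
    if n <= 1:
        return 1.0                 # too little data to detect clustering
    # Sum of squared bucket sizes drives expected lookup cost; for a
    # uniform hash its expectation is n + n*(n-1)/m, which normalizes to 1.
    sum_sq = sum(b * b for b in bucket_sizes)
    return (sum_sq / n - 1) * m / (n - 1)
```

Exposing bucket sizes (or this statistic directly) from a set or map abstraction would let users catch a bad hash function before it silently degrades lookups to linear time.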


Deserialization considered harmful

I’ve done a fair amount of work on persistent object systems, starting with the Thor distributed storage system and more recently, the Fabric system. For programmers, the great thing about this approach to persistence is that your persistent data looks almost exactly like your nonpersistent data: it can be packaged up into objects that don’t require any special coding style to create and modify. These systems save you the trouble of writing often tedious deserialization/unmarshaling code whose purpose is to translate from the persistent representation into the in-memory representation. Object-relational mapping (ORM) systems (e.g., Hibernate, Django) have become highly popular for the same reason: they reduce the impedance mismatch between the in-memory and persistent representations. (However, ORMs have some weaknesses in programmability and performance compared to “real” persistent object systems.)

One standard argument against persistent objects is that serializing and deserializing data cleans the data, by fixing inconsistencies or simply by discarding garbage. But this argument ignores the danger of deserialization code—in practice it’s a rich source of security vulnerabilities. Deserializers are, in essence, parsers, and they are typically written in an ad hoc style rather than with higher-level tools such as parser generators. Ad hoc parsers are very tricky to get right, especially when the parsed data might be influenced by an adversary, and especially dangerous when written in a type-unsafe language.
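To make the trickiness concrete, here is a minimal sketch of what careful hand-written deserialization requires, using a hypothetical length-prefixed record format (the format and function name are illustrative, not from the post). The key discipline is that every field read from untrusted input must be validated before it is used:

```python
import struct

def read_record(buf: bytes) -> tuple[bytes, bytes]:
    """Parse one length-prefixed record: a 4-byte big-endian length,
    followed by that many payload bytes.

    Returns (payload, remaining_bytes). The attacker-controlled length
    field is checked against the actual buffer before any slicing --
    exactly the kind of check ad hoc parsers often omit, which in a
    type-unsafe language becomes a buffer overread or worse.
    """
    if len(buf) < 4:
        raise ValueError("truncated length field")
    (length,) = struct.unpack(">I", buf[:4])
    if length > len(buf) - 4:
        raise ValueError("length exceeds available data")
    return buf[4 : 4 + length], buf[4 + length :]
```

In a memory-safe language a missing check like this raises an exception or yields garbage; in C or C++ it can hand the adversary out-of-bounds memory, which is why deserializers written at a low level of abstraction are such a rich vulnerability source.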

This is why I’ve always believed that to get to secure computing systems, it is essential that we raise the level of abstraction so programmers avoid, to the extent possible, writing code whose purpose is simply to transform data between different representations. Simpler code is easier to reason about and to make secure, and eliminating data transformations avoids opportunities to insert security vulnerabilities.


Java vs. OCaml vs. Scala

I just inadvertently ran a poorly controlled experiment on the relative virtues of these three programming languages, at least for the job of writing compilers. Despite the nonexistent experimental protocol, I thought the results were pretty interesting—even if they may irritate some of my PL colleagues.

In my compilers course at Cornell (CS 4120), I let the student project groups choose whatever language they wanted to implement a complete optimizing compiler that translates an object-oriented language into x86-64 assembly code.

There were 26 project groups. Most of them used Java as the implementation language, following our suggestion. But there were a few bold and enthusiastic groups that chose other languages. Three groups used OCaml and two groups used Scala. The support for Java was slightly better—they were able to use our latest parser generator tech—but I don’t think this was a huge leg up.

Interestingly, the choice of language did not seem to be a big factor. The four groups whose compilers did best on our fairly extensive suite of tests all used Java. I also measured the size of the compiler code in non-comment tokens, since that seems like a better measure of code size than characters or lines. Across all project groups, the average length of the compiler code was 93k tokens, and the four top groups wrote compiler code comprising an average of 97k tokens. (Some of them implemented extra features such as optional optimizations, so it’s not surprising they were a bit longer on average.) On the other hand, the OCaml groups wrote 90k tokens on average; the Scala groups were a bit shorter at 74k tokens. Not the big difference some people might expect.
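Counting non-comment tokens is straightforward to automate. The projects were in Java, OCaml, and Scala; as an illustrative stand-in, here is how the same measurement looks for Python source using the standard-library tokenizer (this sketch is my own, not the script used in the course):

```python
import io
import tokenize

def count_noncomment_tokens(source: str) -> int:
    """Count the tokens in Python source text, excluding comments and
    pure-formatting tokens, so the measure is insensitive to line
    breaks, indentation, and comment density."""
    skip = {
        tokenize.COMMENT,    # comments don't count toward code size
        tokenize.NL,         # non-logical line breaks
        tokenize.NEWLINE,    # logical line breaks
        tokenize.INDENT,     # layout only
        tokenize.DEDENT,
        tokenize.ENDMARKER,
    }
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in skip)
```

For Java, OCaml, or Scala one would substitute the corresponding lexer, but the idea is the same: a token count survives reformatting and comment style, which character or line counts do not.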

Obviously, there are a lot of confounding factors here: maybe students have more experience with Java, though almost everyone in the class had taken an OCaml programming course. Perhaps the better IDE support for Java makes a big difference. On the other hand, we might expect that students bold enough to use a different language are stronger programmers on average.

But for all the flak that Java takes, it seems to have served the students in the course well.

Limits of Heroism

There has been much nice work lately on proving that complex software is written correctly, including components like operating systems and compilers. But it’s hard to see how to scale these heroic efforts to the vast amount of software that is being written every day. It’s simply too hard to build software that is correct and secure. Much of the problem arises because of the low level of abstraction at which software is written. We’re fighting a war on which (not to be dramatic!) the future of civilization may depend, and right now we’re only winning scattered battles. The problem is that we’re fighting this war on the wrong terrain—such as the terrain of C and C++ code, where even the most basic security and correctness guarantees are absent. We need new language design research that moves the field of engagement to more favorable terrain, because your strategy for winning a war can’t rely on the existence of heroes.

Strategic voting and the Republican primary

I’ve been interested in voting methods (algorithms for deciding who wins an election) for some time. The standard voting method (plurality) has long been criticized for being subject to vote splitting and other anomalies that cause the results of an election not to correctly represent the consensus opinion of the electorate. From the polling, the problems with plurality voting seem to be manifesting in the current Republican primary. Donald Trump is winning a plurality of the voters, yet when Republican primary voters are presented with a head-to-head, 2-person choice between Donald Trump and any one of the other top candidates, the other candidate wins. It would seem wrong to choose Trump as the candidate if most voters would prefer, say, Ted Cruz over Trump. But that is the outcome produced by plurality voting when a minority of voters prefer Trump and the votes of the remaining voters are split across three candidates. More generally, it seems problematic to choose as the winner any candidate who would lose in a head-to-head contest with another candidate — there is a majority of voters who think that second candidate is better, so most voters will be upset by the choice.

The Marquis de Condorcet identified at least a partial solution to this problem more than two centuries ago. A voting system should elect the Condorcet winner if there is one: that is, the candidate who wins pairwise head-to-head contests against every other candidate. There cannot be a majority of the voters who object to a Condorcet winner. Condorcet voting methods are those that always elect the Condorcet winner (if there is one, which there usually is).
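Finding the Condorcet winner from a set of ranked ballots is simple to sketch. The code below assumes complete rankings for brevity (real systems like CIVS handle partial ballots and cycles with more sophisticated methods):

```python
def condorcet_winner(ballots):
    """Return the Condorcet winner from ranked ballots, or None if there
    isn't one (i.e., the pairwise contests form a cycle).

    Each ballot is a list of candidates ordered from most to least
    preferred; complete rankings are assumed for simplicity.
    """
    candidates = set(ballots[0])

    def beats(a, b):
        # a beats b if a strict majority of ballots rank a above b.
        a_wins = sum(1 for ballot in ballots
                     if ballot.index(a) < ballot.index(b))
        return a_wins > len(ballots) / 2

    for c in candidates:
        if all(beats(c, other) for other in candidates if other != c):
            return c
    return None
```

On the post’s scenario, with 4 of 10 voters ranking Trump first and the rest split between Cruz and Rubio (numbers illustrative), plurality elects Trump, but Cruz beats both rivals head-to-head and is the Condorcet winner.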

That’s why I implemented the first usable online Condorcet voting system, CIVS, more than 10 years ago. It gets a fair amount of use, both by open-source software organizations like Ubuntu, the Linux Foundation, and OpenStack, and also by a number of universities for their own governance. CIVS is not a perfect system — there are still deep security issues to be addressed in online voting — but more than a decade of experience with Condorcet voting suggests it actually works very well.

Attack in depth

Current computer systems are built in layers. The hardware increasingly has features to support security. Then a hypervisor/VMM is wrapped around the hardware. A guest OS runs on top of the hypervisor. Complex libraries use the OS. And finally the application sits on top of this whole stack. The current approach to building secure systems is to try to make each layer more solid and secure than the one on top. So we’re seeing significant effort going into hardware verification and verification of VMMs and OSs.

I see two problems with this. First, the more layers you have in your system, the more things that can fail and that can be attacked. Second, the real assets we want to protect are at the application layer, so it’s typically not necessary to subvert the machine all the way to the bottom of the stack. Attack any layer successfully, and you defeat every layer above. So computing platforms may offer security that is more like “attack in depth” than “defense in depth”.

I think all the ongoing work on building verifiably correct and secure systems is great stuff. But the ability to build particular existing layers of software securely doesn’t absolve us of thinking about the overall system architecture and the security that is offered when all the components are put together.

The logic deficit

The dream of building provably correct software seems to be coming closer to reality. It is very cool to see all the recent work in the systems community on building provably correct systems components.

At the same time, I’m worried that the training of software developers actually involves less formal logic compared to a generation ago. There is so much more computer science to teach in college, and formal logic has been squeezed into a smaller fraction of the curriculum. That fact doesn’t bode well for future adoption of formal methods to build more reliable, secure software. Maybe there will be enough highly trained developers to build core systems components (e.g., hypervisors) to a high standard, but many security vulnerabilities are at the application level.