Abstract: Surrogate programming, the act of replacing programs with surrogate models of their behavior, is increasingly being leveraged to solve software development challenges. Surrogates are typically machine learning models trained on input-output examples of the program under consideration. With surrogate compilation, programmers train a surrogate that replicates the behavior of the original program and deploy it to end-users in place of the original program, with the goal of improving the program's performance. With surrogate adaptation, programmers first train a surrogate of a program, then continue to train the surrogate on data from a downstream task, with the goal of improving the surrogate's accuracy on the downstream task. With surrogate optimization, programmers train a surrogate of a program, then use the surrogate to optimize the program's inputs, with the goal of optimizing inputs more efficiently than with the original program.
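To make the third pattern concrete, the following is a minimal sketch of surrogate optimization on a toy black-box program (a hypothetical illustration, not an artifact of the thesis): a least-squares quadratic fit stands in for a trained neural surrogate, and gradient descent on the differentiable surrogate stands in for input optimization.

```python
import numpy as np

def program(x):
    # Black-box program whose output we want to minimize; its true
    # minimizer is x = 3. (Toy stand-in for a real, expensive program.)
    return (x - 3.0) ** 2 + 1.0

# 1. Collect input-output examples of the program.
xs = [i * 0.5 for i in range(-10, 21)]
ys = [program(x) for x in xs]

# 2. "Train" a surrogate on those examples: here, a quadratic
#    least-squares fit plays the role of a neural network.
a, b, c = np.polyfit(xs, ys, 2)

# 3. Optimize the input against the surrogate rather than the program,
#    using gradient descent on the (differentiable) surrogate.
x = 0.0
for _ in range(200):
    grad = 2 * a * x + b  # derivative of the surrogate a*x^2 + b*x + c
    x -= 0.05 * grad

print(round(x, 2))  # converges to the surrogate's minimizer, near x = 3
```

Because the surrogate is cheap to evaluate and differentiable, each optimization step avoids re-running the original program.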
These emerging design patterns represent an important new frontier of software development. However, we lack a coherent understanding of the applications and methodology underlying surrogate programming.
In this thesis I investigate three hypotheses about surrogate programming: that surrogate programming can achieve state-of-the-art results on complex, large-scale programming tasks; that surrogate programming consists of a small set of methodologically distinct design patterns, unifying uses of surrogates in the literature; and that we can guide surrogate architecture and training using facts derived from the modeled program to train surrogates more efficiently and achieve better performance on downstream tasks.
To argue these hypotheses, I present four sets of contributions. First, I study the applications of surrogate programming. I present DiffTune, a surrogate-optimization-based approach to tuning the parameters of a large-scale CPU simulator in LLVM. I then generalize this approach and identify, define, and formalize the three design patterns above, and discuss the unified methodology underlying them. Second, I study the methodology of creating a surrogate of a program, showing that we can use the program's source code to guide the architecture and training of the surrogate. I present Renamer, a class of neural network architectures that mirrors a common property of programs: invariance to variable renaming in program inputs. I then present Turaco, a program analysis that identifies the complexity of learning the function applied to each region of a program's input space, along with a method for using this information to guide training data sampling for the surrogate.
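The renaming-invariance property that Renamer mirrors can be illustrated with a small sketch (a hypothetical illustration, not the Renamer architecture itself): canonicalizing variable names before they reach a model guarantees that two programs differing only in their choice of names map to the same model input.

```python
def canonicalize(tokens):
    """Replace each distinct identifier with a positional name (v0, v1, ...),
    so that alpha-renamed programs produce identical token sequences."""
    mapping = {}
    out = []
    for t in tokens:
        if t.isidentifier():
            if t not in mapping:
                mapping[t] = f"v{len(mapping)}"
            out.append(mapping[t])
        else:
            out.append(t)
    return out

# Two programs that differ only by variable renaming...
p1 = ["x", "=", "x", "+", "y"]
p2 = ["a", "=", "a", "+", "b"]

# ...canonicalize to the same sequence, so any model consuming the
# canonical form is invariant to the renaming by construction.
print(canonicalize(p1) == canonicalize(p2))  # True
```

A model that bakes this invariance into its architecture, rather than preprocessing, can exploit the same property without an explicit canonicalization pass.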
Surrogate programming has the potential to change how developers program large-scale computer systems, by delegating much of their complexity to machine learning algorithms. Together, the contributions in my thesis lay the groundwork for a principled understanding of the applications and methodology of surrogate programming.
Thesis Committee: Armando Solar-Lezama, Martin Rinard, Michael Carbin