Seminar Series

## July 23

Add to Calendar 2019-07-23 15:00:00 2019-07-23 16:00:00 America/New_York Speculative Concurrency for Ethereum Smart Contracts Modern cryptocurrency systems, such as Ethereum, permit complex financial transactions through scripts called smart contracts. These smart contracts are executed many, many times, always without real concurrency. Serial execution limits system throughput and fails to exploit today's concurrent multicore and cluster architectures.This talk presents a novel way to permit miners and validators to execute smart contracts in parallel, based on techniques adapted from software transactional memory. Miners execute smart contracts speculatively in parallel, "discovering" a serializable concurrent schedule for a block's transactions, This schedule is captured and encoded as a deterministic fork-join program used by validators to re-execute the miner's parallel schedule deterministically but concurrently.We examined historical data to estimate the potential benefit of speculative techniques for executing Ethereum smart contracts in parallel. We find that our speculative technique yields estimated speed-ups starting at about 8-fold in 2016, declining to about 2-fold at the end of 2017, where speed-up is measured using either gas costs or instruction counts. We also observe that a small set of contracts are responsible for many data conflicts resulting from speculative concurrent execution.Joint work with Thomas Dickerson, Paul Gazzillo, Eric Koskinen, and Vikram Saraph.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

## July 30

Add to Calendar 2019-07-30 15:00:00 2019-07-30 16:00:00 America/New_York Advances in Determinacy Race Detection for Task-Parallel Code Abstract: Task parallelism is designed to simplify the job of writing parallel code that can utilize the multicore hardware efficiently. With task parallelism, the programmer expresses the logical parallelism of the computation using high-level parallel control constructs, and lets the the underlying runtime system automates the necessary scheduling and synchronizations. Even with task parallelism, writing correct and efficient parallel code can still be challenging.One of the challenges is to deal with determinacy races, which occur when logically parallel parts of the computation access the same memory location and at least one of the accesses is a write. Determinacy races are generally bugs in the program since they lead to non-deterministic program behavior --- different schedules of the program can lead to different results. Moreover, they are difficult to detect and debug, since a race may manifest itself in one run but not in another.In this talk, I will discuss recent advances in detecting determinacy races for task-parallel code. Most prior work on detecting determinacy races has focused on fork-join parallelism, which generates series-parallel dags, a class of computations that is well structured. We will discuss race detection algorithms that can race detect programs with more general structures.Bio: I-Ting Angelina Lee is an assistant professor in the Computer Science and Engineering department in Washington University in St. Louis. Prior to that, she worked with the Supertech research group in Massachusetts Institute of Technology lead by Professor Charles Leiserson for her graduate study and subsequently as a postdoctoral associate.Dr. Lee's research focuses on advancing software technologies for parallel computing. She is interested in many aspects of parallel computing, including designing programming models and linguistic constructs to simplify parallel programming, developing runtime and operating system support to execute multithreaded programs efficiently, and building software tools to aid debugging and performance engineering of multithreaded code.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

## Optimal system settings: How to not lose before you begin

MIT Lincoln Laboratory

#### Part Of

Add to Calendar 2019-08-06 15:00:00 2019-08-06 16:00:00 America/New_York Optimal system settings: How to not lose before you begin Abstract: Most default system configurations (physical and virtual) are set to maximize compatability and maintainability. In such configurations, good performance may be impossible. High performance system settings are rarely documented. In this talk, we will describe how simple benchmarking can be used to determine the optimal settings that are a precondition to getting high performance. Reference: https://arxiv.org/abs/1907.03195 Bio: Dr. Jeremy Kepner is Head and Founder of the MIT Lincoln Laboratory Supercomputing Center (LLSC). He is also a Founder of the MIT-Air Force AI Accelerator. Lincoln Laboratory is a 4000-person National Laboratory whose mission is to create defensive technologies to protect our Nation and the freedoms enshrined in the Constitution of the United States. Dr. Kepner is one of five Lincoln Laboratory Fellows, a position that "recognizes the Laboratory's strongest technical talent for outstanding contributions to Laboratory programs over many years." Dr. Kepner is focused on creating state-of-the-art computing capabilities to enable scientific and engineering breakthroughs to solve the Nation's most challenging problems. He received his PhD in Astrophysics from Princetion University in 1998 (advisor: Prof. David Spergel) and his BA in Astrophysics from Pomona College in 1991.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

## August 20

Add to Calendar 2019-08-20 15:00:00 2019-08-20 16:00:00 America/New_York Tapir: Embedding Recursive Fork-Join Parallelism into LLVM&#039;s Intermediate Representation Abstract: Tapir is a compiler intermediate representation (IR) that embeds recursive fork-join parallelism, as supported by task-parallel programming platforms such as Cilk and OpenMP, into a mainstream compiler's IR. Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations on and across parallel control constructs. Remedying this situation has generally been thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir leverages the "serial-projection property," which is commonly satisfied by task-parallel programs, to handle the semantics of these programs without an extensive rework of the compiler. For recursive fork-join programs that satisfy the serial-projection property, Tapir enables effective compiler optimization of parallel programs with only minor changes to existing compiler analyses and code transformations. Tapir uses the serial-projection property to order logically parallel fine-grained tasks in the program's control-flow graph. This ordered representation of parallel tasks allows the compiler to optimize parallel codes effectively with only minor modifications. For example, to implement Tapir/LLVM, a prototype of Tapir in the LLVM compiler, we added or modified less than 3000 lines of LLVM's half-million-line core middle-end functionality. These changes suffice to enable LLVM's existing compiler optimizations for serial code to work with parallel control constructs such as parallel loops and Cilk's cilk_spawn keyword. Tapir also supports parallel optimizations, such as loop scheduling and loop stripmining, that restructure the parallel control flow of the program. By making use of existing LLVM optimizations and new parallel optimizations, Tapir/LLVM can optimize recursive fork-join programs more effectively than traditional compilation methods. On a suite of 35 Cilk application benchmarks, Tapir/LLVM produces more efficient executables for 30 benchmarks, with faster 18-core running times for 26 of them, compared to a nearly identical compiler that compiles parallel linguistic constructs the traditional way. In addition, by integrating Tapir/LLVM into the Accelerated Linear Algebra (XLA) compiler in Google's TensorFlow machine-learning framework, Tapir/LLVM enables more effective compiler optimization of a variety of neural networks written in TensorFlow, improving the parallel performance of these networks by a geometric-mean multiplicative factor of 30%-100% across a variety of CPU hardware architectures. Bio: Tao B. Schardl is a Research Scientist in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research combines algorithms and systems to develop technologies that support principled, scientific approaches to writing fast code. He has previously worked on parallel programming models, theories of performance, diagnostic tools, and compilers to simplify the task of software performance engineering. His work on the Tapir/LLVM compiler earned the best paper award at the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming in 2017. Dr. Schardl received his S.B. in Computer Science and Electrical Engineering from MIT in 2009, his M.Eng. in Computer Science and Electrical Engineering from MIT in 2010, and his Ph.D. in Computer Science and Engineering from MIT in 2016.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

## August 27

Add to Calendar 2019-08-27 15:00:00 2019-08-27 16:00:00 America/New_York Algorithms and Systems for Processing Massive Static and Evolving Graphs Abstract: Today, the largest publicly available graph dataset is the Hyperlink Web graph, with over 3.5 billion vertices and 128 billion edges. Despite being publicly available for several years now, most experimental work in the literature reports results on much smaller graphs, or resorts to external or distributed memory to process the Hyperlink graph. A natural question then, is to ask whether we can efficiently solve a broad class of problems on this graph in memory. Given that existing systems capable of processing this graph only support static processing, another interesting question is whether we can efficiently represent and update this graph in memory. In the first part of this talk I will present theoretically efficient implementations of parallel graph algorithms for a set of 20 important graph problems. Our codes, which we have publicly open-sourced as part of the Graph Based Benchmark Suite (GBBS) solve these problems on the Hyperlink Web graph using a single commodity multicore machine with a terabyte of RAM, with most of the codes running in under a minute on this graph. The running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs, and for many of the problems that we consider, this is the first time they have been solved on graphs at this scale. In the second part of this talk I will present Aspen, a framework for processing streaming graphs, which are graphs that evolve over time via a stream of updates (batches of edge insertions/deletions). The key idea is to represent the adjacency information using a purely-functional compressed tree data structure called a C-tree, which can be efficiently updated. Aspen achieves millisecond latency for processing small batches, and achieves throughputs of hundreds of millions of updates per second for very large batches, while providing strict serializability for both queries and updates. Using Aspen, we are able to concurrently update and run analytics on the Hyperlink Web graph using a single commodity multicore server with a terabyte of RAM. Based on joint work with Guy Blelloch and Julian Shun. Bio: Laxman is a fifth year PhD student at CMU, where he is fortunate to be advised by Guy Blelloch. His graduate work has focused mostly on designing fast and theoretically efficient parallel graph algorithms.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

## Domain-Specific Accelerators

NVIDIA Corporation and Stanford University

#### Part Of

Add to Calendar 2019-09-16 14:15:00 2019-09-16 15:15:00 America/New_York Domain-Specific Accelerators Abstract: Increasing computing performance enables new applications and greater value from computing. With the end of Moore's Law and Dennard Scaling, continued performance scaling will come primarily from specialization. Specialized hardware engines can achieve performance and efficiency from 10x to 10,000x a CPU through specialization, parallelism, and optimized memory access. Graphics processing units are an ideal platform on which to build domain-specific accelerators. They provide very efficient, high performance communication and memory subsystems - which are needed by all domains. Specialization is provided via "cores", such as tensor cores that accelerate deep learning or ray-tracing cores that accelerate specific applications. This talk will describe some common characteristics of domain-specific accelerators via case studies. Bio: Bill Dally is Chief Scientist and Senior Vice President of Research at NVIDIA Corporation and a Professor (Research) and former chair of Computer Science at Stanford University. Bill is currently working on developing hardware and software to accelerate demanding applications including machine learning, bioinformatics, and logical inference. He has a history of designing innovative and efficient experimental computing systems. While at Bell Labs Bill contributed to the BELLMAC32 microprocessor and designed the MARS hardware accelerator. At Caltech he designed the MOSSIM Simulation Engine and the Torus Routing Chip which pioneered wormhole routing and virtual-channel flow control. At the Massachusetts Institute of Technology his group built the J-Machine and the M- Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low overhead synchronization and communication mechanisms. At Stanford University his group developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations, the Merrimac supercomputer, which led to GPU computing, and the ELM low-power processor. Bill is a Member of the National Academy of Engineering, a Fellow of the IEEE, a Fellow of the ACM, and a Fellow of the American Academy of Arts and Sciences. He has received the ACM Eckert-Mauchly Award, the IEEE Seymour Cray Award, the ACM Maurice Wilkes award, the IEEE-CS Charles Babbage Award, and the IPSJ FUNAI Achievement Award. He currently leads projects on computer architecture, network architecture, circuit design, and programming systems. He has published over 250 papers in these areas, holds over 160 issued patents, and is an author of the textbooks, Digital Design: A Systems Approach, Digital Systems Engineering, and Principles and Practices of Interconnection Networks.Livestream: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

## September 23

Add to Calendar 2019-09-23 14:00:00 2019-09-23 15:00:00 America/New_York Tiramisu: A Polyhedral Compiler for Dense and Sparse Deep Learning Abstract: Tiramisu is a polyhedral compiler for deep learning. It has two unique features: (1) it is the first sparse DNN compiler; and (2) it can express and optimize general RNNs (Recurrent Neural Networks). Tiramisu relies on a flexible representation based on the polyhedral model and has a rich scheduling language allowing fine-grained control of optimizations. We show that Tiramisu matches or outperforms Intel MKL-DNN by up to 5x for sparse DNNs. We also show that Tiramisu allows many new capabilities such as wavefront parallelization for RNNs.Bio: Riyadh Baghdadi is a postdoctoral associate at CSAIL. He got his PhD from ENS Paris. He works on the area of compiler optimization and code generation for high performance architectures. Recently, he has been interested in exploring the intersection of compilers and deep learning. 32-D463 (Star) Belfer sarah_donahue@hks.harvard.edu

## Julia: Making dynamic programs run fast

Valentin Churavy
MIT CSAIL

#### Part Of

Add to Calendar 2019-09-30 14:00:00 2019-09-30 15:00:00 America/New_York Julia: Making dynamic programs run fast Abstract: Julia is a general purpose language that utilizes type inference combined with a Just-In-Time (JIT) compiler, built upon LLVM, to achieve C-like performance while retaining the dynamic semantics of interactive languages like MATLAB, Python, and R. This talk will briefly introduce Julia, explore how it is possible for a dynamic language to be fast and to how Julia is making use of LLVM. We will focus on interacting with the compiler infrastructure and explore how to Julia targets accelerators like NVIDIA GPUs.Bio: Valentin Churavy is a PhD student at CSAIL. His research at the JuliaLab focuses on heterogeneous distributed computing for high-performance computing. He is interested in human-compiler interaction and how to help people write fast code in a high-level dynamic language. 32-D463 (Star) Belfer sarah_donahue@hks.harvard.edu

## SplinterDB: Closing the Bandwidth Gap on NVMe

Rutgers University

#### Part Of

Add to Calendar 2019-10-21 14:00:00 2019-10-21 15:00:00 America/New_York SplinterDB: Closing the Bandwidth Gap on NVMe Abstract: Key-value stores are a fundamental component to storage systems. Recent advances in storage hardware, especially the advent of NVMe devices, expose new bottlenecks in the design of current key-value stores. We introduce SplinterDB and show how it achieves a 6-9x improvement in the insertion rate over Facebook's RocksDB on NVMe devices, while also improving query performance by 2-2.5x.Bio: Alex Conway is a PhD student in Computer Science at Rutgers University, advised by Martin Farach-Colton. His research is focused on external memory theory and storage systems. His recent publications include work on high performance key-value stores, file system aging, external memory hash tables and the analysis of storage optimization heuristics.He is a member of the BetrFS team. BetrFS, a prototype file system, is designed around the principles of write-optimization and is built using variants of B\epsilon-trees. G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

## November 04

Add to Calendar 2019-11-04 14:00:00 2019-11-04 15:00:00 America/New_York GraphIt: A Domain-Specific Language for Writing High-Performance Graph Applications Abstract: In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. There is a strong demand for a programming system that allows domain experts to easily write applications that run quickly on these large graphs. However, programming high-performance graph applications is a challenging problem. The performance bottlenecks of graph applications depend on the algorithm, the size and structure of the graph, and the underlying hardware. There is not a set of optimizations that work well across all algorithms, graphs, and hardwares. Additionally, domain experts are often not familiar with low-level implementation details, such as atomic instructions and bit manipulations, needed for today's high-performance graph frameworks.In this talk, I will present our work on GraphIt, a new domain-specific language (DSL) that offers an easy-to-use programming model, achieves consistent high-performance across different algorithms and graphs, and supports both unordered and ordered graph algorithms. GraphIt decouples algorithms from performance optimizations (schedules) for graph applications to make it easy to explore a large space of performance optimizations. The compiler leverages a new graph iteration space model to generate fast code for different combinations of optimizations. We also introduce new priority-based extensions to support ordered graph algorithms and are working on code generation for GPU architectures.Bio: Yunming Zhang is a sixth year PhD student at MIT working with Saman Amarasinghe and Julian Shun. He works on domain-specific compilers, languages and performance optimizations for large-scale data analytics applications. 32-G449 (Kiva) Belfer sarah_donahue@hks.harvard.edu

## How to Modernize Compiler Technology

MIT CSAIL

#### Part Of

Add to Calendar 2019-11-18 14:00:00 2019-11-18 15:00:00 America/New_York How to Modernize Compiler Technology Abstract: In this talk, I will show the path to the modernization of one important compiler technique -- vectorization. Vectorization was first introduced in the era of Cray vector processors during the 1980's. While vector supercomputers needed large vectors, which are mainly available by parallelizing loops, modern SIMD instructions efficiently work on short vectors. Thus, in 2000, Larsen et. al. introduced a new technique known as Superword Level Parallelism (SLP) based vectorization to fill this gap. First, I will show how we can take advantage of the power of modern computers for compilation, by using more accurate but expensive techniques to improve SLP vectorization. Due to the hardware resource constraints of the era, like many other compiler optimizations, SLP implementation was a greedy algorithm. In 2018, we introduced goSLP, which uses integer linear programming to find an optimal instruction packing strategy. Next, I will show how to truly modernize a compiler by automatically learning the necessary components of the compiler with Ithemal and Vemal. The actual cost of execution in a modern out-of-order, superscalar processor is much more complex than a linear model. Manually building such cost models as well as manually developing compiler optimizations is costly, tedious and is hard to keep up with the architectural changes. Ithemal is the first learnt cost model for predicting the throughput of x86 basic blocks with minimal human effort. Vemal is the first learnt policy for end-to-end vectorization as opposed to tuning heuristics. These data-driven techniques achieve state-of-the-art results while also reducing the development and maintenance burden of the compiler developer. 32-G449 Kiva/Patil Belfer sarah_donahue@hks.harvard.edu

## A Speed-of-Light Internet Service Provider

Duke University and Emerald Innovations

## Program Optimization for Machine Learning

Stanford University

#### Part Of

Add to Calendar 2020-05-11 14:00:00 2020-05-11 15:00:00 America/New_York Program Optimization for Machine Learning Abstract: Training deep neural networks (DNNs) can be expensive and slow, consuming enormous numbers of compute-hours on parallel machines. This talk will present results on using novel search procedures over programs to reduce training time. In particular, instead of greedily applying program-improving transformations to compute a single improved program, we search a space of programs, considering many possible candidates guided by a global cost function. Application of search-based optimization to two separate problems will be discussed: improving the partitioning and distribution of training data, and reducing the execution time of the DNN computation graph. Both methods speedup training by up to a factor of 3 over current state-of-the-art systems.Bio: Alex Aiken is the Alcatel-Lucent Professor of Computer Science at Stanford. Alex received his Bachelors degree in Computer Science and Music from Bowling Green State University in 1983 and his Ph.D. from Cornell University in 1988. Alex was a Research Staff Member at the IBM Almaden Research Center (1988-1993) and a Professor in the EECS department at UC Berkeley (1993-2003) before joining the Stanford faculty in 2003. His research interest is in areas related to programming languages. He is an ACM Fellow, a recipient of ACM SIGPLAN's Programming Languages Achievement Award and Phi Beta Kappa's Teaching Award, and a former chair of the Stanford Computer Science Department. https://mit.zoom.us/j/536883569 Password 008943 Belfer sarah_donahue@hks.harvard.edu

## Dynamic Data Structures on the GPU

UC Davis

#### Part Of

Add to Calendar 2020-06-01 14:00:00 2020-06-01 15:00:00 America/New_York Dynamic Data Structures on the GPU Abstract: Dynamic data structures allow updates to the data structure without having to rebuild it completely. I'll discuss five dynamic GPU data structures that we designed and built - log-structured merge trees, quotient filters, linked lists and hash tables built atop them, dynamic graphs, and B-trees. I'll talk about principles that we followed in building them and what we learned from the experience.Bio: John Owens is a professor of electrical and computer engineering at the University of California, Davis, where he leads an amazing group of graduate students in a research program centered around parallel computing on the graphics processing unit (GPU). He earned his PhD from Stanford in 2003 and his BS from UC Berkeley in 1995. In January 2020 he bought his kids MIT sweatshirts from the Coop while helping to run the Mystery Hunt; his kids wear them often. https://mit.zoom.us/j/536883569 Password 008943 Belfer sarah_donahue@hks.harvard.edu

## Solving Global Grand Challenges with High Performance Data Analytics

New Jersey Institute of Technology

#### Part Of

Add to Calendar 2020-06-08 14:00:00 2020-06-08 15:00:00 America/New_York Solving Global Grand Challenges with High Performance Data Analytics Abstract:Data science aims to solve grand global challenges such as: detecting and preventing disease in human populations; revealing community structure in large social networks; protecting our elections from cyber-threats, and improving the resilience of the electric power grid. Unlike traditional applications in computational science and engineering, solving these social problems at scale often raises new challenges because of the sparsity and lack of locality in the data, the need for research on scalable algorithms and architectures, and development of frameworks for solving these real-world problems on high performance computers, and for improved models that capture the noise and bias inherent in the torrential data streams. In this talk, Bader will discuss the opportunities and challenges in massive data science for applications in social sciences, physical sciences, and engineering.Biography:David A. Bader is a Distinguished Professor in the Department of Computer Science at New Jersey Institute of Technology. Prior to this, he served as founding Professor and Chair of the School of Computational Science and Engineering, College of Computing, at Georgia Institute of Technology. He is a Fellow of the IEEE, AAAS, and SIAM, and advises the White House, most recently on the National Strategic Computing Initiative (NSCI). Bader serves on the leadership team of Northeast Big Data Innovation Hub as the inaugural chair of the Seed Fund Steering Committee. Dr. Bader is a leading expert in solving global grand challenges in science, engineering, computing, and data science. His interests are at the intersection of high-performance computing and real-world applications, including cybersecurity, massive-scale analytics, and computational genomics, and he has co-authored over 250 scholarly papers. Dr. Bader has served as a lead scientist in several DARPA programs including High Productivity Computing Systems (HPCS) with IBM, Ubiquitous High Performance Computing (UHPC) with NVIDIA, Anomaly Detection at Multiple Scales (ADAMS), Power Efficiency Revolution For Embedded Computing Technologies (PERFECT), Hierarchical Identify Verify Exploit (HIVE), and Software-Defined Hardware (SDH). Bader is Editor-in-Chief of the ACM Transactions on Parallel Computing, and will serve as General Co-Chair of IPDPS 2021. He has also served as Director of the Sony-Toshiba-IBM Center of Competence for the Cell Broadband Engine Processor. Bader is a cofounder of the Graph500 List for benchmarking "Big Data" computing platforms. Bader is recognized as a "RockStar" of High Performance Computing by InsideHPC and as HPCwire's People to Watch in 2012 and 2014. Recently, Bader received an NVIDIA AI Lab (NVAIL) award (2019), and a Facebook Research AI Hardware/Software Co-Design award (2019). https://mit.zoom.us/j/536883569 Password 008943 Belfer sarah_donahue@hks.harvard.edu