July 23

Speculative Concurrency for Ethereum Smart Contracts
3:00–4:00 PM, G32-449 (Patil/Kiva)

Abstract: Modern cryptocurrency systems, such as Ethereum, permit complex financial transactions through scripts called smart contracts. These smart contracts are executed many, many times, always without real concurrency. Serial execution limits system throughput and fails to exploit today's concurrent multicore and cluster architectures.

This talk presents a novel way to permit miners and validators to execute smart contracts in parallel, based on techniques adapted from software transactional memory. Miners execute smart contracts speculatively in parallel, "discovering" a serializable concurrent schedule for a block's transactions. This schedule is captured and encoded as a deterministic fork-join program that validators use to re-execute the miner's parallel schedule deterministically but concurrently.

We examined historical data to estimate the potential benefit of speculative techniques for executing Ethereum smart contracts in parallel. We find that our speculative technique yields estimated speed-ups starting at about 8-fold in 2016, declining to about 2-fold at the end of 2017, where speed-up is measured using either gas costs or instruction counts. We also observe that a small set of contracts is responsible for many of the data conflicts resulting from speculative concurrent execution.

Joint work with Thomas Dickerson, Paul Gazzillo, Eric Koskinen, and Vikram Saraph.

Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live
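The miner-side speculation described in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the paper's implementation: every name here (Txn, speculate, the claim protocol) is invented for the sketch. Transactions claim each storage key before touching it; a transaction that loses a claim aborts to a serial cleanup phase; the commit order recorded by the winners is the serializable schedule a validator could later replay.

```cpp
#include <array>
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kKeys = 8;

struct Txn {
    int id;
    std::vector<int> keys;  // storage keys this "contract call" touches
};

std::array<std::atomic<int>, kKeys> owner;  // which txn claimed a key; -1 = free
std::vector<int> store(kKeys, 0);           // toy contract storage
std::mutex mu;                              // guards schedule and aborted
std::vector<int> schedule;                  // discovered serializable order
std::vector<int> aborted;                   // conflicting txns, redone serially

void run_effect(const Txn& t) {             // stand-in for contract code
    for (int k : t.keys) store[k] += t.id + 1;
}

void speculate(const Txn& t) {
    std::vector<int> claimed;
    for (int k : t.keys) {
        int expect = -1;
        if (!owner[k].compare_exchange_strong(expect, t.id)) {
            for (int c : claimed) owner[c].store(-1);  // conflict: release, abort
            std::lock_guard<std::mutex> g(mu);
            aborted.push_back(t.id);
            return;
        }
        claimed.push_back(k);
    }
    run_effect(t);                          // exclusive owner of every key it needs
    std::lock_guard<std::mutex> g(mu);
    schedule.push_back(t.id);               // commit: record position in schedule
    // Claims are held until the end of the block for simplicity.
}

int main() {
    for (auto& o : owner) o.store(-1);
    std::vector<Txn> block = {{0, {1, 2}}, {1, {2, 3}}, {2, {5}}, {3, {6, 7}}};
    std::vector<std::thread> miners;
    for (auto& t : block) miners.emplace_back(speculate, std::cref(t));
    for (auto& th : miners) th.join();
    for (int id : aborted) {                // serial phase for conflicting txns
        run_effect(block[id]);
        schedule.push_back(id);
    }
    for (int id : schedule) std::printf("txn %d committed\n", id);
}
```

In the actual system the discovered schedule is then encoded as a deterministic fork-join program for validators to replay; the sketch shows only the miner-side discovery.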

July 30

Advances in Determinacy Race Detection for Task-Parallel Code
3:00–4:00 PM, G32-449 (Patil/Kiva)

Abstract: Task parallelism is designed to simplify the job of writing parallel code that can utilize multicore hardware efficiently. With task parallelism, the programmer expresses the logical parallelism of the computation using high-level parallel control constructs and lets the underlying runtime system automate the necessary scheduling and synchronization. Even with task parallelism, writing correct and efficient parallel code can still be challenging.

One of the challenges is dealing with determinacy races, which occur when logically parallel parts of the computation access the same memory location and at least one of the accesses is a write. Determinacy races are generally bugs, since they lead to non-deterministic program behavior: different schedules of the program can lead to different results. Moreover, they are difficult to detect and debug, since a race may manifest itself in one run but not in another.

In this talk, I will discuss recent advances in detecting determinacy races in task-parallel code. Most prior work on determinacy-race detection has focused on fork-join parallelism, which generates series-parallel dags, a class of computations that is well structured. We will discuss race-detection algorithms that can handle programs with more general structures.

Bio: I-Ting Angelina Lee is an assistant professor in the Computer Science and Engineering department at Washington University in St. Louis. Prior to that, she worked with the Supertech research group at the Massachusetts Institute of Technology, led by Professor Charles Leiserson, during her graduate study and subsequently as a postdoctoral associate.

Dr. Lee's research focuses on advancing software technologies for parallel computing. She is interested in many aspects of parallel computing, including designing programming models and linguistic constructs to simplify parallel programming, developing runtime and operating system support to execute multithreaded programs efficiently, and building software tools to aid debugging and performance engineering of multithreaded code.

Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live
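As a concrete illustration of the determinacy races defined in the abstract, here is a minimal hypothetical C++ fragment in which two logically parallel tasks perform an unsynchronized read-modify-write on the same location. (In C++ this is also a data race, formally undefined behavior; it is shown only to exhibit the pattern a race detector must find.)

```cpp
#include <cstdio>
#include <thread>

int sum = 0;                    // shared memory location

void add(int x) { sum += x; }   // unsynchronized read-modify-write

int main() {
    std::thread child(add, 1);  // "fork": add(1) runs logically in parallel...
    add(2);                     // ...with the continuation add(2)
    child.join();               // "join"
    // Both parallel accesses write sum, so this is a determinacy race:
    // the result is usually 3, but an unlucky interleaving of the two
    // load-add-store sequences can yield 1 or 2.
    std::printf("sum = %d\n", sum);
}
```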

August 06

Optimal system settings: How to not lose before you begin
3:00–4:00 PM, G32-449 (Patil/Kiva)

Abstract: Most default system configurations (physical and virtual) are set to maximize compatibility and maintainability. In such configurations, good performance may be impossible. High-performance system settings are rarely documented. In this talk, we will describe how simple benchmarking can be used to determine the optimal settings that are a precondition for getting high performance.

Reference: https://arxiv.org/abs/1907.03195

Bio: Dr. Jeremy Kepner is Head and Founder of the MIT Lincoln Laboratory Supercomputing Center (LLSC). He is also a Founder of the MIT-Air Force AI Accelerator. Lincoln Laboratory is a 4000-person National Laboratory whose mission is to create defensive technologies to protect our Nation and the freedoms enshrined in the Constitution of the United States. Dr. Kepner is one of five Lincoln Laboratory Fellows, a position that "recognizes the Laboratory's strongest technical talent for outstanding contributions to Laboratory programs over many years." He is focused on creating state-of-the-art computing capabilities to enable scientific and engineering breakthroughs that solve the Nation's most challenging problems. He received his PhD in Astrophysics from Princeton University in 1998 (advisor: Prof. David Spergel) and his BA in Astrophysics from Pomona College in 1991.

Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live
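The "simple benchmarking" the abstract refers to can be as small as a loop over candidate settings. The sketch below is hypothetical and not from the referenced paper: it times one memory-bound kernel at several thread counts and reports throughput, so the best setting for a given machine can be read off before any production runs. (Warm-up runs and repeated trials are omitted for brevity.)

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Time one pass over the data using nthreads workers; return elements/second.
double sweep(int nthreads, std::vector<double>& data) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    size_t chunk = data.size() / nthreads;
    for (int i = 0; i < nthreads; ++i)
        pool.emplace_back([&, i] {           // each thread touches a disjoint slice
            for (size_t j = i * chunk; j < (i + 1) * chunk; ++j)
                data[j] = data[j] * 1.000001 + 1.0;
        });
    for (auto& t : pool) t.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return data.size() / dt.count();
}

int main() {
    std::vector<double> data(1 << 26, 1.0);  // ~512 MB working set
    for (int n : {1, 2, 4, 8, 16})           // candidate settings to compare
        std::printf("threads=%2d  %.2e elems/s\n", n, sweep(n, data));
}
```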

August 20

Tapir: Embedding Recursive Fork-Join Parallelism into LLVM's Intermediate Representation
3:00–4:00 PM, G32-449 (Patil/Kiva)

Abstract: Tapir is a compiler intermediate representation (IR) that embeds recursive fork-join parallelism, as supported by task-parallel programming platforms such as Cilk and OpenMP, into a mainstream compiler's IR. Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations on and across parallel control constructs. Remedying this situation has generally been thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics.

Tapir leverages the "serial-projection property," which is commonly satisfied by task-parallel programs, to order logically parallel fine-grained tasks in the program's control-flow graph. This ordered representation allows the compiler to optimize parallel code effectively with only minor changes to existing compiler analyses and code transformations. For example, to implement Tapir/LLVM, a prototype of Tapir in the LLVM compiler, we added or modified fewer than 3,000 lines of LLVM's half-million-line core middle-end functionality. These changes suffice to enable LLVM's existing compiler optimizations for serial code to work with parallel control constructs such as parallel loops and Cilk's cilk_spawn keyword. Tapir also supports parallel optimizations, such as loop scheduling and loop stripmining, that restructure the parallel control flow of the program.

By making use of existing LLVM optimizations and new parallel optimizations, Tapir/LLVM can optimize recursive fork-join programs more effectively than traditional compilation methods. On a suite of 35 Cilk application benchmarks, Tapir/LLVM produces more efficient executables for 30 benchmarks, with faster 18-core running times for 26 of them, compared to a nearly identical compiler that compiles parallel linguistic constructs the traditional way. In addition, by integrating Tapir/LLVM into the Accelerated Linear Algebra (XLA) compiler in Google's TensorFlow machine-learning framework, Tapir/LLVM enables more effective compiler optimization of a variety of neural networks written in TensorFlow, improving the parallel performance of these networks by a geometric-mean multiplicative factor of 30%-100% across a variety of CPU hardware architectures.

Bio: Tao B. Schardl is a Research Scientist in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research combines algorithms and systems to develop technologies that support principled, scientific approaches to writing fast code. He has previously worked on parallel programming models, theories of performance, diagnostic tools, and compilers to simplify the task of software performance engineering. His work on the Tapir/LLVM compiler earned the best paper award at the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming in 2017. Dr. Schardl received his S.B. in Computer Science and Electrical Engineering from MIT in 2009, his M.Eng. in Computer Science and Electrical Engineering from MIT in 2010, and his Ph.D. in Computer Science and Engineering from MIT in 2016.

Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live
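A small example of the serial-projection property the abstract relies on, using the Cilk keywords named above (this assumes a Cilk-enabled compiler; for instance, OpenCilk's clang with the -fopencilk flag):

```cpp
#include <cilk/cilk.h>
#include <cstdio>

long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  // child task may run in parallel...
    long y = fib(n - 2);             // ...with this continuation
    cilk_sync;                       // join: wait for the spawned child
    return x + y;
}

int main() { std::printf("fib(30) = %ld\n", fib(30)); }
```

Erasing cilk_spawn and cilk_sync leaves the ordinary serial fib with identical semantics. That serial elision is what lets Tapir place the logically parallel tasks in an ordered control-flow graph and reuse serial analyses and transformations almost unchanged.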

August 27

Algorithms and Systems for Processing Massive Static and Evolving Graphs
3:00–4:00 PM, G32-449 (Patil/Kiva)

Abstract: Today, the largest publicly available graph dataset is the Hyperlink Web graph, with over 3.5 billion vertices and 128 billion edges. Despite being publicly available for several years now, most experimental work in the literature reports results on much smaller graphs, or resorts to external or distributed memory to process the Hyperlink graph. A natural question, then, is whether we can efficiently solve a broad class of problems on this graph in memory. Given that existing systems capable of processing this graph only support static processing, another interesting question is whether we can efficiently represent and update this graph in memory.

In the first part of this talk I will present theoretically efficient implementations of parallel graph algorithms for a set of 20 important graph problems. Our codes, which we have publicly open-sourced as part of the Graph Based Benchmark Suite (GBBS), solve these problems on the Hyperlink Web graph using a single commodity multicore machine with a terabyte of RAM, with most of the codes running in under a minute on this graph. Our implementations outperform existing state-of-the-art implementations on the largest real-world graphs, and for many of the problems that we consider, this is the first time they have been solved on graphs at this scale.

In the second part of this talk I will present Aspen, a framework for processing streaming graphs, which are graphs that evolve over time via a stream of updates (batches of edge insertions/deletions). The key idea is to represent the adjacency information using a purely functional compressed tree data structure called a C-tree, which can be efficiently updated. Aspen achieves millisecond latency for processing small batches and throughputs of hundreds of millions of updates per second for very large batches, while providing strict serializability for both queries and updates. Using Aspen, we are able to concurrently update and run analytics on the Hyperlink Web graph using a single commodity multicore server with a terabyte of RAM.

Based on joint work with Guy Blelloch and Julian Shun.

Bio: Laxman is a fifth-year PhD student at CMU, where he is fortunate to be advised by Guy Blelloch. His graduate work has focused mostly on designing fast and theoretically efficient parallel graph algorithms.

Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live
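The purely functional tree idea behind Aspen's C-trees can be sketched with ordinary path copying. The sketch below is a deliberate simplification: a plain persistent binary search tree over integer keys, with no compression, chunking, or balancing, and all names are hypothetical. An insert copies only the root-to-leaf path and shares everything else, so readers holding an old root keep a consistent snapshot while writers publish new versions.

```cpp
#include <cstdio>
#include <memory>

struct Node {
    int key;
    std::shared_ptr<const Node> left, right;
};
using Tree = std::shared_ptr<const Node>;  // an immutable snapshot

Tree insert(const Tree& t, int k) {        // returns a NEW version of the tree
    if (!t) return std::make_shared<const Node>(Node{k, nullptr, nullptr});
    if (k < t->key)
        return std::make_shared<const Node>(Node{t->key, insert(t->left, k), t->right});
    if (k > t->key)
        return std::make_shared<const Node>(Node{t->key, t->left, insert(t->right, k)});
    return t;                              // already present: share unchanged
}

bool contains(const Tree& t, int k) {
    if (!t) return false;
    if (k == t->key) return true;
    return contains(k < t->key ? t->left : t->right, k);
}

int main() {
    Tree v0;                               // empty snapshot
    Tree v1 = insert(insert(v0, 3), 7);    // a "batch" of two inserts
    Tree v2 = insert(v1, 5);               // a later update
    // v1 is untouched by the v2 update, so queries on it stay consistent.
    std::printf("v1 has 5? %d   v2 has 5? %d\n", contains(v1, 5), contains(v2, 5));
}
```

This snapshot-on-update behavior is what allows queries and updates to run concurrently with strict serializability, as the abstract describes for Aspen.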

September 16

Domain-Specific Accelerators

NVIDIA Corporation and Stanford University
2:15–3:15 PM, G32-449 (Patil/Kiva)

Abstract: Increasing computing performance enables new applications and greater value from computing. With the end of Moore's Law and Dennard scaling, continued performance scaling will come primarily from specialization. Specialized hardware engines can achieve performance and efficiency from 10x to 10,000x that of a CPU through specialization, parallelism, and optimized memory access. Graphics processing units are an ideal platform on which to build domain-specific accelerators. They provide very efficient, high-performance communication and memory subsystems, which are needed by all domains. Specialization is provided via "cores," such as tensor cores that accelerate deep learning or ray-tracing cores that accelerate specific applications. This talk will describe some common characteristics of domain-specific accelerators via case studies.

Bio: Bill Dally is Chief Scientist and Senior Vice President of Research at NVIDIA Corporation and a Professor (Research) and former chair of Computer Science at Stanford University. He is currently working on developing hardware and software to accelerate demanding applications including machine learning, bioinformatics, and logical inference. He has a history of designing innovative and efficient experimental computing systems. While at Bell Labs, Bill contributed to the BELLMAC-32 microprocessor and designed the MARS hardware accelerator. At Caltech he designed the MOSSIM Simulation Engine and the Torus Routing Chip, which pioneered wormhole routing and virtual-channel flow control. At the Massachusetts Institute of Technology his group built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low overhead synchronization and communication mechanisms. At Stanford University his group developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations; the Merrimac supercomputer, which led to GPU computing; and the ELM low-power processor.

Bill is a Member of the National Academy of Engineering, a Fellow of the IEEE, a Fellow of the ACM, and a Fellow of the American Academy of Arts and Sciences. He has received the ACM Eckert-Mauchly Award, the IEEE Seymour Cray Award, the ACM Maurice Wilkes Award, the IEEE-CS Charles Babbage Award, and the IPSJ FUNAI Achievement Award. He currently leads projects on computer architecture, network architecture, circuit design, and programming systems. He has published over 250 papers in these areas, holds over 160 issued patents, and is an author of the textbooks Digital Design: A Systems Approach, Digital Systems Engineering, and Principles and Practices of Interconnection Networks.

Livestream: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live

September 23

Tiramisu: A Polyhedral Compiler for Dense and Sparse Deep Learning
2:00–3:00 PM, 32-D463 (Star)

Abstract: Tiramisu is a polyhedral compiler for deep learning. It has two unique features: (1) it is the first sparse DNN compiler; and (2) it can express and optimize general RNNs (Recurrent Neural Networks). Tiramisu relies on a flexible representation based on the polyhedral model and has a rich scheduling language that allows fine-grained control of optimizations. We show that Tiramisu matches or outperforms Intel MKL-DNN by up to 5x for sparse DNNs. We also show that Tiramisu enables many new capabilities, such as wavefront parallelization for RNNs.

Bio: Riyadh Baghdadi is a postdoctoral associate at CSAIL. He received his PhD from ENS Paris. He works in the area of compiler optimization and code generation for high-performance architectures. Recently, he has been interested in exploring the intersection of compilers and deep learning.
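To make "a rich scheduling language allowing fine-grained control of optimizations" concrete, the sketch below hand-writes the loop nest that a schedule like tile-then-parallelize would produce for a matrix multiply. This is illustrative C++ only, not Tiramisu's actual API: the point of algorithm/schedule separation is that the values computed are fixed while the schedule controls iteration order and parallel structure.

```cpp
#include <thread>
#include <vector>

constexpr int N = 256, T = 32;  // problem size and tile size (N % T == 0)
using Mat = std::vector<std::vector<float>>;

// The algorithm (the "what"): C[i][j] += A[i][k] * B[k][j].
void matmul(const Mat& A, const Mat& B, Mat& C) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                C[i][j] += A[i][k] * B[k][j];
}

// The same algorithm after a schedule of roughly "tile(i, j, T);
// parallelize(ii)": only the iteration order and parallelism change.
void matmul_scheduled(const Mat& A, const Mat& B, Mat& C) {
    std::vector<std::thread> pool;
    for (int ii = 0; ii < N; ii += T)
        pool.emplace_back([&, ii] {  // each thread owns T rows of C: no races
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; ++i)
                    for (int j = jj; j < jj + T; ++j)
                        for (int k = 0; k < N; ++k)
                            C[i][j] += A[i][k] * B[k][j];
        });
    for (auto& t : pool) t.join();
}

int main() {
    Mat A(N, std::vector<float>(N, 1.0f)), B = A;
    Mat C1(N, std::vector<float>(N, 0.0f)), C2 = C1;
    matmul(A, B, C1);
    matmul_scheduled(A, B, C2);
    return C1 == C2 ? 0 : 1;  // both schedules compute identical results
}
```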

September 30

Julia: Making dynamic programs run fast
2:00–3:00 PM, 32-D463 (Star)

Abstract: Julia is a general-purpose language that combines type inference with a just-in-time (JIT) compiler, built upon LLVM, to achieve C-like performance while retaining the dynamic semantics of interactive languages like MATLAB, Python, and R. This talk will briefly introduce Julia, explore how it is possible for a dynamic language to be fast, and show how Julia makes use of LLVM. We will focus on interacting with the compiler infrastructure and explore how Julia targets accelerators like NVIDIA GPUs.

Bio: Valentin Churavy is a PhD student at CSAIL. His research at the JuliaLab focuses on heterogeneous distributed computing for high-performance computing. He is interested in human-compiler interaction and how to help people write fast code in a high-level dynamic language.

October 21

SplinterDB: Closing the Bandwidth Gap on NVMe
2:00–3:00 PM, G32-449 (Patil/Kiva)

Abstract: Key-value stores are a fundamental component of storage systems. Recent advances in storage hardware, especially the advent of NVMe devices, expose new bottlenecks in the design of current key-value stores. We introduce SplinterDB and show how it achieves a 6-9x improvement in insertion rate over Facebook's RocksDB on NVMe devices, while also improving query performance by 2-2.5x.

Bio: Alex Conway is a PhD student in Computer Science at Rutgers University, advised by Martin Farach-Colton. His research focuses on external-memory theory and storage systems. His recent publications include work on high-performance key-value stores, file-system aging, external-memory hash tables, and the analysis of storage-optimization heuristics. He is a member of the BetrFS team; BetrFS, a prototype file system, is designed around the principles of write-optimization and is built using variants of Bε-trees.