July 23

Add to Calendar 2019-07-23 15:00:00 2019-07-23 16:00:00 America/New_York Speculative Concurrency for Ethereum Smart Contracts Modern cryptocurrency systems, such as Ethereum, permit complex financial transactions through scripts called smart contracts. These smart contracts are executed many, many times, always without real concurrency. Serial execution limits system throughput and fails to exploit today's concurrent multicore and cluster architectures.This talk presents a novel way to permit miners and validators to execute smart contracts in parallel, based on techniques adapted from software transactional memory. Miners execute smart contracts speculatively in parallel, "discovering" a serializable concurrent schedule for a block's transactions, This schedule is captured and encoded as a deterministic fork-join program used by validators to re-execute the miner's parallel schedule deterministically but concurrently.We examined historical data to estimate the potential benefit of speculative techniques for executing Ethereum smart contracts in parallel. We find that our speculative technique yields estimated speed-ups starting at about 8-fold in 2016, declining to about 2-fold at the end of 2017, where speed-up is measured using either gas costs or instruction counts. We also observe that a small set of contracts are responsible for many data conflicts resulting from speculative concurrent execution.Joint work with Thomas Dickerson, Paul Gazzillo, Eric Koskinen, and Vikram Saraph.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

July 30

Add to Calendar 2019-07-30 15:00:00 2019-07-30 16:00:00 America/New_York Advances in Determinacy Race Detection for Task-Parallel Code Abstract: Task parallelism is designed to simplify the job of writing parallel code that can utilize the multicore hardware efficiently. With task parallelism, the programmer expresses the logical parallelism of the computation using high-level parallel control constructs, and lets the the underlying runtime system automates the necessary scheduling and synchronizations. Even with task parallelism, writing correct and efficient parallel code can still be challenging.One of the challenges is to deal with determinacy races, which occur when logically parallel parts of the computation access the same memory location and at least one of the accesses is a write. Determinacy races are generally bugs in the program since they lead to non-deterministic program behavior --- different schedules of the program can lead to different results. Moreover, they are difficult to detect and debug, since a race may manifest itself in one run but not in another.In this talk, I will discuss recent advances in detecting determinacy races for task-parallel code. Most prior work on detecting determinacy races has focused on fork-join parallelism, which generates series-parallel dags, a class of computations that is well structured. We will discuss race detection algorithms that can race detect programs with more general structures.Bio: I-Ting Angelina Lee is an assistant professor in the Computer Science and Engineering department in Washington University in St. Louis. Prior to that, she worked with the Supertech research group in Massachusetts Institute of Technology lead by Professor Charles Leiserson for her graduate study and subsequently as a postdoctoral associate.Dr. Lee's research focuses on advancing software technologies for parallel computing. She is interested in many aspects of parallel computing, including designing programming models and linguistic constructs to simplify parallel programming, developing runtime and operating system support to execute multithreaded programs efficiently, and building software tools to aid debugging and performance engineering of multithreaded code.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

August 06

Add to Calendar 2019-08-06 15:00:00 2019-08-06 16:00:00 America/New_York Optimal system settings: How to not lose before you begin Abstract: Most default system configurations (physical and virtual) are set to maximize compatability and maintainability. In such configurations, good performance may be impossible. High performance system settings are rarely documented. In this talk, we will describe how simple benchmarking can be used to determine the optimal settings that are a precondition to getting high performance. Reference: https://arxiv.org/abs/1907.03195 Bio: Dr. Jeremy Kepner is Head and Founder of the MIT Lincoln Laboratory Supercomputing Center (LLSC). He is also a Founder of the MIT-Air Force AI Accelerator. Lincoln Laboratory is a 4000-person National Laboratory whose mission is to create defensive technologies to protect our Nation and the freedoms enshrined in the Constitution of the United States. Dr. Kepner is one of five Lincoln Laboratory Fellows, a position that "recognizes the Laboratory's strongest technical talent for outstanding contributions to Laboratory programs over many years." Dr. Kepner is focused on creating state-of-the-art computing capabilities to enable scientific and engineering breakthroughs to solve the Nation's most challenging problems. He received his PhD in Astrophysics from Princetion University in 1998 (advisor: Prof. David Spergel) and his BA in Astrophysics from Pomona College in 1991.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

August 20

Add to Calendar 2019-08-20 15:00:00 2019-08-20 16:00:00 America/New_York Tapir: Embedding Recursive Fork-Join Parallelism into LLVM's Intermediate Representation Abstract: Tapir is a compiler intermediate representation (IR) that embeds recursive fork-join parallelism, as supported by task-parallel programming platforms such as Cilk and OpenMP, into a mainstream compiler's IR. Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations on and across parallel control constructs. Remedying this situation has generally been thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics. Tapir leverages the "serial-projection property," which is commonly satisfied by task-parallel programs, to handle the semantics of these programs without an extensive rework of the compiler. For recursive fork-join programs that satisfy the serial-projection property, Tapir enables effective compiler optimization of parallel programs with only minor changes to existing compiler analyses and code transformations. Tapir uses the serial-projection property to order logically parallel fine-grained tasks in the program's control-flow graph. This ordered representation of parallel tasks allows the compiler to optimize parallel codes effectively with only minor modifications. For example, to implement Tapir/LLVM, a prototype of Tapir in the LLVM compiler, we added or modified less than 3000 lines of LLVM's half-million-line core middle-end functionality. These changes suffice to enable LLVM's existing compiler optimizations for serial code to work with parallel control constructs such as parallel loops and Cilk's cilk_spawn keyword. Tapir also supports parallel optimizations, such as loop scheduling and loop stripmining, that restructure the parallel control flow of the program. By making use of existing LLVM optimizations and new parallel optimizations, Tapir/LLVM can optimize recursive fork-join programs more effectively than traditional compilation methods. On a suite of 35 Cilk application benchmarks, Tapir/LLVM produces more efficient executables for 30 benchmarks, with faster 18-core running times for 26 of them, compared to a nearly identical compiler that compiles parallel linguistic constructs the traditional way. In addition, by integrating Tapir/LLVM into the Accelerated Linear Algebra (XLA) compiler in Google's TensorFlow machine-learning framework, Tapir/LLVM enables more effective compiler optimization of a variety of neural networks written in TensorFlow, improving the parallel performance of these networks by a geometric-mean multiplicative factor of 30%-100% across a variety of CPU hardware architectures. Bio: Tao B. Schardl is a Research Scientist in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research combines algorithms and systems to develop technologies that support principled, scientific approaches to writing fast code. He has previously worked on parallel programming models, theories of performance, diagnostic tools, and compilers to simplify the task of software performance engineering. His work on the Tapir/LLVM compiler earned the best paper award at the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming in 2017. Dr. Schardl received his S.B. in Computer Science and Electrical Engineering from MIT in 2009, his M.Eng. in Computer Science and Electrical Engineering from MIT in 2010, and his Ph.D. in Computer Science and Engineering from MIT in 2016.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

August 27

Add to Calendar 2019-08-27 15:00:00 2019-08-27 16:00:00 America/New_York Algorithms and Systems for Processing Massive Static and Evolving Graphs Abstract: Today, the largest publicly available graph dataset is the Hyperlink Web graph, with over 3.5 billion vertices and 128 billion edges. Despite being publicly available for several years now, most experimental work in the literature reports results on much smaller graphs, or resorts to external or distributed memory to process the Hyperlink graph. A natural question then, is to ask whether we can efficiently solve a broad class of problems on this graph in memory. Given that existing systems capable of processing this graph only support static processing, another interesting question is whether we can efficiently represent and update this graph in memory. In the first part of this talk I will present theoretically efficient implementations of parallel graph algorithms for a set of 20 important graph problems. Our codes, which we have publicly open-sourced as part of the Graph Based Benchmark Suite (GBBS) solve these problems on the Hyperlink Web graph using a single commodity multicore machine with a terabyte of RAM, with most of the codes running in under a minute on this graph. The running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs, and for many of the problems that we consider, this is the first time they have been solved on graphs at this scale. In the second part of this talk I will present Aspen, a framework for processing streaming graphs, which are graphs that evolve over time via a stream of updates (batches of edge insertions/deletions). The key idea is to represent the adjacency information using a purely-functional compressed tree data structure called a C-tree, which can be efficiently updated. Aspen achieves millisecond latency for processing small batches, and achieves throughputs of hundreds of millions of updates per second for very large batches, while providing strict serializability for both queries and updates. Using Aspen, we are able to concurrently update and run analytics on the Hyperlink Web graph using a single commodity multicore server with a terabyte of RAM. Based on joint work with Guy Blelloch and Julian Shun. Bio: Laxman is a fifth year PhD student at CMU, where he is fortunate to be advised by Guy Blelloch. His graduate work has focused mostly on designing fast and theoretically efficient parallel graph algorithms.Livestream link: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

September 16

Domain-Specific Accelerators

NVIDIA Corporation and Stanford University
Add to Calendar 2019-09-16 14:15:00 2019-09-16 15:15:00 America/New_York Domain-Specific Accelerators Abstract: Increasing computing performance enables new applications and greater value from computing. With the end of Moore's Law and Dennard Scaling, continued performance scaling will come primarily from specialization. Specialized hardware engines can achieve performance and efficiency from 10x to 10,000x a CPU through specialization, parallelism, and optimized memory access. Graphics processing units are an ideal platform on which to build domain-specific accelerators. They provide very efficient, high performance communication and memory subsystems - which are needed by all domains. Specialization is provided via "cores", such as tensor cores that accelerate deep learning or ray-tracing cores that accelerate specific applications. This talk will describe some common characteristics of domain-specific accelerators via case studies. Bio: Bill Dally is Chief Scientist and Senior Vice President of Research at NVIDIA Corporation and a Professor (Research) and former chair of Computer Science at Stanford University. Bill is currently working on developing hardware and software to accelerate demanding applications including machine learning, bioinformatics, and logical inference. He has a history of designing innovative and efficient experimental computing systems. While at Bell Labs Bill contributed to the BELLMAC32 microprocessor and designed the MARS hardware accelerator. At Caltech he designed the MOSSIM Simulation Engine and the Torus Routing Chip which pioneered wormhole routing and virtual-channel flow control. At the Massachusetts Institute of Technology his group built the J-Machine and the M- Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low overhead synchronization and communication mechanisms. At Stanford University his group developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations, the Merrimac supercomputer, which led to GPU computing, and the ELM low-power processor. Bill is a Member of the National Academy of Engineering, a Fellow of the IEEE, a Fellow of the ACM, and a Fellow of the American Academy of Arts and Sciences. He has received the ACM Eckert-Mauchly Award, the IEEE Seymour Cray Award, the ACM Maurice Wilkes award, the IEEE-CS Charles Babbage Award, and the IPSJ FUNAI Achievement Award. He currently leads projects on computer architecture, network architecture, circuit design, and programming systems. He has published over 250 papers in these areas, holds over 160 issued patents, and is an author of the textbooks, Digital Design: A Systems Approach, Digital Systems Engineering, and Principles and Practices of Interconnection Networks.Livestream: https://www.youtube.com/channel/UCYs2iUgksAhgoidZwEAimmg/live G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

September 23

Add to Calendar 2019-09-23 14:00:00 2019-09-23 15:00:00 America/New_York Tiramisu: A Polyhedral Compiler for Dense and Sparse Deep Learning Abstract: Tiramisu is a polyhedral compiler for deep learning. It has two unique features: (1) it is the first sparse DNN compiler; and (2) it can express and optimize general RNNs (Recurrent Neural Networks). Tiramisu relies on a flexible representation based on the polyhedral model and has a rich scheduling language allowing fine-grained control of optimizations. We show that Tiramisu matches or outperforms Intel MKL-DNN by up to 5x for sparse DNNs. We also show that Tiramisu allows many new capabilities such as wavefront parallelization for RNNs.Bio: Riyadh Baghdadi is a postdoctoral associate at CSAIL. He got his PhD from ENS Paris. He works on the area of compiler optimization and code generation for high performance architectures. Recently, he has been interested in exploring the intersection of compilers and deep learning. 32-D463 (Star) Belfer sarah_donahue@hks.harvard.edu

September 30

Add to Calendar 2019-09-30 14:00:00 2019-09-30 15:00:00 America/New_York Julia: Making dynamic programs run fast Abstract: Julia is a general purpose language that utilizes type inference combined with a Just-In-Time (JIT) compiler, built upon LLVM, to achieve C-like performance while retaining the dynamic semantics of interactive languages like MATLAB, Python, and R. This talk will briefly introduce Julia, explore how it is possible for a dynamic language to be fast and to how Julia is making use of LLVM. We will focus on interacting with the compiler infrastructure and explore how to Julia targets accelerators like NVIDIA GPUs.Bio: Valentin Churavy is a PhD student at CSAIL. His research at the JuliaLab focuses on heterogeneous distributed computing for high-performance computing. He is interested in human-compiler interaction and how to help people write fast code in a high-level dynamic language. 32-D463 (Star) Belfer sarah_donahue@hks.harvard.edu

October 21

Add to Calendar 2019-10-21 14:00:00 2019-10-21 15:00:00 America/New_York SplinterDB: Closing the Bandwidth Gap on NVMe Abstract: Key-value stores are a fundamental component to storage systems. Recent advances in storage hardware, especially the advent of NVMe devices, expose new bottlenecks in the design of current key-value stores. We introduce SplinterDB and show how it achieves a 6-9x improvement in the insertion rate over Facebook's RocksDB on NVMe devices, while also improving query performance by 2-2.5x.Bio: Alex Conway is a PhD student in Computer Science at Rutgers University, advised by Martin Farach-Colton. His research is focused on external memory theory and storage systems. His recent publications include work on high performance key-value stores, file system aging, external memory hash tables and the analysis of storage optimization heuristics.He is a member of the BetrFS team. BetrFS, a prototype file system, is designed around the principles of write-optimization and is built using variants of B\epsilon-trees. G32-449 (Patil/Kiva) Belfer sarah_donahue@hks.harvard.edu

November 04

Add to Calendar 2019-11-04 14:00:00 2019-11-04 15:00:00 America/New_York GraphIt: A Domain-Specific Language for Writing High-Performance Graph Applications Abstract: In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. There is a strong demand for a programming system that allows domain experts to easily write applications that run quickly on these large graphs. However, programming high-performance graph applications is a challenging problem. The performance bottlenecks of graph applications depend on the algorithm, the size and structure of the graph, and the underlying hardware. There is not a set of optimizations that work well across all algorithms, graphs, and hardwares. Additionally, domain experts are often not familiar with low-level implementation details, such as atomic instructions and bit manipulations, needed for today's high-performance graph frameworks.In this talk, I will present our work on GraphIt, a new domain-specific language (DSL) that offers an easy-to-use programming model, achieves consistent high-performance across different algorithms and graphs, and supports both unordered and ordered graph algorithms. GraphIt decouples algorithms from performance optimizations (schedules) for graph applications to make it easy to explore a large space of performance optimizations. The compiler leverages a new graph iteration space model to generate fast code for different combinations of optimizations. We also introduce new priority-based extensions to support ordered graph algorithms and are working on code generation for GPU architectures.Bio: Yunming Zhang is a sixth year PhD student at MIT working with Saman Amarasinghe and Julian Shun. He works on domain-specific compilers, languages and performance optimizations for large-scale data analytics applications. 32-G449 (Kiva) Belfer sarah_donahue@hks.harvard.edu

November 18

Add to Calendar 2019-11-18 14:00:00 2019-11-18 15:00:00 America/New_York How to Modernize Compiler Technology Abstract: In this talk, I will show the path to the modernization of one important compiler technique -- vectorization. Vectorization was first introduced in the era of Cray vector processors during the 1980's. While vector supercomputers needed large vectors, which are mainly available by parallelizing loops, modern SIMD instructions efficiently work on short vectors. Thus, in 2000, Larsen et. al. introduced a new technique known as Superword Level Parallelism (SLP) based vectorization to fill this gap. First, I will show how we can take advantage of the power of modern computers for compilation, by using more accurate but expensive techniques to improve SLP vectorization. Due to the hardware resource constraints of the era, like many other compiler optimizations, SLP implementation was a greedy algorithm. In 2018, we introduced goSLP, which uses integer linear programming to find an optimal instruction packing strategy. Next, I will show how to truly modernize a compiler by automatically learning the necessary components of the compiler with Ithemal and Vemal. The actual cost of execution in a modern out-of-order, superscalar processor is much more complex than a linear model. Manually building such cost models as well as manually developing compiler optimizations is costly, tedious and is hard to keep up with the architectural changes. Ithemal is the first learnt cost model for predicting the throughput of x86 basic blocks with minimal human effort. Vemal is the first learnt policy for end-to-end vectorization as opposed to tuning heuristics. These data-driven techniques achieve state-of-the-art results while also reducing the development and maintenance burden of the compiler developer. 32-G449 Kiva/Patil Belfer sarah_donahue@hks.harvard.edu

November 25

Add to Calendar 2019-11-25 14:00:00 2019-11-25 15:00:00 America/New_York A Speed-of-Light Internet Service Provider Abstract: A variety of network applications, including electronic commerce and games, are either enabled by or benefit greatly from low latency communications. Studies have shown, however, that over medium and long distances the time to send a packet from one city to another on the public Internet is typically more than three times larger than the lower bound implied by the speed of light in free space. Hence for applications like high-frequency trading, where the winner of a communications race receives all the benefits, special purpose networks have been deployed. For example, between New Jersey and Chicago a succession of networks has been deployed, first fiber-based and then microwave-based, with each network reducing latency by a fraction of a millisecond over the previous. This talk explores the possibility of using the same radio technology to build a network backbone spanning the 120 largest population centers in the United States. The design places radios on existing towers, using topographic maps to ensure line-of-sight connectivity between towers. The impact of weather on the network is evaluated using historical weather data. Our analysis suggests that it should be possible to achieve mean speeds of over 95% of the speed of light over medium and long distances at a transmission cost of under $1 per GB.Joint work with Sangeetha A. J., Anthony Aquirre, Waqar Aqeel, Nadi Bozkurt, Bala Chandrasekaran, Brighten Godfrey, Greg Laughlin, Ankit Singla, and Muhammad Tirmazi.Bio: Bruce Maggs received the S.B., S.M., and Ph.D. degrees in computer science from the Massachusetts Institute of Technology in 1985, 1986, and 1989, respectively. His advisor was Charles Leiserson. After spending one year as a Postdoctoral Associate at MIT, he worked as a Research Scientist at NEC Research Institute in Princeton from 1990 to 1993. In 1994, he moved to Carnegie Mellon, where he stayed until joining Duke University in 2009 as a Professor in the Department of Computer Science. While on a two-year leave-of-absence from Carnegie Mellon, Maggs was a founding employee of Akamai Technologies, serving as its first Vice President for Research and Development. In 2018 he was part of a large team that received the inaugural SIGCOMM Networking Systems Award for the Akamai CDN, and was named an ACM Fellow. 32-G449 (Kiva) Belfer sarah_donahue@hks.harvard.edu

February 18

Add to Calendar 2020-02-18 14:00:00 2020-02-18 15:00:00 America/New_York Safe Parallel Programming -- ParaSail, Ada 202X, OpenMP, and Rust Abstract: Is it possible to bring support for safe parallel programming to modern programming languages, where the compiler does the work of detecting possible data races while the programmer simply identifies where in their program they would like to take advantage of multicore/manycore hardware capabilities? This talk will introduce the programming language ParaSail (Parallel Specification and Implementation Language) which provides safe parallel programming in the context of a pointer-free, object-oriented, strongly-typed modern programming language. The talk will also include a discussion of other recent work to bring compile-time safety to parallel programming, including the upcoming 202X version of the Ada programming language, the OpenMP multiplatform, multilanguage API for parallel programming, and Rust, a language that from the beginning tried to provide safe concurrent programming, and more recently has provided a safe light-weight parallelism library called Rayon.See https://arxiv.org/ftp/arxiv/papers/1902/1902.00525.pdf for more information on ParaSail.Bio: S. Tucker Taft has been working on programming language design for his entire career. Tucker is currently VP and Director of Language Research at AdaCore. Tucker led the Ada 9X language design team, culminating in the February 1995 approval of Ada 95 as the first ISO standardized object-oriented programming language. His specialties include programming language design, advanced static analysis tools, formal methods, real-time systems, parallel programming, and model-based development. Tucker is a member of the ISO Rapporteur Group that developed Ada 2005 and Ada 2012. Tucker has also been designing and implementing a parallel programming language called "ParaSail," and defining parallel programming extensions for Ada as part of the forthcoming Ada 202X standard. Prior to joining AdaCore, Tucker was Founder and CTO of SofCheck, Inc., providing tools and technology to enhance software development quality and productivity. Prior to that Mr. Taft was a Chief Scientist at Intermetrics, Inc. and its follow-ons for 22 years. Tucker received an A.B. Summa Cum Laude degree from Harvard University, where he has more recently taught compiler construction and programming language design. 32-D449 (Kiva) Belfer sarah_donahue@hks.harvard.edu