NECSTLab Technical Talks

Speaker

NECSTLab

Politecnico di Milano

Host

Emanuele Del Sozzo

MIT FutureTech

NECSTLab (Politecnico di Milano, Italy), led by Prof. Marco D. Santambrogio, will be visiting MIT and giving a series of technical talks as part of the NECST Group Conference (NGC), an initiative offering participants a unique opportunity to present their work at leading companies' headquarters and engage with research groups and laboratories at top-tier universities. Here is the list of talks (abstracts and bios are reported below):

A Quantum Method to Match Vector Boolean Functions Using Simon’s Solver
Speaker: Marco Venere
ALVEARE: A Full-Stack Domain-Specific Framework for Regular Expressions
Speaker: Filippo Carloni
GrOUT: A Modular Framework for Scalable Multi-GPU Systems in Oversubscribed Scenarios
Speaker: Ian Di Dio Lavore
Moyogi: Exploiting Random Forest Parallelism for Low-Latency Inference on Embedded Devices
Speaker: Alessandro Verosimile

Coffee and pastries will be provided

The NECSTLab is a laboratory within the DEIB department (Dipartimento di Elettronica, Informazione e Bioingegneria) of Politecnico di Milano. It is a place where research meets teaching and teaching meets research, including through academic and industrial events.

------------

A Quantum Method to Match Vector Boolean Functions Using Simon’s Solver
Speaker: Marco Venere

Abstract: The Boolean Matching Problem is a fundamental step in modern Electronic Design Automation (EDA) toolchains, which allow for the efficient design of large classical computers. In particular, deciding the equivalence under negation-permutation-negation of two n-to-n vector Boolean functions requires exploring a super-exponential number of possible negations and permutations of input and output variables, and is widely regarded as a daunting challenge. Its classical complexity, O(n! 2^(2n)), where n is the number of input and output variables, is rarely tolerated by EDA tools, which typically solve small instances of the Boolean Matching Problem for n-to-1 Boolean functions. In this work, we present a method that exploits the solver for Simon’s problem to speed up the matching of n-to-n vector Boolean functions, as we show that, despite its higher complexity, this problem is friendlier to a quantum solver than matching single-output Boolean functions. Our solution saves a factor of 2^n in the overall worst-case computational effort, and is amenable to combined approaches such as the so-called Grover-meets-Simon, which have the potential to reduce that effort below the cost of classical n-to-1 matching. We provide a fully detailed quantum circuit implementing our proposal and compute its cost in terms of both the number of qubits and the number of quantum gates required. Furthermore, our experimental evaluation employs the ISCAS benchmark suite, a de facto standard for classical EDA, to derive our sample Boolean functions.

Bio: Marco Venere is a second-year PhD student in Information Technology. His research focuses on the design of quantum algorithms that achieve superpolynomial speedup with respect to classical computation, and on quantum error correction accelerated on FPGAs. Besides his main research topics, he also works on the compilation of quantum circuits. He was also a TPC member for IEEE QCE 2024, a member of the Quantum Open-Source Foundation, a reviewer for IEEE TCAD, and was awarded a microgrant from UnitaryFund.

ALVEARE: A Full-Stack Domain-Specific Framework for Regular Expressions

Speaker: Filippo Carloni

Abstract: Regular Expressions (REs) represent one of the most pervasive but challenging computational kernels to execute. Indeed, RE matching enables the identification of functional data patterns in heterogeneous fields ranging from personalized medicine to computer security. However, such applications require massive data analysis that, combined with the high data dependency of REs, leads to long computation times and high energy consumption. Currently, RE engines offer either flexibility, in the form of run-time RE adaptability and broad operator support, at the expense of performance, or fixed high-performing accelerators that implement only a few simple RE operators. This talk describes ALVEARE: a hardware-software approach combining a Domain-Specific Language (DSL) with an RE-tailored Domain-Specific Architecture (DSA), constituting a full-stack framework. Specifically, ALVEARE exploits REs as a DSL by translating them into executables via the proposed compiler, while the DSA performs RE matching efficiently through a speculation-based RISC microarchitecture. The microarchitecture is based on the proposed Instruction Set Architecture, which effectively expresses RE operators, from standard and simple ones to advanced primitives widely employed in real benchmarks. Our RE-centric optimized compiler lifts part of the RE-matching complexity from the hardware to the software, simplifying the architecture design to retain high performance while improving flexibility. ALVEARE showcases attractive results in execution time and energy efficiency against existing CPU-based and ASIC-based solutions in low-latency and near-data scenarios.

Bio: Filippo Carloni is a PhD candidate in Information Technology (Computer Science and Engineering) at Politecnico di Milano. His PhD research focuses on domain-specific architectures and compilers, with a particular emphasis on the regular expressions domain. He works extensively with hardware description languages and FPGAs to address architectural challenges and implement advanced solutions. During the final year of his PhD, he was a visiting student at the COMMIT lab at MIT, where he began exploring SmartNIC hardware acceleration. His broader research interests include RISC-V architecture and ISA design.

GrOUT: A Modular Framework for Scalable Multi-GPU Systems in Oversubscribed Scenarios

Speaker: Ian Di Dio Lavore

Abstract: Hardware accelerators are vital in modern computing but face significant challenges when handling workloads that exceed available memory capacity. Unified Virtual Memory (UVM) offers a promising solution by enabling oversubscription, allowing the end user to handle datasets larger than the hardware’s physical memory. However, the page-faulting mechanism used by oversubscription often introduces severe performance overheads, particularly for large-scale workloads. This work presents GrOUT, a language- and domain-agnostic framework designed to address these challenges in oversubscribed multi-GPU systems. GrOUT employs a modular architecture with GraalVM-based and C++ header-only frontends, providing flexibility and ease of integration. The framework optimizes memory usage by eliminating redundant data copies through inter-process communication (IPC) and enhances scalability by transitioning to an MPI-based communication model. These advancements enable developers to scale out workloads efficiently, mitigating the impact of oversubscription and ensuring robust performance across distributed GPU systems.

Bio: Ian Di Dio Lavore is a Ph.D. student in Information Technology - Computer Science and Engineering at Politecnico di Milano. He holds an M.Sc. (2022) and a B.Sc. (2020) in Computer Science and Engineering from Politecnico di Milano. Ian worked in the HPC Team of the Scalable Computing and Data group at the Pacific Northwest National Laboratory (PNNL). His research mainly focuses on parallel and distributed computing and programming models, with a particular interest in high-level abstractions for heterogeneous HPC systems.

Moyogi: Exploiting Random Forest Parallelism for Low-Latency Inference on Embedded Devices

Speaker: Alessandro Verosimile

Abstract: The convergence of Artificial Intelligence (AI) and the Internet of Things (IoT) is driving the need for real-time, low-latency architectures to trust the inference of complex Machine Learning (ML) models in critical applications like autonomous vehicles and smart healthcare. While traditional cloud-based solutions introduce latency due to the need to transmit data to and from centralized servers, edge computing offers lower response times by processing data locally. In this context, Random Forests (RFs) are highly suited for building hardware accelerators on resource-constrained edge devices due to their inherent parallelism. Nevertheless, maintaining low latency as the size of the RF grows remains a challenge for state-of-the-art (SoA) approaches. To address this challenge, this paper proposes Moyogi, a hardware-software codesign framework for memory-centric RF inference that optimizes the architecture for the target ML model, employing RFs with Decision Trees (DTs) of multiple depths and exploring several architectural variations to find the best-performing configuration. We propose a resource estimation model based on the most relevant architectural features to enable effective Design Space Exploration. Moyogi achieves a geomean latency reduction of 3.88x on RFs trained on relevant IoT datasets, compared to the best-performing SoA memory-centric architecture.

Bio: Alessandro Verosimile is a second-year PhD student in Information Technology at Politecnico di Milano. He worked for six months as a research intern in the RAD team of Advanced Micro Devices (AMD). His research focuses on HW-SW co-design techniques that aim to co-optimize the training of large Machine Learning models and the design of the hardware architecture for their inference on embedded devices, with a focus on both deep learning models and decision-tree-based ensemble models.

Date and Time

February 12, 2025, 9:00 to 11:00 (America/New_York)

Location

TBD

Organizer & Contact

Emanuele Del Sozzo

delsozzo@mit.edu
