Supporting Updates in Information Extraction Pipelines [Zoom Talk]
Abstract: Information extraction, a key step in text processing and understanding tasks, has a rich history spanning decades. However, a notable limitation in this domain has been the tendency to treat the extracted relations and their source documents as isolated entities after deployment. This talk proposes a paradigm shift: treating extracted relations as materialized views of document databases, which unlocks two significant advantages. First, we demonstrate that this shift makes updatable extracted views feasible and thus allows extractors to contribute solutions to open problems in unstructured data management. Second, we explore how viewing extracted relations as materialized views enables new optimizations through static analysis of extraction programs. The talk concludes with a discussion of potential future research directions.
Speaker Bio: Besat is a Postdoctoral Researcher at the Cheriton School of Computer Science, University of Waterloo, where she works with Prof. Renée J. Miller on data intelligence. She earned her Ph.D. in 2023 under the supervision of Prof. Frank Tompa, specializing in unstructured data management. Her research interests center on unstructured data, with a particular focus on information retrieval, information extraction, and data discovery in large-scale data lakes. In her recent work, she has developed a technique for quantifying and improving novelty in table retrieval from data lakes. Besat’s contributions have been recognized through publications in leading venues, including PVLDB, ACM Transactions on Computing for Healthcare, the ACM Symposium on Document Engineering (DocEng), and CLEF. Her work on the reformulation of mathematical queries recently received the Best Paper Award at the ACM DocEng 2025 conference.
---
For the Zoom passcode, please contact markakis@mit.edu