Information Extraction from the World Wide Web: Discriminative Finite State Models, Feature Induction and Scoped Learning
Speaker: Andrew McCallum, UMass Amherst
Date: February 6, 2003
The Web is the world's largest knowledge base. However, its data is in a form intended for human reading, not for manipulation, data mining, or reasoning by computers. Today's search engines only help people find web pages. Tomorrow's search engines will also help people find "things" (such as people, jobs, companies, and products), facts, their relations, and trends.
Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human-readable text. Finite state machines are the dominant model for information extraction in both research and industry. In this talk I will give several examples of information extraction tasks performed at WhizBang Labs, and then describe new finite state models designed to take special advantage of the multi-faceted nature of text on the web. Maximum Entropy Markov Models and Conditional Random Fields are discriminative sequence models that allow each observation to be represented as a collection of arbitrary, overlapping features (such as word identity, capitalization, part-of-speech, layout, and formatting, plus features from the past and future). I will introduce both models, skim over their parameter estimation algorithms, present some new work in feature induction, and give experimental results on real-world tasks. I will then describe Scoped Learning, a method that further improves information extraction and classification by taking advantage of local regularities in training and test sets.
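To make "arbitrary overlapping features" concrete: in a linear-chain CRF the label sequence is scored as p(y|x) proportional to exp(sum over positions t and features k of lambda_k * f_k(y_{t-1}, y_t, x, t)), so each feature function may inspect the entire observation sequence, past and future alike. Below is a minimal Python sketch of such token features; the feature names and the particular features chosen are illustrative assumptions for exposition, not the models or feature sets described in the talk.

    # Sketch of overlapping token features for a discriminative
    # sequence model (MEMM/CRF style). Feature names here are
    # hypothetical, not the WhizBang Labs feature set.

    def token_features(tokens, i):
        """Features for position i, drawing on the word itself, its
        orthography, and neighboring (past and future) tokens."""
        word = tokens[i]
        feats = {
            "word=" + word.lower(): 1.0,               # word identity
            "is_capitalized": float(word[:1].isupper()),
            "is_all_caps": float(word.isupper()),
            "is_digit": float(word.isdigit()),
        }
        # Unlike a generative HMM, a discriminative model may condition
        # on the whole observation sequence, so features can look at
        # tokens before and after the current position.
        if i > 0:
            feats["prev_word=" + tokens[i - 1].lower()] = 1.0
        if i < len(tokens) - 1:
            feats["next_word=" + tokens[i + 1].lower()] = 1.0
        return feats

    if __name__ == "__main__":
        sentence = "Andrew McCallum joined UMass Amherst".split()
        for i in range(len(sentence)):
            print(sentence[i], token_features(sentence, i))

Because these features overlap (a capitalized word also fires its word-identity feature, its neighbor features, and so on), they violate the independence assumptions of generative models; discriminative training of the weights lambda_k is what makes such feature sets usable.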
Joint work with Fernando Pereira, John Lafferty, Dayne Freitag, David Blei, Drew Bagnell, and many others at (the former) WhizBang Labs.