Bibliometrics using Machine Learning and Natural Language Processing

Reviewing past and current literature is a key scientific and engineering activity. Quantitative analysis of the language, topics, keywords, or citation graphs of a given subset of scientific literature (bibliometrics) can be a great help in understanding what has been done in a field and what the important next steps are. Unfortunately, off-the-shelf support for automatically extracting citation graphs and analyzing them with natural language processing and machine learning is still relatively limited. This UROP project aims to advance the state of the art of bibliometrics on several fronts. As a warm-up, we will first write a few small research programs that check the well-formedness of a citation network. Next, we will take a research tool developed at the University of Maryland called Action Science Explorer (http://www.cs.umd.edu/hcil/ase/) and retrofit it so that it can generate its suite of bibliometric analyses from an already-constructed citation network. Following this, we will seek to automate the extraction of citation networks from the PDFs of a given set of articles, along with the identification of important citations not currently included in that set. At this stage the project has many possible branches and can evolve into research involving machine learning, natural language processing, or network analysis. At all stages our goal will be to produce software that is well documented, unit-tested, and ready to be released to the general public.
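To give a flavor of the warm-up task, here is a minimal sketch of a well-formedness check for a citation network. The posting does not define "well-formed", so the three checks below (dangling citations, self-citations, duplicate citations) are assumptions chosen for illustration, as are the function names and the representation of the network as paper IDs plus (citing, cited) pairs. Python is used only for brevity; the project's existing codebase is in Java.

```python
from collections import Counter


def check_citation_network(papers, citations):
    """Report problems in a citation network (illustrative checks only).

    papers:    iterable of paper IDs (hypothetical representation)
    citations: iterable of (citing_id, cited_id) pairs
    """
    known = set(papers)
    edges = list(citations)
    counts = Counter(edges)
    return {
        # Edges that mention a paper missing from the node set.
        "dangling": sorted(
            {e for e in edges if e[0] not in known or e[1] not in known}
        ),
        # Papers that cite themselves.
        "self_citations": sorted({e for e in edges if e[0] == e[1]}),
        # The same citation recorded more than once.
        "duplicates": sorted(e for e, n in counts.items() if n > 1),
    }


def is_well_formed(papers, citations):
    """True if none of the assumed checks finds a problem."""
    return not any(check_citation_network(papers, citations).values())
```

A real checker would likely add further tests, for example that no paper cites one published after it, but that depends on what metadata the network carries.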

This is a challenging project that will require the student to engage with a large, complex, poorly documented Java codebase, figure out what it does and how it works (I like to call this “forensic programming”), clean it up, and add the required functionality. Ideally this calls for a student who is already an established Java programmer. Furthermore, the student should be highly motivated, self-directed, and independent, and should have enough free time in their schedule to make rapid progress. The project is open-ended and can go in many different directions depending on the student’s interests, and could potentially develop into an M.Eng. thesis involving machine learning, natural language processing, or network analysis. Because of this, we seek a student who in principle would be able to commit at least one year to the project, is interested in working over the summer, and would potentially be interested in continuing after one year if the project is going well and all parties are amenable. For more information and to schedule an interview, send your resume and the names and email addresses of two references by February 15 to Dr. Mark Finlayson, markaf@mit.edu.