An Open-Source Toolkit for De-Identifying Databases
Speaker: Hooman Katirai, Clinical Decision Making Group, CSAIL
Date: Tuesday, April 26 2005
Time: 3:00PM to 4:15PM
Refreshments: 2:45PM
Location: 32-250 Lounge
Host: Prof Peter Szolovits, CSAIL Clinical Decision Making Group
Contact: Fern DeOliveira, x3-5860, fern@csail.mit.edu
Relevant URL:
Privacy laws are an important facet of our society. But they can also serve as formidable barriers to important research. In the medical field, for example, privacy laws make it difficult for researchers to freely access the medical records they need to conduct studies on the causes of disease.
But there is hope that these barriers can be lifted through technology. In the US, for example, the same privacy laws that prevent casual disclosure of medical information also authorize hospitals to release medical information without a patient's consent if the information is de-identified using a statistical algorithm prior to its release. The opportunity afforded by laws such as these has given birth to a field of Computer Science known as "computational disclosure control."
Anonymizing data is not enough to guarantee privacy. Most people would, for example, consider a database consisting only of the fields {zip, date of birth, gender} to be "anonymous"; however, one researcher showed that 85% of the US population could be uniquely identified using only these fields.
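To make the re-identification risk concrete, here is a minimal sketch of how one might measure the share of records that are unique on a set of quasi-identifier fields. The helper name `unique_fraction` and the toy table are assumptions made up for illustration; this is not the method or the data behind the figure quoted above.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Return the fraction of records whose values on `fields` occur exactly once.

    `unique_fraction` is a hypothetical helper written for this illustration.
    """
    if not records:
        return 0.0
    # Project each record onto the chosen quasi-identifier fields.
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    # A record is "unique" if no other record shares its projected key.
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Toy records invented for illustration (not real data).
records = [
    {"zip": "02139", "dob": "1970-01-01", "gender": "F"},
    {"zip": "02139", "dob": "1970-01-01", "gender": "F"},
    {"zip": "02139", "dob": "1982-05-17", "gender": "M"},
    {"zip": "02142", "dob": "1990-11-30", "gender": "F"},
]

fraction = unique_fraction(records, ["zip", "dob", "gender"])
print(fraction)
```

In this toy table the first two records share all three values, so only the last two are unique on {zip, date of birth, gender}, even though no names appear anywhere.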
Thus, more rigorous techniques for de-identification of information are required. One promising technique, called k-anonymity, modifies each record so that it matches at least k other individuals in the population.
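As a rough illustration of the idea, the sketch below checks a table for k-anonymity before and after a simple generalization step. It rests on several assumptions: it uses the common formulation in which every combination of quasi-identifier values must occur at least k times within the released table (which differs slightly from the population-based wording above), and the function names `is_k_anonymous` and `generalize`, the generalization rule, and the toy records are all hypothetical. It is not the toolkit's algorithm.

```python
from collections import Counter


def is_k_anonymous(records, fields, k):
    """True if every combination of values on `fields` occurs at least k times.

    Assumption: k-anonymity is checked within the table itself,
    not against the wider population.
    """
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return all(count >= k for count in counts.values())


def generalize(record):
    """Hypothetical generalization: mask the last two ZIP digits, keep birth year only."""
    return {
        "zip": record["zip"][:3] + "**",
        "dob": record["dob"][:4],
        "gender": record["gender"],
    }


# Toy records invented for illustration (not real data).
records = [
    {"zip": "02139", "dob": "1970-01-01", "gender": "F"},
    {"zip": "02138", "dob": "1970-06-15", "gender": "F"},
    {"zip": "02142", "dob": "1982-05-17", "gender": "M"},
    {"zip": "02141", "dob": "1982-09-02", "gender": "M"},
]
fields = ["zip", "dob", "gender"]

before = is_k_anonymous(records, fields, 2)
generalized = [generalize(r) for r in records]
after = is_k_anonymous(generalized, fields, 2)
print(before, after)
```

Every original record is unique on the three fields, so the table is not 2-anonymous. After coarsening ZIP code and date of birth, the records fall into two groups of two, at the cost of less precise data.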
But there are often numerous ways to de-identify a record using k-anonymity. In fact, we show that choosing among them optimally is an NP-hard optimization problem. This motivates the need for a measure of information loss to guide de-identification algorithms.
We show that existing attempts in the literature to formulate an information loss measure fall prey to what is called the "weighted indexing problem" and are therefore not rationally defensible. We then share some current research directions in which we are seeking a rationally defensible information loss measure to guide the de-identification of medical information.
This work is presented in the context of an open-source de-identification toolkit, which we expect to release publicly in the near future.