Miguel Paredes: "An Integrative Data Science framework to jointly address Prediction, Quasi-Causation, and Causation"

Speaker

Miguel Paredes
ALFA Group, CSAIL, MIT-DUSP

Host

Una-May O'Reilly
ALFA Group, CSAIL

Committee:
Dr. Una-May O'Reilly, Dissertation Chair (CSAIL, MIT)
Professor Roy Welsch (IDSS / ORC / Sloan, MIT)
Professor David Geltner (DUSP, MIT)

Dissertation Abstract:

Many fundamental problems in society, such as medical decision support, urban planning, and client understanding, can be addressed by data-driven modeling. Frequently the only data available are observational rather than experimental. This precludes causal inference, though it supports quasi-causal inference (or causal approximation) and prediction based on association. Through three studies driven by observational data, this thesis compares and contrasts machine learning and econometric modeling in terms of their purposes, insights, and uses. It proposes a data science methodology that combines both types of modeling to enable experimental designs that would otherwise be impossible to carry out. First, it uses econometric models to understand a problem space and derive quasi-causal information that informs the variables selected for machine learning models. Next, it uses machine learning models to predict outcome likelihoods for members of a population, defining a study group whose likelihoods exceed a threshold of interest. The quasi-causal insights are then used to design a stratified randomized controlled trial (i.e., an A/B test) in which study subjects are randomly assigned to one of three balanced experimental groups. Finally, because the experiment is rigorously designed, the methodology can determine the causal effects of the interventions and the cost-effectiveness of the treatments relative to the control group.
The end-to-end methodology is presented in the third study, in which an enterprise seeks to address its customer churn. Observational records of prior churn are leveraged to understand the determinants of churn (e.g., age, socioeconomic status, tenure with the company, number of claims in the year) via discrete choice and survival regression econometric models. These variables are selected and, together with a churn label, used to enhance the training of a predictive machine learning model. The model is then used to predict each customer's likelihood of churning, and those more likely to churn are randomly assigned to three balanced cohorts for a randomized controlled trial (RCT). One cohort serves as a control group, the second cohort (Treatment A) receives a treatment that incentivizes renewal, and the third cohort (Treatment B) receives the same treatment with an extra incentive. The RCT is carried out and the most effective treatment is identified in a systematic, cost-efficient, and reliable manner, translating to a 6-percentage-point churn reduction, which is of significant monetary value for the enterprise.
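
For readers who want a concrete picture of the cohort-construction step, the following is a minimal sketch, not the thesis's implementation. It assumes predicted churn likelihoods are already available, keeps customers above a threshold, and then randomizes within each stratum so that the three arms stay balanced. The function names, arm labels, threshold, and stratum variable are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical arm labels mirroring the control / Treatment A / Treatment B design.
ARMS = ("control", "treatment_a", "treatment_b")


def select_study_group(scores, threshold):
    """Return IDs whose predicted churn likelihood exceeds the threshold.

    `scores` maps customer ID -> predicted likelihood (assumed to come
    from a previously trained model, which is not shown here).
    """
    return [cid for cid, p in scores.items() if p > threshold]


def stratified_assign(customer_ids, stratum_of, seed=0):
    """Randomly assign customers to arms, balancing arms within each stratum.

    `stratum_of` maps customer ID -> stratum label (for example, an age band).
    Within a stratum, members are shuffled and dealt to the arms in turn,
    so arm sizes in that stratum differ by at most one.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    by_stratum = defaultdict(list)
    for cid in customer_ids:
        by_stratum[stratum_of[cid]].append(cid)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        arms = list(ARMS)
        rng.shuffle(arms)  # randomize which arms absorb any leftover members
        for i, cid in enumerate(members):
            assignment[cid] = arms[i % len(arms)]
    return assignment
```

Dealing shuffled members to the arms in turn is one simple way to obtain balance within strata; a real trial would typically also check covariate balance across the arms after assignment.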