Grammatical structure in developmental protein/DNA binding

Cells choose their identity through the combinatorial expression of transcription factors, proteins that bind to specific DNA sequences and turn sets of genes on and off. Our understanding of this cellular programming is rudimentary, but a more complete characterization could enable the conversion of one cell type into another, with transformative therapeutic consequences. We have devised a machine learning technique that identifies the genomic binding locations of a large number of transcription factors in a given cell state from an experimental dataset called DNase-Seq, and we have applied this algorithm to embryonic stem cells as they progressively commit to becoming pancreatic cells at the expense of alternative fates. Given this corpus, we would like to learn spatial and temporal grammars of the transcription factor binding co-occurrences that determine this developmental transition, and relate them to observed changes in gene expression. Candidate binding models will be tested using a novel massively parallel reporter assay, with the results fed back to inform new theories. Approaches for this project could include, but are not limited to, topic models, compressed sensing, constraint systems, and L1 regularization.
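As a concrete illustration of one of the approaches named above, the following is a minimal sketch of how L1 regularization could select a sparse set of binding co-occurrences that explain expression change. It is not the project's method. Everything in it is an assumption made for illustration: the factor names (TF_A to TF_D) are placeholders, the binding calls and expression changes are simulated, and the "true" rule (expression rises when TF_A and TF_B co-bind and falls when TF_D binds) is invented so that the fit has something to recover.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this is what the L1 penalty does."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=300):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # feature is never active; keep its weight at zero
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def build_features(binding_rows, tf_names):
    """Single-factor binding indicators plus pairwise co-occurrence indicators."""
    k = len(tf_names)
    pairs = [(a, b) for a in range(k) for b in range(a + 1, k)]
    names = list(tf_names) + [f"{tf_names[a]}&{tf_names[b]}" for a, b in pairs]
    X = [list(row) + [row[a] * row[b] for a, b in pairs] for row in binding_rows]
    return X, names


def simulate_and_fit(seed=0, n_genes=200, lam=0.02):
    """Simulate binding and expression change, then fit the sparse model."""
    rng = random.Random(seed)
    tf_names = ["TF_A", "TF_B", "TF_C", "TF_D"]  # placeholder factor names
    # Hypothetical binding calls: 1 if the factor binds near the gene, else 0.
    binding = [[rng.randint(0, 1) for _ in tf_names] for _ in range(n_genes)]
    # Invented ground truth: TF_A and TF_B co-binding activates, TF_D represses.
    y = [2.0 * row[0] * row[1] - 1.0 * row[3] + rng.gauss(0.0, 0.1)
         for row in binding]
    X, names = build_features(binding, tf_names)
    weights = lasso_coordinate_descent(X, y, lam)
    return dict(zip(names, weights))


if __name__ == "__main__":
    for name, weight in simulate_and_fit().items():
        print(f"{name:>12s} {weight:+.3f}")
```

The co-occurrence features here are simple pairwise products with no notion of spacing, order, or time point. A spatial or temporal grammar would need richer features, and the penalty strength `lam` would have to be chosen by validation against held-out data rather than fixed by hand as it is in this sketch.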
Contact: Charles W. O'Donnell