3 May 2019
Katie Pollard
Gladstone Institutes, UCSF &
Chan Zuckerberg Biohub
Machine learning is a popular statistical approach in many fields, including genomics. We and others have used a variety of supervised machine-learning techniques to predict regulatory enhancers and the genes that they activate, as well as to quantify and interpret the effects of sequence mutations on regulatory function. I will highlight a few of these studies, emphasizing the strengths and weaknesses of different predictive models and the biological insights gained via variable importance analysis. I will then discuss some of our recent work exploring the limitations of popular machine-learning methods in genomics, where the biology underlying the training data frequently violates one or both parts of the independent and identically distributed (IID) assumption. The talk will conclude with some thoughts on modeling non-IID data in systems biology and interpreting overfit models.