22 April 2022
Julia Zeitlinger
Stowers Institute for Medical Research
Kansas City
There has been a revolution in genomics technology, leading to exponential growth of multi-modal genomics data across different organisms, tissues and cell types. There is now a unique opportunity to harvest the information from these data in a unified learning paradigm. After decades of focusing on the mechanisms underlying gene expression, it is time to return to a concept that predates the rise of molecular biology and biochemistry: the understanding that biology has a DNA sequence basis. With the development of neural networks that predict genomics data from sequence, learning how gene regulation is encoded in DNA is now a feasible goal. It does, however, require a drastic departure from previous computational approaches and biological reasoning. In essence, the new learning paradigm requires inverted thinking. Traditionally, we take genomics datasets apart in a hypothesis-driven fashion and extract sequence rules one at a time to build more complex models. In the new paradigm, neural networks learn to predict the data from intact genomic sequences, allowing highly complex combinatorial rules to be learned inside a black box. Only once high accuracy is achieved are the relevant sequences and rules extracted from the model. This inverted learning paradigm is not limited by set biological assumptions, is inherently measured by performance, and directly allows the effects of disease mutations to be predicted. This unifying learning paradigm enables the comprehensive mapping of regulatory sequences for many different genomic assay modalities, providing knowledge on gene regulation that is bound to transform biology and medicine.