I’d like to take a deeper dive into a recently released paper, “Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation”, and some of its speculative implications, both in general and for its potential use in AAV gene construct design applications¹. The paper was published in September 2023 by a group at Calico (Johannes Linder, Divyanshi Srivastava, Han Yuan, Vikram Agarwal, and David R. Kelley) and, on the surface, appears to be an interesting paper about predicting RNA-seq coverage. With machine learning models being developed for every little task and dataset, it's easy to brush this off as another fancy model that successfully predicts values from an abundant dataset, but I gained more insight from it than that, so I’d like to share some of it.
One of the holy grails of genomics is understanding and predicting how DNA sequences relate to observable phenotypes. The messy steps in between (transcription, translation, and so on) are what hinder progress towards this goal. That’s likely part of the motivation behind this paper and the Borzoi model it describes. What I find interesting is that although the authors trained the model only to predict RNA-seq experiments, they were able to use the attention weights learned during that prediction to inform and predict many related tasks. They compare the model against specialized state-of-the-art models on various genetic prediction tasks (e.g., predicting expression levels due to additional binding motifs, effects of benign mutations, and identification of transcription factor regulators) and find favorable results. They even suggest that the model has perhaps indirectly learned the effects of codons and polyadenylation motifs on transcript stability.
Using RNA-seq Datasets to Unlock Insights into Identifying Gene Regulatory Elements
The parallel I find intriguing is that the Borzoi model wasn’t trained directly on these benchmark tasks; learning to predict RNA-seq datasets end to end is what allowed it to pick up many related tasks, like identifying gene regulatory elements. This seems very similar to how LLMs can learn to understand or interpret written language through masked language modeling. Take BERT (or DNABERT): it’s a model trained to predict masked portions of an abundant dataset (text sequences), and through that prediction task it indirectly learns embeddings that are relevant for many other tasks².
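To make the masked-prediction idea concrete, here is a minimal sketch of how a training example could be built from a DNA sequence. It is only an illustration under simplifying assumptions: the `mask_sequence` helper, the `N` mask symbol, and the 15% mask rate are choices made for this example, and real models such as BERT or DNABERT use their own tokenization and masking schemes.

```python
import random

MASK = "N"  # hypothetical mask symbol chosen for this sketch


def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a random subset of bases and return (masked_seq, targets).

    targets maps each masked position to the base the model must recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_fraction))
    positions = sorted(rng.sample(range(len(seq)), n_mask))

    chars = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = chars[pos]  # remember the true base
        chars[pos] = MASK          # hide it from the model
    return "".join(chars), targets


masked, targets = mask_sequence("ACGTACGTACGTACGTACGT")
print(masked)
print(targets)
```

The model only ever sees `masked` and is scored on how well it recovers `targets`, so anything it learns about sequence context comes from that single objective.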
I think this may be related to the idea of emergent properties that people usually attribute to language models. When training language models with these very complex datasets, the models have to learn many sub-tasks to effectively predict masked positions and the sub-tasks are allowed to overlap and share information since they’re using a shared parameter space. For state-of-the-art models with trillions of parameters, these sub-tasks and interconnections between them might just seem so vast, complex, and nuanced to us that we just call them “emergent properties”. It almost seems like we’re equating emergent properties to some kind of magic, like an emergence of consciousness in LLMs, or something inevitable that happens on its own. As a parallel, one might argue that the benchmark tasks of the Borzoi model demonstrate “emergent properties” in the DNA realm, although more interpretable.
Challenging Assumptions About Biological Data Dynamics
This is just speculation, but I think emergent properties shouldn’t be taken for granted, especially in the realm of biology. In my early years in the biotechnology industry, datasets were scarce and expensive. Things haven’t changed too much: even with the most recent advances in high-throughput technologies, language datasets and DNA datasets remain qualitatively different, and that difference merits a tailored approach. Language is structured, with rules made for the human mind to understand and interpret, and we can generate valid examples on demand. Biological data, however, lives in the realm of combinatorics. Its rules were not made to be easily understood by humans, and many problems have an effective search space that easily dwarfs the estimated number of atoms in the universe. Can we realistically expect emergent properties to come about from biological data in the same way as they do in language modeling tasks?
Perhaps the quality of an emergent property can vary, but I would dare to say that the only way we will get to a useful foundation model for biology problems is by training a model end to end with a very large dataset that contains very complex biological information. I think the Borzoi authors may have had this in mind when choosing to use RNA-seq data. RNA-seq counts, a proxy for mRNA expression, might be seen as a sort of fundamental data type for expression, or as a pre-training task that helps to learn embeddings for related tasks.
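As a rough back-of-the-envelope check on that search-space claim, the sketch below counts the possible DNA sequences of a given length (four bases per position) and compares that to the number of atoms in the observable universe. The figure of about 10^80 atoms is a commonly quoted order-of-magnitude estimate and is an assumption here, as are the helper names.

```python
import math

LOG10_ATOMS_IN_UNIVERSE = 80  # assumed order-of-magnitude estimate (~10^80)


def log10_sequence_space(length, alphabet_size=4):
    """log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


def min_length_exceeding(log10_target, alphabet_size=4):
    """Shortest sequence length whose sequence count exceeds 10**log10_target."""
    return math.floor(log10_target / math.log10(alphabet_size)) + 1


# Length at which the number of DNA sequences passes the atom estimate.
print(min_length_exceeding(LOG10_ATOMS_IN_UNIVERSE))

# Order of magnitude of the sequence space for a 5,000-base construct.
print(round(log10_sequence_space(5000)))
```

Under these assumptions, sequences only 133 bases long already outnumber the atom estimate, and a 5,000-base sequence has on the order of 10^3010 possibilities, which no dataset could sample more than a vanishing fraction of.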
Navigating the Information Hierarchy from DNA to Protein for Comprehensive Insights
Similarly, DNA-based models trained using masked language modeling (unsupervised modeling) seem to learn many subtasks as well, but what is less obvious is how relevant those subtasks are for predicting a desired task. Let’s take gene expression as an example. Is genomic data sufficient for predicting mRNA expression? There are factors like epigenetics or post-translational modifications that can set the state of a cell independently of the DNA sequence. So it makes sense that the Borzoi model was trained on RNA-seq data and not simply built on a pre-trained DNA model. However, there are failure points in the Borzoi model. The most notable one, I thought, was that it can’t really predict the effects of mutations that aren’t seen in the dataset (non-allelic variants). This tells me that the model still doesn’t have enough data, or that the data coverage isn’t enough, or that the data doesn’t contain the right information. As capable as it is in other regards, that failure hints at how vast the search space is. Is it even practical to gather enough data to make these kinds of predictions possible, or at some point are we just memorizing outcomes? I feel a different approach is needed, one that understands fundamental rules and limits a model’s ability to simply memorize.
This all may be a long-winded way of saying that if we are going to be using unsupervised learning in biological problems, we need to understand that there may not be one ideal dataset that covers all our needs, but that “moving up” on the chain, from DNA to RNA and predicting protein expression and function, would likely expose us to different parts of the information hierarchy. For example, masked DNA modeling may allow us to better build DNA sequences whereas RNA-seq prediction may allow us to understand the interactions between those pieces of expression. Would something further up the chain like protein expression, protein interactions, or post-translational modifications allow us to predict in more domains? Could we move “horizontally” to learn specialized tasks like those related to rAAV therapies?
Seeing as proteins interact with DNA to ultimately determine expression, would integrating all these datasets allow us to “close the loop” and create a more all-encompassing foundation model? Is there a way to integrate an unsupervised learning task for DNA and expression data in a more unified way?
Leveraging Learnings from the Borzoi Model to Optimize AAV Construct Design
Since Form Bio specializes in solutions for rAAV therapies, I tend to look at this through a different lens. If we wanted to predict gene expression in rAAV therapy constructs rather than in the human genome context, the Borzoi model could be useful in that it can help to identify enhancers and regulatory elements relevant to a target gene sequence. Still, it is trained in the context of the human genome, and what it has learned may not generalize to a 5 kb rAAV construct. In a sense, the vast amounts of public RNA-seq data may not be relevant outside of the human genome context. Distances between regulatory elements matter. So does the context in which a delivered rAAV construct exists, as opposed to being integrated in the genome.
The Borzoi paper itself states that one of the downsides of the model is that it isn’t good at predicting the effects of mutations that are not part of normally seen alleles. That tells me that even in its own context it still doesn’t have enough data to generalize, so how can we hope to use this model and its learned knowledge for our rAAV tasks? One solution is to generate rAAV-specific RNA-seq datasets. Training a model on an RNA-seq prediction task, as Borzoi does, and then fine-tuning it on one of our proprietary rAAV construct datasets is one of the only viable paths towards a useful model in the near term.