There has been a lot of progress in applying large language models (LLMs) to computational biology. LLMs, which are commonly trained on a large corpus of text, can also be trained on DNA sequences. This can be as simple as swapping the text data for tokenized DNA sequences during the normal pre-training steps of many popular architectures, but it often involves more DNA-specific modifications such as kmerization and/or special masking techniques. The idea is that deep learning models learn relationships between the nucleotides in DNA sequences in a similar way to how they learn relationships between words. This sounds good in principle, but there are many details that need to be considered when using this kind of technique.
Read along in my blog as I share my perspectives on all this.
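To make the kmerization idea from the introduction concrete, here is a minimal Python sketch. The function name `kmerize` and its `stride` parameter are illustrative choices, not taken from any particular model's codebase; real tokenizers also handle special tokens, ambiguous bases, and vocabulary lookup, which this sketch leaves out.

```python
from typing import List


def kmerize(sequence: str, k: int = 6, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers.

    stride=1 yields overlapping k-mers; stride=k yields non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


seq = "ATGCGTAC"  # toy sequence, for illustration only
overlapping_kmers = kmerize(seq, k=3, stride=1)
non_overlapping_kmers = kmerize(seq, k=3, stride=3)

print(overlapping_kmers)      # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(non_overlapping_kmers)  # ['ATG', 'CGT']
```

The choice between overlapping and non-overlapping k-mers matters later on: it changes how many bases a fixed number of tokens can cover.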
DNA LLM Advancements and Limitations to Solve Hard Tasks
When we take a look at DNABERT, one of the early LLMs trained on DNA sequences, we see a BERT architecture (an encoder-only model) with a relatively small context size of 512 tokens, a parameter count of less than 100 million, and training on only human genome data [1]. DNABERT did rather well on several tasks presented in the original paper, such as splice site finding, promoter identification, and transcription factor site identification [1]. As this was one of the first models, the results seemed promising enough for groups to move on to bigger and better models.
Then, early in 2023, the Nucleotide Transformer (NT) model was released [2]. NT is a much larger language model that appears to use a modified ESM (Evolutionary Scale Modeling) architecture. It was trained on a larger, more diverse dataset and, most importantly, it was benchmarked on a wider range of tasks. In the NT paper, the authors tested models of different sizes and with different diversity of DNA training data. Interestingly, some benchmark tasks, such as promoter and splice site identification, are relatively “easy” for pretty much all models.
At the other end, there were some tasks, such as identification of enhancer regions or methylation sites, where changing the model size or training diversity didn’t seem to affect performance.
There are also tasks in between where the larger, more diverse models do show an advantage.
These results suggest a few things. First, different DNA feature prediction tasks likely span a wide range of complexity: promoters are easy, enhancers are hard. Second, certain tasks, such as identification of enhancers, seem to be indifferent to model size and do not show improvement with increased model complexity.
This suggests to me that the parts of the learned embeddings that are relevant to certain tasks are learned relatively easily and quickly, but then level out, never developing into a representation that can solve the problem completely.
It’s as if only partial information is being learned from the data for that particular task, and any additional training then goes towards learning other features of the DNA. This leads to a natural question:
Why? These models seem very capable: they have many parameters and they use a lot of diverse data. So what’s the limitation? What could be preventing the model from solving these hard tasks?
Context Size and Training Strategies Greatly Impact Model Learning and Performance
It may be the context size. Perhaps the training data just doesn’t contain enough relevant information for the particular task because the input lengths are too short. The context size of the NT model is 1000 tokens, which with its 6-mer tokenization covers roughly 6,000 base pairs, so any identifying interactions that define an enhancer and lie farther apart than that would not be captured.
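How many base pairs a context window actually covers depends on how the sequence is tokenized, so a limit counted in tokens is not the same as a limit counted in bases. The sketch below is a back-of-the-envelope calculation only. The helper `context_span_bp` and the tokenization settings passed to it are assumptions chosen for illustration, and special tokens are ignored.

```python
def context_span_bp(n_tokens: int, k: int, stride: int) -> int:
    """Number of bases covered by n_tokens k-mers taken every `stride` bases."""
    if n_tokens <= 0:
        return 0
    return (n_tokens - 1) * stride + k


# Assumed settings, for illustration only (not read from any model config).
span_non_overlapping = context_span_bp(n_tokens=1000, k=6, stride=6)
span_overlapping = context_span_bp(n_tokens=512, k=6, stride=1)

print(span_non_overlapping)  # 6000
print(span_overlapping)      # 517
```

Under these assumptions, two interacting elements separated by more than the printed span can never appear in the same input, no matter how large the model is.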
Or perhaps the type of masked training may be biasing the learned embeddings towards particular tasks. Both DNABERT and NT use a variant of token dropout (masked language modeling). Although token dropout seems like a very capable and reasonable training method, there are many others, such as next token prediction, n-gram masking, token deletion, permutation language modeling, etc. One can imagine that masking techniques relying on annotations of the DNA sequences could be used to mask out entire DNA elements. This might result in embeddings that are better equipped for prediction tasks involving certain DNA elements. One could also imagine applying a denoising objective, which might favor the training of embeddings that better understand potential variations or canonical regions of elements. I can’t say what type of technique would be best for a particular task, but my point is that the way you do your training may be very important to what the model learns. This seems obvious, but the implications can be significant when trying to optimize a DNA-based LLM for a particular task.
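The difference between random token masking and annotation-guided masking can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used by DNABERT or NT: the function names, the `[MASK]` string, the toy sequence, and the annotated interval are all hypothetical, and the random variant omits details of real masked language modeling such as sometimes keeping or randomly replacing the selected token.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"  # placeholder mask token, assumed for this sketch


def random_token_mask(tokens: Sequence[str], mask_prob: float = 0.15,
                      seed: int = 0) -> List[str]:
    """Mask each token independently with probability mask_prob."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else t for t in tokens]


def element_mask(tokens: Sequence[str],
                 elements: Sequence[Tuple[int, int]]) -> List[str]:
    """Mask whole annotated elements, given as half-open (start, end) index ranges."""
    masked = list(tokens)
    for start, end in elements:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = MASK
    return masked


# Toy example: single-base tokens with one hypothetical annotated element.
tokens = list("ACGTTATAAAGGCC")
annotated_elements = [(4, 10)]  # covers "TATAAA" in this toy sequence

random_masked = random_token_mask(tokens, mask_prob=0.15, seed=0)
element_masked = element_mask(tokens, annotated_elements)

print(" ".join(random_masked))
print(" ".join(element_masked))
```

With random masking the model can usually recover a hidden base from its immediate neighbors, while masking a whole element forces it to rely on the surrounding context, which is one reason the two objectives might lead to different embeddings.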
Finally, the training data may simply not contain enough examples for the particular task you’re attempting to train the model for. The features behind some tasks may be “rare” relative to those behind others. This means that training on something like the human genome, without modification, may amount to using a severely unbalanced dataset.
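One common way to compensate for imbalance in a downstream task is to weight rare classes more heavily in the loss. The sketch below computes inverse-frequency weights using the same heuristic as scikit-learn's "balanced" class weights (number of samples divided by number of classes times the class count). The label names and counts are made up for illustration, and this is only one possible response to imbalance, not something prescribed here.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical label distribution: 95 background windows, 5 enhancer windows.
labels = ["background"] * 95 + ["enhancer"] * 5
class_weights = inverse_frequency_weights(labels)

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

In this toy setup the rare class ends up with a weight of 10.0 and the common class with roughly 0.526, so each class contributes equally to the total loss.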
So yes, these DNA-based large language models may be very powerful and they hold a lot of potential, but there also seems to be a lot of room for optimization if you want to use them for your particular task. One could think of differently pretrained models as different sources of information. Models can also be thought of as datasets, since a model can essentially be used as a function to create (DNA sequence)-(DNA feature) pairs. This view is interesting, since it means that validation of pretrained embeddings becomes very important when attempting to utilize them for downstream tasks.
Efforts like the one from Wu et al. are an interesting approach to comparing the relevance of datasets to tasks [3]. This view also affects strategies like multitask networks, since the performance of such approaches is sensitive to the relevance of the included tasks. For example, if you’re attempting to use a pretrained, DNA-based language model, but your model hasn’t learned the right features for the task, then those embeddings might be an amalgam of different tasks, being used like a highly suboptimal multitask network.
Form Bio’s Commitment to Quality DNA LLM Development
At Form Bio, we are tackling these problems head-on by prioritizing the quality and quantity of all input data we use to train our own DNA LLMs. We are developing methods to quantify how relevant datasets, models, and functions are to predicting our target tasks. We’ve made significant progress towards optimized models for improving gene delivery constructs and are constantly expanding our understanding and intuition of how to explore the massive search space surrounding different training methods for DNA LLMs.
AI Disclosure: The feature image was generated by MidJourney, an AI image tool.