New Study Finds Large Language Models Applied to DNA Sequences Enabled Accurate Molecular Phenotype Prediction
Recent published research is bringing us one step closer to predicting molecular phenotypes from a DNA sequence. Read this blog to learn more.
Austin Day, PhD
February 10, 2023
Most computational life science research has centered on protein sequence-based models and their application to predicting structure, binding, or other biophysical properties. In a recently published manuscript, however, researchers explored using large language models (LLMs) trained exclusively on DNA sequences [1]. The study found that these pre-trained models effectively captured key regulatory genomic elements, notably enhancers and promoters. In testing combinations of model size and dataset diversity, the authors found that increasing one alone was not enough: both larger models and more diverse data were necessary to improve sequence reconstruction performance. The foundation models explored in this study give confidence that future applications can accurately predict molecular phenotypes from DNA sequence alone. Read below for a quick breakdown of the methods and findings of this important work.
Creation of Foundation Language Models for DNA Sequence Data
There has been a recent explosion of transformer-based models, particularly for modeling language. These large language models (LLMs) are interesting because they undergo unsupervised training on extremely large amounts of text and learn syntax and relationships between distant features in the input sentences, much like the long-range dependencies found in DNA and protein sequences. Because of this, many groups have been attempting to use LLMs to learn from biological sequence data as well, essentially treating single amino acids or k-mers (short stretches of k consecutive DNA bases) as “words” in a vocabulary.
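To make the "k-mers as words" idea concrete, here is a minimal Python sketch of k-mer tokenization. The function names, the special tokens, and the choice of non-overlapping k-mers are assumptions made for illustration; this is not the tokenizer from the manuscript.

```python
from itertools import product

def build_vocab(k: int = 6) -> dict[str, int]:
    """Map every possible k-mer over A/C/G/T to an integer id.

    The special tokens are hypothetical placeholders for this sketch.
    """
    specials = ["<pad>", "<unk>"]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {tok: i for i, tok in enumerate(specials + kmers)}

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA string into non-overlapping k-mers.

    A trailing fragment shorter than k is kept as its own token.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]

def encode(seq: str, vocab: dict[str, int], k: int = 6) -> list[int]:
    """Turn a DNA string into token ids; unknown tokens map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in kmer_tokenize(seq, k)]

# Toy example with k=3 so the vocabulary stays small (4**3 k-mers + 2 specials).
vocab = build_vocab(k=3)
tokens = kmer_tokenize("ACGTACGTAC", k=3)
ids = encode("ACGTACGTAC", vocab, k=3)
print(tokens)
print(ids)
```

In this toy scheme the leftover base at the end of the sequence falls outside the k-mer vocabulary and is mapped to `<unk>`; a real tokenizer would need an explicit policy for such fragments and for ambiguous bases.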
In this manuscript, Dalla-Torre et al. created four foundation models of various sizes for DNA sequence data, ranging from 500M to 2.5B parameters. These models were trained on three different datasets encompassing the human reference genome, a collection of 3,200 diverse human genomes, and 850 genomes from several species (Table 6 from manuscript).
They split their datasets into a human reference genome dataset (500M), a 1000 Genomes Project dataset (1000G), and a multispecies dataset. The goal was to see the effect of both data diversity and model size on the performance of their final models.
The models they used were all encoder-only transformers. An embedding layer transformed each sequence of tokens into a sequence of embeddings, and a learnable positional encoding was added to those embeddings to provide positional information. The maximum input size was 1000 tokens. The embeddings were then fed into a transformer stack, where each layer sent its input through a normalization layer followed by a multi-head self-attention layer. The output of the self-attention layer was summed with the transformer layer input through a skip connection. The result was then passed through a new normalization layer and a two-layer perceptron with GELU activations. The hyperparameters used are shown in the table below:
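Independent of the exact hyperparameters, the layer structure described above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: the sizes are tiny, weights are random, biases and learnable normalization parameters are omitted, and the second skip connection around the perceptron is an assumption based on the standard pre-normalization transformer design.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token embedding to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_layer(d_model, d_ff, rng):
    """Random weights for one layer (hypothetical initialization)."""
    shapes = {"wq": (d_model, d_model), "wk": (d_model, d_model),
              "wv": (d_model, d_model), "wo": (d_model, d_model),
              "w1": (d_model, d_ff), "w2": (d_ff, d_model)}
    return {name: rng.normal(0.0, 0.02, shape) for name, shape in shapes.items()}

def self_attention(x, p, n_heads):
    """Multi-head self-attention over a (seq_len, d_model) input."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    context = softmax(scores) @ v                         # (heads, seq, d_head)
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["wo"]

def encoder_layer(x, p, n_heads):
    # Normalization -> self-attention -> skip connection with the layer input.
    x = x + self_attention(layer_norm(x), p, n_heads)
    # New normalization -> two-layer perceptron with GELU.
    # The skip connection here is an assumption (standard pre-norm design).
    return x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]

def embed_sequence(token_ids, tok_emb, pos_emb, layers, n_heads):
    """Token embedding + positional embedding, then the transformer stack."""
    x = tok_emb[token_ids] + pos_emb[: len(token_ids)]
    for p in layers:
        x = encoder_layer(x, p, n_heads)
    return x

rng = np.random.default_rng(0)
vocab_size, max_len, d_model, d_ff, n_heads, n_layers = 66, 1000, 16, 64, 4, 2
tok_emb = rng.normal(0.0, 0.02, (vocab_size, d_model))
pos_emb = rng.normal(0.0, 0.02, (max_len, d_model))  # one row per position
layers = [init_layer(d_model, d_ff, rng) for _ in range(n_layers)]
out = embed_sequence([8, 51, 46, 1], tok_emb, pos_emb, layers, n_heads)
print(out.shape)  # one d_model-sized embedding per input token
```

In a trained model the token embeddings, positional embeddings, and layer weights would all be learned during pre-training; here they only demonstrate the data flow and the resulting shapes.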
The group pre-trained their models on these large datasets, then used the models to obtain embeddings, or representations, of their input sequences. They reduced the size of those embeddings through dimensionality reduction techniques and used the reduced embeddings as features for 18 DNA prediction tasks, outperforming the state of the art (SOTA) on 15 of them. These tasks and the results of the models are shown below.
Dataset Diversity and Increase in Model Size Are Needed to Effectively Capture Key Features of DNA
The research conducted by this group yielded unique insights about large language models (LLMs) in predicting key features of DNA. These findings indicate that the best performance for a given task depends on both the model and the layer being used [2]. The researchers tested two approaches: probing and fine-tuning. In probing, they used the embeddings of DNA sequences as features for simpler models trained on the prediction tasks. For example, they took embeddings from ten arbitrary layers of their LLMs and used them as features in a logistic regression model or a small multilayer perceptron (MLP). With this probing technique, their models outperformed 11 out of 18 baseline models. They then moved to fine-tuning, in which all or part of the model is updated while training on the new task. Using a technique from Liu et al. that drastically reduces compute requirements, they were able to outperform the baselines on 15 out of 18 tasks.
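As a rough illustration of probing, the sketch below fits a logistic regression on frozen embeddings. Everything in it is an assumption made for the example: the embeddings are synthetic random arrays standing in for one layer's output, the labels are invented, mean pooling is just one possible way to get a single vector per sequence, and the plain gradient descent loop replaces whatever training setup the authors used.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Collapse (n_seqs, seq_len, d_model) embeddings to (n_seqs, d_model)."""
    return token_embeddings.mean(axis=1)

def fit_logistic_probe(X, y, lr=0.5, steps=500):
    """Fit a logistic regression with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                             # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Synthetic stand-in for frozen LLM embeddings (NOT real model output).
rng = np.random.default_rng(0)
n_seqs, seq_len, d_model = 200, 10, 8
labels = rng.integers(0, 2, size=n_seqs)            # hypothetical binary task
frozen = rng.normal(size=(n_seqs, seq_len, d_model))
frozen[:, :, 0] += 2.0 * labels[:, None] - 1.0      # plant a class signal in dim 0

features = mean_pool(frozen)
train_X, test_X = features[:150], features[150:]
train_y, test_y = labels[:150], labels[150:]

w, b = fit_logistic_probe(train_X, train_y)
accuracy = (predict(test_X, w, b) == test_y).mean()
print(f"probe accuracy on held-out sequences: {accuracy:.2f}")
```

The key property of probing is visible in the code: the embeddings are computed once and never updated, and only the small classifier on top is trained. Fine-tuning, by contrast, would also update some or all of the weights that produce the embeddings.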
By analyzing attention maps, embedding spaces, and probability distributions, the study found that the pre-trained models were able to effectively capture key features of DNA. By testing combinations of model size and dataset diversity, the group also found that improving sequence reconstruction performance requires an increase in both model size and data diversity. The work further suggests that more complex prediction tasks in the future will need longer input lengths, and that more efficient language modeling techniques such as sparse attention should be implemented.
At Form Bio, we are leveraging the learnings from this research to develop the most accurate and safe DNA large language models to accelerate the development and manufacturing of gene therapies. And as you know, this field changes at breakneck speed. Click here for my latest blog on this topic.