For several years now, transformer-based large language models (LLMs) have been the state of the art in many natural language processing (NLP) tasks such as translation, summarization, and sentiment analysis. Having displaced earlier architectures like recurrent neural networks (RNNs), convolutional neural networks (CNNs), and long short-term memory networks (LSTMs), LLMs have since spilled over into many other fields, notably including biology. Language is interesting in that it is essentially a representation of knowledge through syntax and context, and both natural language and biological sequences rely on very similar structures. At the simplest level, nucleotides and amino acids can be thought of as the letters in the language of biology, and LLMs can learn this language through self-supervised, masked training techniques on large amounts of data. Many examples of this crossover have already emerged, such as ESM-21, DNABERT2, and scBERT3.
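The masked training objective mentioned above can be sketched in a few lines. The function below is an illustrative simplification, not any particular model's pipeline: it tokenizes a DNA sequence at the nucleotide level and hides a fraction of positions, which the model must then reconstruct from context.

```python
import random

def make_masked_example(seq, mask_frac=0.15, mask_token="[MASK]", seed=0):
    """Build a BERT-style masked-language-modeling example from a DNA sequence.

    Returns (tokens, labels): tokens is the sequence with some positions
    replaced by mask_token; labels holds the original nucleotide at masked
    positions and None everywhere else (only masked positions are scored).
    """
    rng = random.Random(seed)
    tokens = list(seq)
    labels = [None] * len(tokens)
    n_mask = max(1, int(len(tokens) * mask_frac))
    for i in rng.sample(range(len(tokens)), n_mask):
        labels[i] = tokens[i]   # remember the hidden nucleotide
        tokens[i] = mask_token  # hide it from the model
    return tokens, labels

tokens, labels = make_masked_example("ACGTACGTACGTACGTACGT")
```

The key point is that no labels beyond the sequence itself are needed; the supervision signal is manufactured by the masking, which is what makes huge unannotated genomic corpora usable for pre-training.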
DNA Feature Prediction Performance as a Proxy for Success of Transformer-Based Large Language Models
The appeal for biology follows directly from how these models are trained. The capacity of deep learning models to represent knowledge through syntax and context extends naturally to genomics, where large unlabeled datasets can be leveraged through self-supervised pre-training and then fine-tuned to capture deeper insights, no manual annotation required. Model scale has been a major driver of performance in NLP, and there is good reason to expect similar gains on biological sequences.
The reasoning behind why LLMs can be applied to biological sequences is compelling, but the real test is how they perform. A good example in the protein space is the already mentioned ESM-2, which has subsequently been used for protein structure prediction and has been shown to contain information on various protein biophysical properties in its embeddings1. Arguably more relevant for gene therapy are LLMs trained on DNA/RNA sequences, as many gene delivery constructs are designed and delivered as DNA sequences. On that front, a great example of how LLMs can be useful is the Nucleotide Transformer paper4, where an LLM trained on the genomes of over 800 species was benchmarked on 18 different tasks related to methylation site, enhancer, promoter, or splice site prediction. Performance varied across tasks, but it is evident that a pre-trained LLM learns features of DNA structure and context that transfer to non-obvious DNA feature prediction tasks.
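The benchmarking pattern used in such papers, reuse a frozen pre-trained encoder and train only a small head per task, can be sketched as follows. Here `frozen_embed` is a deterministic k-mer stand-in for a real pre-trained encoder such as the Nucleotide Transformer (not its actual API), and the toy GC-rich/AT-rich task is invented purely for illustration.

```python
import math

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def frozen_embed(seq, dim=16):
    """Stand-in for a frozen pre-trained encoder: mean-pooled 3-mer counts.

    A real workflow would call a pre-trained DNA LLM here and keep its
    weights fixed; only the small head below is trained.
    """
    vec = [0.0] * dim
    for i in range(len(seq) - 2):
        idx = sum(NUC[c] * 4**j for j, c in enumerate(seq[i:i + 3])) % dim
        vec[idx] += 1.0
    n = max(1, len(seq) - 2)
    return [v / n for v in vec]

def train_head(seqs, labels, dim=16, lr=0.5, epochs=300):
    """Logistic-regression head on frozen embeddings, plain gradient descent."""
    w, b = [0.0] * dim, 0.0
    xs = [frozen_embed(s, dim) for s in seqs]
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of cross-entropy w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(seq, w, b):
    x = frozen_embed(seq, len(w))
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Invented toy task: separate GC-rich from AT-rich sequences.
train_seqs = ["GCGCGCGCGCGC", "CGCGCGCGCGCG", "ATATATATATAT", "TATATATATATA"]
train_y = [1, 1, 0, 0]
w, b = train_head(train_seqs, train_y)
```

Because the encoder is frozen, only a handful of head parameters are trained, which is why this setup needs far less labeled data than training a model from scratch.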
Strategies for Designing Optimized Gene Therapy DNA Constructs
This idea opens up many possibilities in the area of gene therapies. In the widely used AAV9 vector, DNA constructs composed of inverted terminal repeats (ITRs), promoters, enhancers, and protein coding regions must be easily manufacturable, meaning low contaminants, low truncation products, and a high yield of full-length DNA in capsids. Many factors go into this optimization, but an important one is the design of the DNA construct itself. The secondary structure of the DNA construct is thought to be an important determinant of truncation propensity in AAV production5,6. Meta's ESM model has shown how LLMs trained on protein sequences can be incredibly powerful for predicting protein structure, and although an equivalent demonstration doesn't yet exist for DNA/RNA structure, it is arguably a similar problem and may be solved with similar methods1. More generally, if LLMs trained on DNA sequences can produce features that help identify different functional sequences of DNA, then such a model could plausibly be fine-tuned on a particular prediction task, such as identifying the sequences that contribute to truncation propensity.
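To make the secondary-structure intuition concrete, here is a deliberately crude, non-ML heuristic for the kind of signal such a model might learn: scan for windows that are followed closely by their reverse complement, the classic signature of a hairpin stem. The thresholds (`stem`, `max_loop`) are illustrative assumptions, not validated parameters, and real structure prediction is far more involved.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(seq))

def hairpin_sites(seq, stem=6, max_loop=8):
    """Return start positions where a stem-long window is followed, within
    max_loop bases, by its reverse complement, a crude proxy for
    hairpin-prone (and thus potentially truncation-prone) regions."""
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        target = revcomp(seq[i:i + stem])
        # search region: everything from the end of the stem to the farthest
        # point a complementary stem could start given the loop limit
        window = seq[i + stem : i + stem + max_loop + stem]
        if target in window:
            hits.append(i)
    return hits

# A designed hairpin: 6-bp stem, 4-base loop, complementary 6-bp stem.
sites = hairpin_sites("GGGGCCTTTTGGCCCC")
```

A fine-tuned DNA LLM would ideally learn far subtler versions of this signal directly from production data rather than from a hand-written rule.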
Overcoming Data Scarcity in Gene Therapy Development: Leveraging Abundant Data from Related Tasks to Reduce Data Dependence
There is another layer to consider in the use of LLMs to aid gene therapy development. In machine learning, data is king, but data is also often in short supply, especially when expensive cell-based assays or long turnaround times are involved. LLMs can help mitigate this problem because, by the way they are trained, they effectively act like multi-task networks, which have been shown to reduce the amount of data needed for a particular task by utilizing more abundant data from related tasks7. The understanding necessary to achieve the masked training objective in BERT-like models is effectively an understanding of many overlapping DNA features that all contribute to the identity of a particular masked sequence region. The model must learn multiple objectives simultaneously, and beyond that, one can add explicit tasks during pre-training or fine-tuning to enhance this effect further. This changes the way we can think about solving a particular problem. For example, if you need to predict the level of contaminants in an AAV production run, a very relevant piece of data would be the DNA sequence of the construct, but other types of data may also be related or may provide a shortcut in understanding. One possible shortcut toward predicting contaminants is predicting DNA secondary structure elements, as they are likely directly related to truncation propensity. A related task could be predicting expression levels in mammalian cells from DNA sequence, a task with much more available data. Protein expression may not be obviously related to AAV contaminants, but the factors that let a DNA sequence express well may be partially related to its secondary structure, repeats, or other relevant features, so that expression data may still contain some information relevant for predicting truncations as well.
In a sense, solving a problem might be thought of as engineering a flow of relevant information.
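A minimal sketch of the multi-task idea, with all names, toy features, and toy targets invented for illustration: a shared linear encoder feeds one head per task, and every example's gradient flows into the shared encoder, so abundant data for one task shapes the representation the data-poor task uses.

```python
import random

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def multitask_train(data, in_dim=4, hid=3, lr=0.05, epochs=1000, seed=0):
    """Train a shared encoder E with one linear head per task (squared loss).

    data: list of (x, task, y) triples; task selects which head is used,
    but the shared encoder E is updated by every example.
    """
    rng = random.Random(seed)
    E = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hid)]
    tasks = sorted({t for _, t, _ in data})
    heads = {t: [rng.uniform(-0.5, 0.5) for _ in range(hid)] for t in tasks}
    for _ in range(epochs):
        for x, task, y in data:
            h = matvec(E, x)
            w = heads[task]
            err = sum(wi * hi for wi, hi in zip(w, h)) - y
            # head gradient: err * h ; shared-encoder gradient: err * outer(w, x)
            heads[task] = [wi - lr * err * hi for wi, hi in zip(w, h)]
            for i in range(hid):
                for j in range(in_dim):
                    E[i][j] -= lr * err * w[i] * x[j]
    return E, heads

def mt_predict(E, heads, task, x):
    return sum(w * h for w, h in zip(heads[task], matvec(E, x)))

def mt_loss(E, heads, data):
    return sum((mt_predict(E, heads, t, x) - y) ** 2 for x, t, y in data)

# Toy setup: "expression" depends on the first two features,
# "truncation" on the last two; both tasks share the encoder.
examples = []
for x in ([1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [1,1,0,0], [0,0,1,1]):
    examples.append((x, "expression", float(x[0] + x[1])))
    examples.append((x, "truncation", float(x[2] + x[3])))

E, heads = multitask_train(examples)
```

The design choice worth noticing is that only the heads are task-specific; the shared encoder is where information flows between tasks, which is the mechanism behind the data savings cited above.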
Boosting Performance: Effective Optimization Techniques for Accelerating LLMs in Gene Therapy Development Programs
The use of large language models in various applications comes with several technical challenges. One of the main challenges is model size, which requires significant computational resources for training and inference: large models are more expensive, slower, and more difficult to train than smaller ones. There are various ways to mitigate this problem, such as creating smaller models tailored to a specific task, as was done with DistilBERT12. Another is to use specialized hardware, such as graphics processing units (GPUs) and tensor processing units (TPUs), to speed up training and inference. Additionally, specialized software such as DeepSpeed13, Fairscale14, TensorRT15, and Triton16 can improve training and inference speed, while tools like ChatGPT/GitHub Copilot17,18 and PyTorch Lightning19 can shorten development time. Specific to neural networks, one can also start from a network pre-trained on a similar task and fine-tune it for the task of interest.
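As one concrete example of the model-shrinking idea behind DistilBERT, knowledge distillation trains a small student to match a large teacher's softened output distribution. The sketch below is a bare-bones illustration with an invented stand-in "teacher" (just fixed logits); real distillation also mixes in the original training loss and tunes the temperature.

```python
import math

def softmax(zs, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    m = max(z / T for z in zs)
    e = [math.exp(z / T - m) for z in zs]
    s = sum(e)
    return [v / s for v in e]

def distill(xs, teacher_logits, n_classes=2, dim=2, T=2.0, lr=0.1, epochs=500):
    """Train a linear student to match teacher soft labels.

    For a softmax student under a cross-entropy-to-teacher objective, the
    gradient with respect to the student logits is p_student - p_teacher.
    """
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for x, t_logits in zip(xs, teacher_logits):
            p_t = softmax(t_logits, T)
            s_logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
            p_s = softmax(s_logits, T)
            for k in range(n_classes):
                g = p_s[k] - p_t[k]
                W[k] = [w - lr * g * xi for w, xi in zip(W[k], x)]
    return W

# Invented toy teacher: confident about a different class for each input.
xs = [[1.0, 0.0], [0.0, 1.0]]
teacher_logits = [[4.0, 0.0], [0.0, 4.0]]
W = distill(xs, teacher_logits)
```

The student here has only four parameters yet reproduces the teacher's soft predictions on these inputs, which is the essence of how a distilled model trades size for a small loss in fidelity.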
This is only a beginner’s view of the kinds of optimizations one could implement to speed up the use of LLMs. In reality, more advanced techniques are constantly being introduced and iterated upon. For example, groups have assembled networks for complex tasks from pieces of other pre-trained networks, a sort of neural architecture search that can improve training times8. Techniques like those implemented in Perceiver IO20 and DeepNorm21 allow for longer input lengths or deeper networks9,10. The UL2 method aims to reduce pre-training time by adding extra pre-training steps with a different objective11.
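To illustrate what "extra pre-training steps with a different objective" can look like in practice, the sketch below builds training pairs under two simplified UL2-style denoising modes: span corruption and prefix continuation. The sentinel token, span length, and mixing ratio are illustrative choices, not the paper's exact settings.

```python
import random

SENTINEL = "<X>"

def span_corrupt(tokens, span_len, rng):
    """Span denoising: hide one contiguous span behind a sentinel;
    the target is the hidden span."""
    start = rng.randrange(len(tokens) - span_len)
    inputs = tokens[:start] + [SENTINEL] + tokens[start + span_len:]
    target = tokens[start:start + span_len]
    return inputs, target

def prefix_lm(tokens, rng):
    """Sequential denoising: keep a prefix, predict the continuation."""
    cut = rng.randrange(1, len(tokens))
    return tokens[:cut], tokens[cut:]

def mixture_example(tokens, seed=0):
    """Sample one denoiser per example, as in a UL2-style mixture."""
    rng = random.Random(seed)
    if rng.random() < 0.5:
        return ("span", *span_corrupt(tokens, span_len=3, rng=rng))
    return ("prefix", *prefix_lm(tokens, rng))

mode, inputs, target = mixture_example(list("ACGTACGTACGT"))
```

Mixing objectives this way exposes the model to several corruption patterns at once, which is the mechanism UL2 uses to get more varied supervision out of the same pre-training data.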
At Form Bio, we recognize the potential and importance of leveraging any and all technologies available to accelerate the development of life-saving therapies. Large language models are just one of the promising tools available, but we recognize that flexibility is needed in utilizing AI/ML techniques as the field moves incredibly fast. By staying at the forefront of this technology, Form Bio aims to be first in applying cutting-edge technologies to new and important applications in gene and cell therapies.