Understanding How Best to Allocate Resources When Training Large Language Models in Gene Therapy Development Programs
DeepMind researchers released a large language model (LLM) that uses fewer parameters yet achieves better performance. Read more to understand how this is relevant to your gene therapy programs.
Nick Ketz, PhD
January 31, 2023
Researchers at DeepMind released a large language model (LLM) called Chinchilla that used fewer parameters than its predecessors but achieved better performance by processing more data rather than training more parameters.
By training over 400 models and fitting a power-law function to the results, the researchers show that collecting more data, rather than adding parameters, is the key to better performance.
Because the cost of generating and processing genomic and proteomic data continues to fall, obtaining and using more data is likely to be more time- and cost-effective than training more parameters in gene therapy LLM development programs.
Gene therapy development programs often face the challenge of allocating resources effectively to maximize results. When it comes to training language models that are large in size or scope, the task can become especially daunting due to computational costs, limited data availability, and other obstacles. Recently, DeepMind researchers released a large language model that uses fewer parameters yet achieves better performance. Learn how this approach can help you understand how best to allocate large language model (LLM) resources in gene therapy development programs.
How DeepMind’s LLM outperforms its predecessors
Last year researchers at DeepMind released a large language model (LLM) that outperformed OpenAI’s GPT-3 and even Microsoft’s enormous Megatron model. Usually this would be made possible by some new architecture design increasing the model complexity, or simply by making the model bigger by adding more trainable parameters1,2. In fact, that is exactly what they did the previous year with an LLM called Gopher, which had 280 billion trainable parameters compared to GPT-3’s 175 billion3. What was interesting is that DeepMind’s latest model, called Chinchilla, used approximately the same Transformer design as GPT-3 and Gopher, but used fewer parameters! How was this possible?
Fewer parameters and more data = better performance
The simple answer is that they used more data, a lot more data. The key insight was that the compute usage for training an LLM was non-optimal: compute cycles would be better spent processing more data than training more parameters. You can think of compute-FLOPs (shown below) as proportional to the product of the number of parameters in a model, N, and the number of data samples that need to be learned, D.
\[\mathrm{FLOPs}(N, D) = N \cdot D \cdot \mathrm{constant}\]
How best to balance these two terms, N and D, was largely ignored until recently, and approximately the same dataset size of 300B tokens was used for all LLMs4. The Chinchilla paper sought to determine empirically the optimal N and D that, for a fixed compute budget C, achieve the minimum training loss L:

\[N_{opt}(C),\, D_{opt}(C) = \underset{N,\,D \;:\; \mathrm{FLOPs}(N,D)=C}{\arg\min}\; L(N, D)\]
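A back-of-envelope sketch of this allocation problem in Python, assuming the commonly cited approximation FLOPs ≈ 6·N·D and the Chinchilla finding that tokens should scale roughly in proportion to parameters (about 20 tokens per parameter is the rule of thumb implied by the 70B-parameter, 1.4T-token result); the function name and the exact ratio are illustrative, not from the paper's code:

```python
def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a fixed FLOPs budget between parameters N and tokens D.

    Assumes FLOPs ~= 6 * N * D and a fixed target ratio D/N
    (~20 tokens per parameter, a Chinchilla-style rule of thumb).
    """
    # Solve C = 6 * N * D with D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * ratio))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's rough budget: 6 * 70e9 params * 1.4e12 tokens ~= 5.9e23 FLOPs
n, d = compute_optimal_allocation(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens
```

Note that because C grows with the product N·D, doubling the budget under this rule grows each of N and D by only √2, not 2.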
To do this, the researchers trained over 400 models using three approaches: fixing N and varying D; fixing D and varying N; and using the data from these two approaches to fit a power-law function of N and D that extrapolates the expected loss. They find that all three approaches give similar answers: parameters and data are approximately equal in importance for achieving the best performance, with data perhaps being slightly more important.
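The power-law fitting step can be sketched with a few lines of SciPy. The functional form below (a constant floor plus separate power-law terms in N and D) follows the Chinchilla parameterization, but the coefficients and the 400 "training runs" here are synthetic, generated purely to demonstrate the fitting machinery:

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, alpha, B, beta):
    """Chinchilla-style loss law: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic stand-in for ~400 training runs across model/dataset sizes.
rng = np.random.default_rng(0)
N = rng.uniform(1e7, 1e10, size=400)   # parameter counts
D = rng.uniform(1e9, 1e12, size=400)   # token counts
assumed = (1.69, 406.4, 0.34, 410.7, 0.28)  # illustrative coefficients
L = chinchilla_loss((N, D), *assumed)

# Fit the law back from the (N, D, loss) triples.
params, _ = curve_fit(chinchilla_loss, (N, D), L,
                      p0=(2.0, 100.0, 0.3, 100.0, 0.3), maxfev=20000)
print(params)
```

Once fitted, minimizing L(N, D) subject to a FLOPs constraint yields the compute-optimal N and D directly.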
This suggests that current LLMs are grossly over-parameterized and data-deficient for optimal compute usage. This motivated training Chinchilla to validate these results, using a model with 70B parameters and a dataset of 1.4 trillion(!) tokens, which subsequently showed better performance and lower loss than much larger models.
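A quick sanity check (again assuming the rough FLOPs ≈ 6·N·D approximation) shows that Chinchilla's budget was comparable to Gopher's even though the model is 4x smaller — the savings came from reallocating compute, not adding more of it:

```python
def train_flops(n_params, n_tokens):
    """Rough training cost under the FLOPs ~= 6 * N * D approximation."""
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)      # 280B params, ~300B tokens
chinchilla = train_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens
print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
```

Both land around 5–6 × 10²³ FLOPs, yet the smaller, data-rich model wins on loss, and is also roughly 4x cheaper to run at inference time.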
The resource savings of training large transformer models in gene therapy programs
Specific to gene therapy programs, the question remains: is it more time- and cost-effective to make the language model bigger, or to go out and collect more genomic and proteomic data? Thanks to the Chinchilla analysis, this question can now be quantified. Genomic data collection is becoming steadily less expensive, so obtaining more data may prove much more beneficial. Moreover, LLMs trained on massive datasets have so far made only a single pass over each data sample, leaving the learning potential of multiple passes through the data unexplored. A similar analysis could be carried out to determine the diminishing returns of re-using data.
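One way this "bigger model vs. more data" decision could be quantified: given a fitted Chinchilla-style loss law, compare the marginal loss reduction per unit cost of adding parameters versus acquiring tokens. This is a hypothetical sketch — the coefficients are illustrative and the per-parameter and per-token costs are placeholder values a program would substitute with its own estimates:

```python
def marginal_gains(N, D, A=406.4, alpha=0.34, B=410.7, beta=0.28,
                   cost_per_param=1e-9, cost_per_token=1e-10):
    """Loss reduction per unit of spend on parameters vs. on data.

    Uses L = E + A/N^alpha + B/D^beta; coefficients and costs are
    hypothetical placeholders, not measured values.
    """
    # Derivatives of the loss (both negative: loss falls as N or D grow).
    dL_dN = -alpha * A / N**(alpha + 1)
    dL_dD = -beta * B / D**(beta + 1)
    # Convert to loss reduction per unit cost on each axis.
    return -dL_dN / cost_per_param, -dL_dD / cost_per_token

gain_params, gain_data = marginal_gains(N=280e9, D=3e11)
print("spend on data" if gain_data > gain_params else "spend on parameters")
```

Under these placeholder numbers, a Gopher-sized model paired with a 300B-token dataset sits on the over-parameterized side of the curve, so the marginal dollar goes further on data; a small model on the same dataset flips the answer.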
Given this scaling law and the dataset used to train a given LLM, it seems plausible that these larger models could be distilled into their compute-optimal sizes without loss in performance5. This work may provide tangible bounds for understanding scaling laws of performance within Transformer model distillation.
The importance of data highlighted here, together with the finite nature of data available within cell and gene therapy programs, suggests that data generation methods could be the key to progress with larger models. Generative methods in reinforcement learning, e.g. World Models, allow an agent to train on its own internal model of its environment, dramatically improving data efficiency. How data like this would fit into the scaling laws derived here is again an open question6.
Potential future research includes understanding the diminishing returns of reusing data multiple times, distilling larger Transformer models into their compute-optimal sizes, and improving data efficiency using generative methods.