Understanding How Best to Allocate Resources When Training Large Language Models in Gene Therapy Development Programs
DeepMind researchers released a large language model (LLM) that uses fewer parameters yet achieves better performance. Read more to understand how this is relevant to your gene therapy programs.
Nick Ketz, PhD
January 31, 2023
Researchers at DeepMind released a large language model (LLM) called Chinchilla that used fewer parameters than its predecessors but achieved better performance by processing more data rather than training more parameters.
By training over 400 models and fitting a power-law function to the results, the researchers show that collecting more data, rather than adding parameters, is the key to better performance.
Because the cost of generating and processing genomic and proteomic data continues to fall, obtaining and using more data is likely to be more time- and cost-effective than training more parameters in gene therapy LLM development programs.
Gene therapy development programs often face the challenge of allocating resources effectively to maximize results. When it comes to training language models that are large in size or scope, the task can become especially daunting due to computational costs, limited data availability, and other obstacles. Recently, DeepMind researchers released a large language model that uses fewer parameters yet achieves better performance. Learn how this approach can help you understand how best to allocate large language model (LLM) resources in gene therapy development programs.
How DeepMind’s LLM outperforms its predecessors
Last year researchers at DeepMind released a large language model (LLM) that outperformed OpenAI’s GPT-3 and even Microsoft’s enormous Megatron model. Usually this would be made possible by some new architecture design increasing the model complexity, or simply by making the model bigger by adding more trainable parameters1,2. In fact, that is exactly what they did the previous year with an LLM called Gopher, which had 280 billion trainable parameters compared to GPT-3’s 175 billion3. What was interesting is that DeepMind’s latest model, called Chinchilla, used approximately the same Transformer design as GPT-3 and Gopher, but used fewer parameters! How was this possible?
Fewer parameters and more data = better performance
The simple answer is that they used more data, a lot more data. The key insight was that the compute usage for training an LLM was non-optimal: compute cycles would be better spent processing more data than training more parameters. You can think of compute-FLOPs (shown below) as proportional to the product of the number of parameters in a model, N, and the number of data samples that need to be learned, D.
\[\mathrm{FLOPs}(N, D) = N \cdot D \cdot \mathrm{constant}\]
How best to balance these two terms, N and D, was largely ignored until recently, and approximately the same dataset size of 300B tokens was used for all LLMs4. The Chinchilla paper sought to determine empirically the optimal N and D that, for a fixed compute budget C, achieve the minimum training loss L:

\[N_{opt}(C),\, D_{opt}(C) = \underset{N,\,D \;:\; \mathrm{FLOPs}(N,D)=C}{\arg\min}\; L(N, D)\]
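A back-of-envelope sketch of this allocation problem in Python, assuming the commonly cited approximation FLOPs ≈ 6·N·D and the Chinchilla finding that tokens should scale roughly in proportion to parameters (about 20 tokens per parameter is the rule of thumb implied by the 70B-parameter, 1.4T-token result); the function name and the exact ratio are illustrative, not from the paper's code:

```python
def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a fixed FLOPs budget between parameters N and tokens D.

    Assumes FLOPs ~= 6 * N * D and a fixed target ratio D/N
    (~20 tokens per parameter, a Chinchilla-style rule of thumb).
    """
    # Solve C = 6 * N * D with D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * ratio))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's rough budget: 6 * 70e9 params * 1.4e12 tokens ~= 5.9e23 FLOPs
n, d = compute_optimal_allocation(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens
```

Note that because C grows with the product N·D, doubling the budget under this rule grows each of N and D by only √2, not 2.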
To do this, the researchers trained over 400 models using three approaches: fixing N and varying D; fixing D and varying N; and using the data from these two approaches to fit a power-law function of N and D that extrapolates the expected loss. They find that all three approaches give similar answers: parameters and data are approximately equal in importance for achieving the best performance, with data perhaps being slightly more important.
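The power-law fitting step can be sketched with a few lines of SciPy. The functional form below (a constant floor plus separate power-law terms in N and D) follows the Chinchilla parameterization, but the coefficients and the 400 "training runs" here are synthetic, generated purely to demonstrate the fitting machinery:

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, alpha, B, beta):
    """Chinchilla-style loss law: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic stand-in for ~400 training runs across model/dataset sizes.
rng = np.random.default_rng(0)
N = rng.uniform(1e7, 1e10, size=400)   # parameter counts
D = rng.uniform(1e9, 1e12, size=400)   # token counts
assumed = (1.69, 406.4, 0.34, 410.7, 0.28)  # illustrative coefficients
L = chinchilla_loss((N, D), *assumed)

# Fit the law back from the (N, D, loss) triples.
params, _ = curve_fit(chinchilla_loss, (N, D), L,
                      p0=(2.0, 100.0, 0.3, 100.0, 0.3), maxfev=20000)
print(params)
```

Once fitted, minimizing L(N, D) subject to a FLOPs constraint yields the compute-optimal N and D directly.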
This suggests that current LLMs are grossly over-parameterized and data-deficient for optimal compute usage. This motivated training Chinchilla to validate these results, using a model with 70B parameters and a dataset of 1.4 trillion(!) tokens, which subsequently showed better performance and lower loss than much larger models.
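A quick sanity check (again assuming the rough FLOPs ≈ 6·N·D approximation) shows that Chinchilla's budget was comparable to Gopher's even though the model is 4x smaller — the savings came from reallocating compute, not adding more of it:

```python
def train_flops(n_params, n_tokens):
    """Rough training cost under the FLOPs ~= 6 * N * D approximation."""
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)      # 280B params, ~300B tokens
chinchilla = train_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens
print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
```

Both land around 5–6 × 10²³ FLOPs, yet the smaller, data-rich model wins on loss, and is also roughly 4x cheaper to run at inference time.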
The resource savings of training large transformer models in gene therapy programs
Specific to gene therapy programs, the question remains: is it more time- and cost-effective to make the language model bigger, or to go out and collect more genomic and proteomic data? Thanks to the Chinchilla analysis, this question can now be quantified. Genomic data collection is becoming steadily less expensive, so obtaining more data may prove much more beneficial. Moreover, LLMs trained on massive datasets have so far made only a single pass over each data sample, leaving the learning potential of multiple passes through the data unexplored. A similar analysis could be carried out to determine the diminishing returns of re-using data.
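One way this "bigger model vs. more data" decision could be quantified: given a fitted Chinchilla-style loss law, compare the marginal loss reduction per unit cost of adding parameters versus acquiring tokens. This is a hypothetical sketch — the coefficients are illustrative and the per-parameter and per-token costs are placeholder values a program would substitute with its own estimates:

```python
def marginal_gains(N, D, A=406.4, alpha=0.34, B=410.7, beta=0.28,
                   cost_per_param=1e-9, cost_per_token=1e-10):
    """Loss reduction per unit of spend on parameters vs. on data.

    Uses L = E + A/N^alpha + B/D^beta; coefficients and costs are
    hypothetical placeholders, not measured values.
    """
    # Derivatives of the loss (both negative: loss falls as N or D grow).
    dL_dN = -alpha * A / N**(alpha + 1)
    dL_dD = -beta * B / D**(beta + 1)
    # Convert to loss reduction per unit cost on each axis.
    return -dL_dN / cost_per_param, -dL_dD / cost_per_token

gain_params, gain_data = marginal_gains(N=280e9, D=3e11)
print("spend on data" if gain_data > gain_params else "spend on parameters")
```

Under these placeholder numbers, a Gopher-sized model paired with a 300B-token dataset sits on the over-parameterized side of the curve, so the marginal dollar goes further on data; a small model on the same dataset flips the answer.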
Given this scaling law and the dataset used to train a given LLM, it seems plausible that these larger models could be distilled into their compute-optimal sizes without loss in performance5. This work may provide tangible bounds for understanding scaling laws of performance within Transformer model distillation.
The importance of data highlighted here, together with the finite nature of data available within cell and gene therapy programs, suggests that data generation methods could be the key to progress with larger models. Generative methods in reinforcement learning, e.g. World Models, allow an agent to train on its own internal model of its environment, dramatically improving data efficiency. How data like this would fit into the scaling laws derived here is again an open question6.
Potential future research includes understanding the diminishing returns of reusing data multiple times, distilling larger Transformer models into their compute-optimal sizes, and improving data efficiency using generative methods.