Exploring the Versatility of the Latest Transformer Architecture Variants and Their Implications for Gene Therapy Developers
Continuing our blog series on Transformer architectures and their use in gene therapy applications, we explore the latest variants’ size and speed improvements.
Austin Day, PhD
March 22, 2023
The Transformer architecture has been all the rage since it was originally published in 2017, but it hasn’t been sitting still since then1. Groups worldwide have seized on this successful architecture and iterated on it rapidly, producing variants that are bigger, stronger, and faster in various applications. There is a great deal of information out there on transformer variants already, and new variants are being tested and published all the time. To cut through the noise, I’ll write more blog posts covering some of the more interesting or recent transformer architectures. The goal of this article series is not to provide a complete overview of every modification and improvement ever made to the original transformer architecture; instead, I want to discuss examples that help us better understand how we can improve transformers for our specific biological and genomic applications. For background on transformers, which I recommend understanding before proceeding further, there are many resources available online1, 5, 6, 7. You can also simply have a study session with ChatGPT8, since it seems to have a very good grasp of the details of transformer architectures.
Year 2019: The Evolved Transformer
I’ll start with the Evolved Transformer2. In this paper, the authors took the vanilla transformer and used an evolution-based neural architecture search algorithm to propose and test modifications. Their benchmarks were machine translation tasks, using the same WMT datasets as the original transformer paper1. They ended up with an architecture that, for a given number of parameters, performed better on this particular task than the original transformer. The evolved architecture for the encoder is shown below.
It’s interesting that there seem to be additional convolutional layers before the attention mechanism, perhaps suggesting that some kind of compression may be useful before attention is applied. The authors emphasized that their model was much more efficient at smaller model sizes, but that the performance difference shrinks at larger model sizes. This suggests to me that the original transformer architecture may be overkill for the task at hand: if early compression improves performance, then the information lost during that compression wasn’t all that necessary, and the gains from working with a smaller representation outweighed the loss of model expressivity. There are many ways to speculate about these changes, but the point I want to make here is that there is room for improvement, especially when specializing an architecture for a particular prediction task.
Overcoming Transformer Limitations with Size, Speed and Hardware Improvements
One of the first weaknesses identified in the original transformer architecture was the attention matrix. Computing it has quadratic complexity with respect to the input sequence length, which severely limits the input size. Many of the first iterations sped up or simplified the attention mechanism in some way to get around this problem. Several “Fast Transformers” applied various optimizations to address this quadratic dependency. The main types of improvements involve speeding up the core attention calculation, which at heart is a matrix multiplication, i.e., a batch of dot products.
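To make that bottleneck concrete, here is a minimal NumPy sketch of standard scaled dot-product attention; the (n × n) score matrix it materializes is the source of the quadratic memory and compute cost. All names and sizes here are illustrative, not taken from any particular implementation:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention.

    The intermediate score matrix has shape (n, n), so memory and
    compute both grow quadratically with the sequence length n.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)               # (n, n): the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

Doubling the sequence length quadruples the size of `scores`, which is why so many variants focus on avoiding that n × n matrix entirely.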
Dot Product Speed Improvements
The “Linear Transformer,” introduced by Katharopoulos et al., implemented a different way of doing the dot products, what they call a “feature map based dot product,” which exploits the associative property of matrix products to reduce the compute and memory requirements of autoregressive transformers3. Similar speedups also targeted the dot product computation, such as the “Efficient Attention” approach, which provides an alternative, faster way of computing dot-product attention11. Most recently, in 2022, DeepMind used AlphaTensor to discover new, general, and faster ways to do matrix multiplications9. These types of improvements are very important, and I expect many of these theoretical improvements to dot product calculations to end up implemented in hardware kernels or software backends, abstracted away from day-to-day use. Still, they are good to be aware of: when a new attention function is implemented differently one day, we will know it has yet to undergo this kind of optimization.
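As a rough illustration of the associativity trick (a sketch in NumPy, not the paper’s actual implementation), the snippet below linearizes attention with an elu(x) + 1 feature map, as in the Linear Transformer paper, and checks it against the equivalent quadratic computation. Function names and dimensions are my own:

```python
import numpy as np

def feature_map(x):
    # Positive feature map used by the Linear Transformer: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Exploit associativity: (Qf @ Kf.T) @ V == Qf @ (Kf.T @ V).

    Computing Kf.T @ V first costs O(n * d^2) instead of the O(n^2 * d)
    needed to materialize the full n x n attention matrix.
    """
    Qf, Kf = feature_map(Q), feature_map(K)     # (n, d) each
    KV = Kf.T @ V                               # (d, d) summary; no n x n matrix
    Z = Qf @ Kf.sum(axis=0)                     # per-query normalizer, shape (n,)
    return (Qf @ KV) / Z[:, None]

def quadratic_equivalent(Q, K, V):
    # The same result computed the slow way, for comparison.
    Qf, Kf = feature_map(Q), feature_map(K)
    A = Qf @ Kf.T                               # (n, n)
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(linear_attention(Q, K, V), quadratic_equivalent(Q, K, V))
```

The two functions return the same values, but only one of them ever builds the n × n matrix; in the autoregressive setting the paper goes further by updating the (d × d) summary incrementally.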
Compression / Approximation Improvements
Using a slightly different approach, back in 2020 Vyas et al. tried to improve the attention mechanism with what they call “clustered attention”: instead of computing the attention matrix for all queries, they cluster the queries first, then compute attention only for those clusters4. This approach amounts to approximating the original transformer, or applying some kind of compression to the data first. They showed that on some language tasks the speed was twice as fast for similar performance. There is also the “Linformer,” which approximates the self-attention mechanism with a low-rank matrix, and the “Reformer,” which replaces dot-product attention with one based on locality-sensitive hashing to improve the complexity12, 13. These are similar in spirit to the approach DeepMind pursued in 2021, where they used cross-attention to essentially compress the inputs before the bulk of the transformer computation, allowing for extremely long input sequences10.

In my mind, the main takeaway from these approaches is that if you can compress your information somehow, you can greatly reduce the complexity of the problem, allowing more compute resources to be applied to lower-dimensional data. This also highlights the importance of feature engineering and problem simplification. Feature engineering can be thought of as doing some of that pre-computation on your input data, using domain knowledge to amplify important features and reducing the computation needed to discover them automatically from less refined data. Problem simplification can be thought of as approximation on the output end: if a task is very complex, a binary approximation of the output variable may be an easier task to tackle first. You may even be able to use a simplified task as an auxiliary head of the network.
For example, if earlier layers of a network are trained to predict a binary version of what is normally a multi-class output variable, that auxiliary signal may bias training so that the deeper layers have a better starting point and an easier path toward solving the more complex problem.
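A minimal sketch of that idea, with hypothetical names, shapes, and a made-up 0.3 weighting on the auxiliary loss: a shared hidden layer feeds both an auxiliary binary head and the main multi-class head, and the two cross-entropy losses are simply summed during training:

```python
import numpy as np

def softmax_xent(logits, y):
    # Multi-class cross-entropy for integer labels y.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def sigmoid_xent(logit, y):
    # Numerically stable binary cross-entropy on raw logits.
    return np.mean(np.maximum(logit, 0) - logit * y
                   + np.log1p(np.exp(-np.abs(logit))))

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 16))            # batch of 8 inputs
W1 = rng.standard_normal((16, 32)) * 0.1
h = np.tanh(x @ W1)                         # shared earlier-layer representation

w_aux = rng.standard_normal(32) * 0.1       # auxiliary head: binary simplification
aux_logit = h @ w_aux

W2 = rng.standard_normal((32, 5)) * 0.1     # main head: full 5-class problem
main_logits = h @ W2

y_multi = rng.integers(0, 5, size=8)        # the real multi-class labels
y_bin = (y_multi > 0).astype(float)         # binary approximation of those labels

# Combined training objective: main loss plus a weighted auxiliary loss.
loss = softmax_xent(main_logits, y_multi) + 0.3 * sigmoid_xent(aux_logit, y_bin)
print(loss)
```

In a real model the gradients from both heads would flow back into the shared layers; the binary head is discarded at inference time once it has done its job shaping the representation.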
So approximation, compression, and theoretical improvements are all valid ways to improve the attention layers, but Nvidia has combined these with hardware-based improvements in its “Faster Transformer” implementation, built on top of CUDA, cuBLAS, cuBLASLt, and C++, with APIs for TensorFlow and PyTorch. It essentially uses the “tensor cores” in their GPU offerings more effectively to speed up transformers across various types of architectures14. Nvidia is doing a lot of great work, but Google’s TPUs still seem to be state of the art in power, scalability, and efficiency, since they are custom chips designed for this purpose.
How the Versatility of Transformer Architecture Impacts AAV Gene Therapy Developers
I know this is only a taste of what’s out there for transformer architectures, and with enough articles we can eventually catch up to the state of the art, but understanding the steps taken to get here can better prepare us to innovate further.
And how is all this information relevant to gene therapy developers? Building on the latest developments in Transformer architecture variants, we can use our data resources more powerfully than ever before. Adeno-associated virus (AAV) gene therapy development is often constrained by a lack of data, but through targeted optimization and customized solutions19, these new architectures help us make maximum use of limited data and expedite time-to-market20 for such therapies. This means that even complex problems like AAV capsid truncations21 can be addressed with greater precision and efficiency.
If you would like a sneak peek at other architectures and modifications that may be discussed in future articles, check out this very good catalog of transformers16, and a great survey paper on transformers17.
Want to stay up-to-date on the latest Transformer architecture variants?