Leveraging AI Models To Improve Safety of Gene Therapy

Learn how predictive modeling can generate synthetic regulatory sequences for safer and more efficacious gene therapies.

Nick Ketz, PhD

May 3, 2023

In gene therapy development and manufacturing, multiple steps can be optimized to improve the efficacy and safety of therapeutics. Viral vectors carrying a transgene must be delivered successfully and specifically to the correct cells within a patient, and the transgene must then be expressed within those cells at appropriate levels.

Transgene expression is often modulated by changing the dosage until the desired level of expression is reached, which adds manufacturing costs to an already expensive process. Being able to control the level of expression precisely would not only reduce costs but could also increase the safety of the therapy by ensuring that expression occurs at suitable levels within target cell types.

Training Models for Predicting Regulatory Motifs

Recent work by Zrimec et al. used Generative Adversarial Networks (GANs) to design regulatory sequences that improve the control of expression.[2] In the schematic above, a convolutional Generator network \(G\) takes a latent encoding \(Z\) as input (e.g., random noise) and outputs a one-hot encoded DNA sequence that mimics natural regulatory sequences. This output is then passed to a convolutional Discriminator network \(D\), along with validated regulatory sequences, to produce a score comparing the two inputs. High scores imply that the generated sequences are similar to the natural sequences, while low scores indicate that the two inputs differ. The two networks are trained together in competition: the Generator tries to produce sequences that the Discriminator cannot tell apart from natural ones, so the generated output becomes increasingly similar to the natural sequences. After training, generated sequences were compared with held-out natural sequences, and most synthetic sequences (86%) displayed properties similar to those of natural sequences.
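To make the one-hot representation concrete, here is a minimal sketch in plain Python. The helper names (`one_hot_encode`, `softmax`, `decode`) and the example scores are illustrative assumptions, not code from the study. It shows how a DNA string maps to a one-hot matrix and how per-position scores, such as those a generator might emit, can be read back as a sequence.

```python
import math

ALPHABET = "ACGT"  # column order used for the encoding (an assumption)

def one_hot_encode(seq):
    """Return one row per base, with 1.0 in the column of that base."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def softmax(row):
    """Turn raw per-base scores into probabilities that sum to 1."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def decode(scores):
    """Pick the highest-scoring base at each position."""
    return "".join(ALPHABET[max(range(4), key=lambda i: row[i])] for row in scores)

# A natural sequence is encoded exactly; a generator-style output is "soft".
encoded = one_hot_encode("TATAAA")
soft_output = [softmax([0.1, 0.2, 0.3, 2.5]), softmax([3.0, 0.0, 0.1, 0.2])]

print(encoded[0])           # row for the first base, T
print(decode(soft_output))  # most probable base at each position
```

A discriminator would receive matrices of this shape for both natural and generated sequences, which is what lets it score the two on equal footing.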

A generative model of regulatory sequences alone is not sufficient for this task, as it cannot predict the expression level associated with a given regulatory sequence. Designing sequences for controlled expression requires a model that relates sequence to expression level. Zrimec et al. had previously trained a separate neural network to do just this: it predicts expression levels from natural genomic sequences of Saccharomyces cerevisiae spanning multiple types of regulatory regions (4,238 sequences of 1,000 bp inputs spanning gene promoters, UTRs, and terminators).[1] With these two models working together, the design of regulatory sequences for controlled expression levels can be attempted.

Model Optimization

Building on the two models described above, the authors used an optimization procedure that guides the \(G\) network toward sequences with desirable properties, such as a target expression level.[2] The procedure relies on a Predictor model \(P\), the pre-trained model of yeast gene expression described above, to predict the expression level of a given sequence.[1] Because this Predictor takes the gene coding region as input in addition to the surrounding regulatory sequences, coupling \(P\) to \(G\) allows regulatory sequences to be generated for a specific gene.

By differentiating through the \(P\) and \(G\) networks, a direction in the latent space \(Z\) can be identified that increases or decreases the output of \(P\) (i.e., the predicted expression level). Following this direction iteratively leads to a maximum or minimum for a given starting sequence, as shown in the figure above. With this approach, the researchers generated sequences whose predicted expression levels span 6 orders of magnitude. This exceeds the natural range of expression by 2 orders of magnitude, suggesting that some generated sequences could achieve higher expression levels than natural sequences.
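The latent-space search can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `generator` and `predictor` functions are tiny stand-ins for the trained \(G\) and \(P\) networks, `BASE_WEIGHT` is invented, and central finite differences stand in for the automatic differentiation a deep learning framework would provide. Only the overall idea, stepping \(Z\) along the gradient of \(P(G(Z))\), follows the description above.

```python
import math
import random

SEQ_LEN = 6
# Hypothetical predictor weights: contribution of A, C, G, T to the score.
BASE_WEIGHT = [1.0, -0.5, -1.0, 0.5]

def generator(z):
    """Toy G: map latent z to per-position base probabilities via softmax."""
    probs = []
    for pos in range(SEQ_LEN):
        logits = z[4 * pos: 4 * pos + 4]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def predictor(probs):
    """Toy P: expected score of a soft sequence under BASE_WEIGHT."""
    return sum(p * w for row in probs for p, w in zip(row, BASE_WEIGHT))

def objective(z):
    """Predicted expression for latent z, i.e. P(G(z))."""
    return predictor(generator(z))

def numerical_gradient(f, z, eps=1e-5):
    """Central finite differences; a stand-in for backpropagation."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def optimize_latent(z, steps=200, lr=0.5, maximize=True):
    """Move z along (or against) the gradient of P(G(z))."""
    sign = 1.0 if maximize else -1.0
    z = list(z)
    for _ in range(steps):
        grad = numerical_gradient(objective, z)
        z = [zi + sign * lr * gi for zi, gi in zip(z, grad)]
    return z

rng = random.Random(0)
z_start = [rng.gauss(0, 1) for _ in range(4 * SEQ_LEN)]
z_high = optimize_latent(z_start, maximize=True)
z_low = optimize_latent(z_start, maximize=False)
print(objective(z_low), objective(z_start), objective(z_high))
```

Starting both searches from the same `z_start` mirrors the idea of pushing one starting sequence toward either higher or lower predicted expression.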

In Vivo Validation

In vivo validation was performed on a selected set of generated sequences spanning 4 orders of magnitude of predicted expression levels. Critically, this set was restricted to generated sequences that displayed properties similar to natural sequences (e.g., promoters containing known binding motifs) while exhibiting low sequence similarity to natural sequences. In addition, the regulatory sequences of the POP6 (predicted transcripts per million (TPM) of 64) and RPL3 (predicted TPM of 303) genes were used as low- and high-expression controls, respectively.

As seen in the figure above, experimental measurements of the mRNA levels produced by each generated sequence showed a good rank correlation with the predicted levels (Spearman's ρ = 0.74); however, the difference between predicted and actual expression levels was relatively large. Differences of 7.7-fold and 2.5-fold between predictions and TPM measurements were observed for the lower 'gen-10' group (predicted TPM ~10, avg. measured TPM 77) and the higher 'gen-1000' group (predicted TPM ~1000, avg. measured TPM 397), respectively. Although the authors could not generate sequences with expression lower than the POP6 control, within the gen-1000 group, 4 out of 7 regulatory constructs (57%) displayed average expression levels that surpassed those of the natural, highly expressed RPL3 control by up to 2.7-fold. This demonstrates the method's ability to design regulatory DNA that exceeds natural expression levels, although not at a precisely predicted level.
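The fold differences quoted above follow directly from the reported group averages, and the rank correlation can be computed in a few lines. In the sketch below, the function names are illustrative, and the `predicted` and `measured` lists are made-up values used only to demonstrate the calculation; they are not data from the study.

```python
def fold_difference(predicted, measured):
    """Ratio of the larger value to the smaller, so the result is always >= 1."""
    return max(predicted, measured) / min(predicted, measured)

def ranks(values):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)

# Group averages reported in the text above.
gen10_fold = fold_difference(10, 77)       # gen-10: predicted ~10, measured 77
gen1000_fold = fold_difference(1000, 397)  # gen-1000: predicted ~1000, measured 397
print(round(gen10_fold, 1), round(gen1000_fold, 1))

# Hypothetical values, for illustration only (not from the study).
predicted = [10, 50, 100, 500, 1000]
measured = [77, 60, 150, 300, 397]
print(spearman(predicted, measured))
```

The toy data also shows why a good rank correlation can coexist with large fold errors: ordering is mostly preserved even when absolute levels are off.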

Tunable Transgene Expression and Safer Gene Therapies

Based on these results, the design of gene regulatory elements shows promising potential for controlling a given gene's expression within a target cell. The path to extending this method to human genes is relatively straightforward, as existing expression prediction models (e.g., BPNet) can be used to drive the generation process.[3]

This method could increase transgene expression and reduce the titer needed for a given therapy during development and manufacturing, ultimately leading to increased efficacy. Future work on AI in gene therapy could also use single-cell expression data to design for specific cell types, further improving the specificity and ultimate safety of a designed gene therapy viral vector.

AI Disclosure: Feature image was generated by an AI image tool, MidJourney.


  1. Zrimec J, Börlin CS, Buric F, et al. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat Commun. 2020;11(1):6141.
  2. Zrimec J, Fu X, Muhammad AS, et al. Controlling gene expression with deep generative design of regulatory DNA. Nat Commun. 2022;13(1):5099. 
  3. Avsec Ž, Weilert M, Shrikumar A, et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet. 2021;53(3):354-366. 