Informed Insights

September 27, 2022

Opportunities for Applying Today’s Computational Advances to Yesterday’s Analysis Techniques


Brandi Cantarel, PhD

Dr. Brandi Cantarel is Form Bio's Director of Bioinformatics, where she focuses on developing analysis tools for the platform. Her career has included leading gene-of-interest analysis as part of the NIH Human Microbiome Project, leading the bioinformatics and software development team that launched a CAP/CLIA cancer genomics laboratory, and being named a 2020 Highly Cited Researcher. She holds a PhD and an MS from the University of Virginia.

The term “bioinformatics” was coined by Paulien Hogeweg and Ben Hesper, in 1978, to describe “the study of informatic processes in biotic systems” (1). 

This original definition describes information processing inside organisms or in biological systems. Today, however, one might define bioinformatics as a discipline that combines biology, mathematics and computer science to acquire, store, analyze and disseminate complex biological data.

Biologists want to understand how genetics and environment together lead to differing phenotypes, as measured by biomarkers, images, health, or physical features of biological systems, ranging from the cellular level through tissues and organ systems to organisms and communities of organisms. Bioinformatics uses computer algorithms and statistics to help biologists interpret complex biological data that cannot otherwise be easily understood. Complementary disciplines in bioinformatics include: (a) database design and data mining, (b) macromolecular geometry, (c) molecular evolution prediction, (d) protein structure and function prediction, (e) genome annotation, and (f) biomarker data clustering.

Sequence analysis is a sub-discipline of bioinformatics that compares DNA and protein sequences to infer molecular evolution, gene function and protein structure. It is used in molecular evolution prediction, protein structure and function prediction, and genome annotation. Central to sequence analysis is the idea that genes that share more similarity than would be expected by chance are homologous, or homologs, meaning they share a common ancestor and therefore should share similar molecular functions and protein structures. Orthologs are genes that diverged following a speciation event and are believed to be “equivalent in their function” between species. Paralogs are genes that diverged following a gene duplication event and are believed to have evolved new functional capabilities, since these genes aren’t under the same evolutionary constraints and even small genetic changes can lead to loss or gain of function. We don’t have to wait millions of years to see how genetic changes can affect gene function. In the case of cancer, tumor cells can evolve rapidly and, with just a few single nucleotide mutations, become resistant to treatment. Even though sequence similarity is used to predict cellular function, these predictions must be validated using independent computational and experimental methods. While sequence analysis is probably one of the most visible sub-disciplines within bioinformatics, advances in machine learning have been slow to reshape many of the field's early techniques.

Foundational Work in Sequence Analysis

The evolution of early sequence-analysis bioinformatics largely followed the evolution of sequencing technologies. In 1977, Frederick Sanger and colleagues developed a sequencing method based on the random termination of DNA replication (2). In 1979, Margaret Dayhoff compiled one of the first protein sequence databases, manually aligning proteins in the same gene families and creating the first protein substitution matrix (3). In 1980, the first version of the phylogeny inference package (PHYLIP) was released, which allowed for inference of evolutionary trees (4). In 1981, Temple Smith and Michael Waterman proposed an optimal algorithm for local alignments (5). Through the 1980s and 1990s, a series of sequence alignment tools followed, including FASTA in 1985 (6), BLAST in 1990 (7), PUMA in 1992 (8), HMMER in 1995 (9), and gapped BLAST and PSI-BLAST in 1997 (10). These tools allowed researchers to compare proteins and organisms to understand the molecular evolution of genes and the resulting protein sequences. These advances in sequence comparison, along with increased sequence diversity in sequence repositories, in some cases changed our taxonomic classification of organisms and species to match the evolution of their genes and genomes.
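To make the idea of an optimal local alignment concrete, here is a minimal Python sketch in the spirit of the Smith-Waterman approach mentioned above. It is a simplification rather than the published algorithm: it assumes a linear gap penalty, and the scoring values (`match`, `mismatch`, `gap`) and the function name are illustrative choices, not taken from the original paper or from any tool named in this article.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for a local alignment.

    Assumes a linear gap penalty; the scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(smith_waterman("AAACGTAAA", "CCACGTCC"))  # (8, 'ACGT', 'ACGT')
```

The zero in the `max` is what makes the alignment local: a poorly matching prefix is simply dropped instead of dragging the score down, so the shared `ACGT` segment is recovered even though the flanking sequence differs.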

Starting in 1995, the first whole genome sequences of bacterial organisms became available, due in part to advances in DNA sequencing and the development of a technique called Whole Genome Shotgun (WGS) (11,12). In WGS sequencing, DNA molecules are fragmented and the pieces are then “stitched back together” using sequence similarity, in a process called genome assembly. Genes in bacterial genomes were originally predicted using pattern matching methods, where open reading frames (ORFs) are defined as sections of the genome beginning with a start codon (ATG) and ending with a stop codon (TAG, TGA, TAA). However, this simple method can produce many false positive gene predictions. Researchers then needed a set of rules to sort the likely false positives from the real genes. But, as in all things biology, there wasn’t a universal set of rules to separate true genes from falsely predicted ones. Therefore researchers developed machine learning methods to create a model of gene structure using sequence similarity to known genes. Messenger RNA (mRNA) is transcribed from genes and is then translated into proteins. Gene models were developed using either DNA sequences of well-studied genes or mRNA. In eukaryotes, gene finding is complicated by introns, sequences that separate the coding parts of genes and that are largely absent from bacterial genomes. Introns are spliced out of mature mRNA so that they do not get translated into amino acids. In eukaryotes, having a model to predict genes is therefore essential for determining intron-exon structure.
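The pattern matching approach to ORF prediction described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the method used by any particular gene finder: it scans only the forward strand, treats ATG as the only start codon, and uses a hypothetical `min_codons` length filter as a stand-in for the kind of rule researchers used to discard likely false positives.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, and frame is 0, 1 or 2. min_codons counts the start
    and stop codons and is an illustrative filter, not a standard value.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence one codon at a time in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))          # [(2, 11, 2)]
print(find_orfs("ATGTAA", min_codons=3))   # []
```

The second call shows why simple rules are not enough: any length threshold that removes spurious short ORFs will also remove genuinely short genes, which is part of what motivated the model-based gene finders described above.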

Once researchers know the gene composition of an organism, they can investigate its gene content and functional capabilities, including its ability to thrive in certain environments, and determine the relationship between the organism’s genes and traits.

Opportunities in Artificial Intelligence Advances

Today sequence analysis is applied to many different areas of biology and medicine, including (a) precision medicine, (b) de-extinction and (c) pandemic surveillance. Physicians use genomic sequence analysis to determine the molecular mechanism driving a disease phenotype, such as cancer or rare phenotypes in babies in the NICU. Based on differences in gene sequences between healthy subjects and patients, physicians can use targeted therapies to extend life and, in some cases, cure patients. Thanks to advancements in gene editing, scientists can edit the genomes of living animals to “bring back” extinct ones: by comparing gene sequences, they can make the genes of the modern animal more similar to those of the extinct animal and so alter physical and physiological traits.

While there have been tremendous advances in genome sequencing over the last 15 years, many of the techniques for genome annotation are not so different from those of the early 2000s. Genome features, such as repeats and genes, are still largely identified by sequence comparison to databases of known features. In the case of genes, those comparisons are used to build an organism-specific model describing the intron-exon structure for that organism. However, I wonder if it would be possible to build more generalized models, perhaps for a class or family of organisms such as mammals or even marsupials, using more modern AI techniques such as deep learning. These models could be trained on existing genome annotations and applied to newly sequenced reference genomes. The Vertebrate Genomes Project aims to sequence ~70,000 extant vertebrate genomes from diverse species. Newly trained models would reduce costs by alleviating the need for additional data, including RNA-Seq, to generate species-specific gene models.

Once these reference genomes have been annotated, researchers can better interrogate the relationship between genotype and phenotype through low-cost whole genome sequencing of diverse members of a species that differ in physical and physiological traits. However, some of today’s tools make assumptions about genome size that do not hold across a whole host of diverse vertebrate genomes. For example, most marsupial genomes are composed of fewer, larger chromosomes than the human and mouse genomes. Because of the constraints of the servers available in the 2010s, many tools for determining genetic variation and manipulating genome alignments place limits on the size of the largest chromosome they can handle. These tools therefore need a makeover to reflect today’s computational capabilities, which include not only computers with more memory and processing power but also graphics processing units (GPUs), which are designed to manipulate memory rapidly and to carry out many operations in parallel.
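One documented example of this kind of size limit, offered here as an illustration and not necessarily the one the author has in mind, is the BAI index format used for BAM alignment files, which can only address coordinates up to 2^29 (about 537 million) bases per contig. The Python sketch below checks a set of chromosome lengths against that limit. The function name and the example lengths are hypothetical and invented for illustration.

```python
# Largest contig length addressable by a BAI index (2^29 bases).
BAI_MAX_CONTIG_LENGTH = 2 ** 29  # 536,870,912 bp


def contigs_over_limit(contig_lengths, limit=BAI_MAX_CONTIG_LENGTH):
    """Return the names of contigs longer than `limit`, sorted by name."""
    return sorted(
        name for name, length in contig_lengths.items() if length > limit
    )


# Hypothetical chromosome lengths in base pairs, for illustration only.
example_lengths = {
    "chrom_small": 150_000_000,
    "chrom_large": 700_000_000,
}

print(contigs_over_limit(example_lengths))  # ['chrom_large']
```

A check like this is cheap to run on a new reference genome before choosing tools and index formats, since a chromosome over the limit can cause a tool to fail or to silently mishandle coordinates.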

With advances in genome annotation and genome variant analysis, we will gain greater insight into the genetic changes that drive differences in traits among animals, not just species by species. We may also identify genotype/phenotype relationships that hold across species and, in doing so, learn a little more about human diversity.



  1. Hogeweg, P. and Hesper, B. Interactive instruction on population interactions. Comput Biol Med 8(4):319-27 (1978).
  2. Sanger, F., Nicklen, S. and Coulson, A.R. DNA sequencing with chain-terminating inhibitors. PNAS 74(12):5463-5467 (1977).
  3. Schwartz, R. and Dayhoff, M. Response: Protein and Nucleic Acid Sequence Data and Phylogeny. Science 205(4410):1038-1039 (1979).
  4. Felsenstein, J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5:164-166. The Quarterly Review of Biology 64(4) (1989).
  5. Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J Mol Biol 147(1):195-7 (1981).
  6. Lipman, D.J. and Pearson, W.R. Rapid and sensitive protein similarity searches. Science 227(4693):1435-1441 (1985).
  7. Altschul, S.F. et al. Basic local alignment search tool. J Mol Biol 215(3):403-410 (1990).
  8. Olmsted, R.A. et al. Worldwide prevalence of lentivirus infection in wild feline species: epidemiologic and phylogenetic aspects. J Virol 66(10):6008-18 (1992).
  9. Eddy, S.R., Mitchison, G. and Durbin, R. Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 2(1):9-23 (1995).
  10. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-3402 (1997).
  11. Fraser, C.M. et al. The minimal gene complement of Mycoplasma genitalium. Science 270(5235):397-403 (1995).
  12. Fleischmann, R.D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223):496-512 (1995).
