As biological data has gone digital, with terabytes of sequence data being stored on servers worldwide, several different file types and formats have arisen. Initially, simple text files (think your regular, old .txt files) were used for storing sequence data using the single nucleotide or amino acid code. Yet, these have significant limitations: Plain text files can’t be annotated with chromosome, quality, functional, or other information required in modern-day bioinformatics. Today, a plethora of different file formats are used, from the simplest FASTA format, which includes sequence data with a description, to more complex formats such as General Feature Format (GFF), which displays detailed genomic features. Let’s look at the evolution of sequence file formats in bioinformatics.
Here’s what we’ll discuss:
- The Different Bioinformatics File Types
- Why are There so Many Different Types?
- File Formats and BLAST
- File Format FAQs
The Different Bioinformatics File Types
The FASTA bioinformatics tool was invented in 1988 and used for performing sensitive sequence alignments of DNA or protein sequences.1 Its associated file type, the FASTA format, has become a standard file type in bioinformatics.2 The rise of sequencing technologies and the development of robust bioinformatics analysis tools have since produced several other formats. If you use other sequence alignment tools or perform other types of sequence analysis, you are sure to encounter and use them extensively.
Take a look below at some of the more popular file types, what they look like, and where they’re commonly used.
The FASTA file format is the simplest way of representing nucleic acid or protein sequences using single-letter codes for nucleotides or amino acids.1,2 For each sequence, there are two lines:
- The first is a sequence identifier, which contains information about the sequence, preceded by a “>” symbol. If you retrieve a sequence from GenBank, SWISS-PROT, BLAST, or another database, the identifier will follow a standardized format.
- The second line in a FASTA file is the nucleotide or amino acid sequence, using single letter IUPAC codes.3
These files, commonly denoted by the .fasta, .fa, or .fas extension, are used by most large curated databases. Specific extensions exist for nucleic acids (.fna), nucleotide coding regions (.ffn), amino acids (.faa), and non-coding RNAs (.frn).4
A FASTA file can contain one or many sequences. Tools like ClustalW can take FASTA files with multiple sequences to generate an alignment. Converting between FASTA formats and any of the others discussed below can be done with programs like Seqret and MView.5 Other simple sequence file formats that you may encounter include GCG and IG.
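As a minimal sketch of the layout described above, the following Python function reads FASTA records from any iterable of lines. The function name parse_fasta and the example records are illustrative assumptions, not part of any library. The sketch also accepts sequences wrapped across several lines, which many real FASTA files use.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of identifier line (without '>') to sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new record starts at every '>' line
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # sequence lines may be wrapped, so append to the current record
            records[header] += line
    return records


# Made-up records for illustration only
example = [
    ">seq1 hypothetical example record",
    "ATGCGTAC",
    "GGTTAA",
    ">seq2 another made-up record",
    "MKVLA",
]
records = parse_fasta(example)
for identifier, sequence in records.items():
    print(identifier, len(sequence))
```

For real work, an established parser from a maintained library is usually a safer choice than a hand-written one, since it handles edge cases that this sketch ignores.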
The FASTQ format was developed for and used with next-generation sequencing instruments and builds on the simplicity of the FASTA format. Information about the quality (the “Q” in “FASTQ” stands for quality) of the sequencing reads and base calls is a defining component of the FASTQ file format.
For each sequence within a FASTQ file, there are four lines:4
- The first is a sequence identifier and description. It begins with an “@” symbol followed by information about the sequence. There is a standardized format with Illumina sequencers that includes the unique instrument name, the flowcell lane, etc.
- The second line contains the raw sequence data, as with a FASTA file.
- The third line includes the “+” symbol, optionally followed by the repeated identifier.
- The fourth line contains the quality score for each base in the sequence on the second line and must be the same length as that sequence.
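The four-line structure above can be sketched in Python as follows. The names FastqRecord and parse_fastq, and the example read, are assumptions made for illustration. The sketch assumes each record occupies exactly four lines, and it checks the “@” and “+” markers and the matching lengths of the sequence and quality lines.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text with exactly four lines per record."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(stripped), 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' on non-blank line {i + 1}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' on non-blank line {i + 3}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up read for illustration only
example = ["@read1 made-up read", "GATTACA", "+", "IIIIIII"]
reads = list(parse_fastq(example))
print(reads[0])
```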
The quality score on the fourth line is the Phred score (Q), with each base’s score encoded as a single ASCII character.6 The encoding and range of Q depend on the platform used for sequencing, but in every case Q expresses the probability that a specific base call in a raw sequence is incorrect.6 In its most straightforward calculation, which is used for Sanger sequencing:
Q = -10 × log10(p), where p is the probability that the base call is incorrect.
The larger Q is, the higher the base call accuracy is. For example, a Q of 20 means that, on average, 1 base call in 100 is incorrect (p = 0.01). A Q of 30 means that, on average, 1 base call in 1,000 is incorrect (p = 0.001). FASTQ files typically have the file extension .fastq, .sanfastq, or .fq, though there is no standard.
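The relationship between Q and p can be checked with a few lines of Python. This sketch uses the standard Phred definition, Q = -10 × log10(p), which matches the Q of 20 and Q of 30 examples above. It assumes the common ASCII offset of 33 (often called Phred+33) for decoding quality characters; other offsets exist, so the offset is a parameter. The function names are illustrative.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred score Q to the error probability p."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability p to a Phred score Q."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality line into per-base Phred scores.

    The default offset of 33 is an assumption; some older files use 64.
    """
    return [ord(character) - offset for character in quality]


scores = decode_quality_string("I5!")
print(scores)  # [40, 20, 0]
for q in scores:
    print(q, phred_to_error_probability(q))
```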
BAM files (with the .bam file extension) are closely related to SAM files, which are tab-delimited text files used for storing sequence alignment data. The advantage of the BAM format over the SAM format is that it is a compressed binary version that is smaller in size and can be indexed, making it ideal for storing sequence alignment information and the preferred format for the Integrative Genomics Viewer.7,8
Like most file formats used in bioinformatics, BAM files contain a header and a body. The header stores information about the sequences, with each header line preceded by an “@” symbol. The body contains information about how each sequence aligns with a specific reference sequence.9 Each alignment line includes 11 mandatory fields, including a Phred-scaled mapping quality, a string called CIGAR that describes the alignment, and the read’s sequence and base qualities. Read the detailed specifications for a full description of conventions used in BAM and SAM files.9
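To make the 11 fields concrete, here is a minimal Python sketch that splits one alignment line of the text (SAM) representation into its mandatory fields and breaks the CIGAR string into operations. The field names follow the SAM specification; the function names and the example read are assumptions made for illustration. A BAM file is binary, so it would first need to be decoded to text by a tool such as SAMtools.

```python
import re
from typing import Any, Dict, List, Tuple

# The 11 mandatory fields of a SAM alignment line, in column order
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one tab-delimited alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    record["TAGS"] = columns[11:]  # optional fields, kept as raw strings
    return record


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Return (length, operation) pairs; '*' means no CIGAR is available."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Made-up alignment line for illustration only
line = "read1\t0\tchr1\t100\t60\t5M1I4M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], parse_cigar(record["CIGAR"]))
```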
A set of BAM files can also be combined into a single track, called a merged BAM file with the file extension .bam.list.10
SAM (Sequence Alignment/Map) files, denoted by the .sam file extension, were initially derived from a piece of bioinformatics software called SAMtools, an open-source program for viewing alignments.9 For more details on SAM files, read about the binary version of this file format, BAM files, above.
CRAM files, another file type related to BAM file formats, are a restructured version of BAM files that enables lossless compression.11
VCF (Variant Call Format; file extension .vcf) files store gene sequence variation, such as single nucleotide polymorphisms (SNPs), and are used in genotyping projects.2,12 A VCF file contains a header with metadata lines, each preceded by a “##” string. Best practice is to describe in the header any INFO, FILTER, and FORMAT entries used in the body.
Following the header is the body, made up of 8 mandatory columns, one for each identifier:11
- CHROM: An identifier for the chromosome or contig in the reference genome
- POS: The reference position, with the 1st base having position 1.
- ID: List of unique identifiers. For dbSNP variants, the rs number should be included
- REF: The base in the reference sequence, either A, C, G, T, or N
- ALT: The alternate non-reference alleles
- QUAL: Phred quality score
- FILTER: This can include quality or any other filters that have been applied to the data
- INFO: Additional information such as ancestral allele, the total number of alleles in the genotype, etc.
For a complete guide to using VCF file formats, read the VCF specifications.12
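The eight mandatory columns can be read with a short Python sketch like the one below. It skips the “##” metadata lines and the column header line, splits ALT on commas, and splits INFO into key/value pairs. The function names and the example variant (including its INFO keys and values) are made up for illustration; a maintained VCF library is preferable for real data.

```python
from typing import Any, Dict, Iterable, List

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> Dict[str, Any]:
    """Split an INFO column into a dict; flags without '=' map to True."""
    if info == ".":
        return {}
    result: Dict[str, Any] = {}
    for entry in info.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            result[key] = value
        else:
            result[entry] = True
    return result


def parse_vcf(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Return one dict per variant line, keyed by the mandatory columns."""
    variants = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skips "##" metadata and the "#CHROM" header line
        columns = line.split("\t")
        if len(columns) < 8:
            raise ValueError("a variant line needs at least 8 columns")
        record: Dict[str, Any] = dict(zip(VCF_COLUMNS, columns[:8]))
        record["POS"] = int(record["POS"])      # 1-based position
        record["ALT"] = record["ALT"].split(",")  # several alleles allowed
        record["INFO"] = parse_info(record["INFO"])
        variants.append(record)
    return variants


# Made-up file content for illustration only
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=14;DB",
]
variants = parse_vcf(example)
print(variants[0])
```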
Generic feature formats
A GFF (general feature format; file extension .gff2 or .gff3) describes the various sequence elements that make up a gene and is a standard way of annotating genomes.13 It defines the features present within a gene in the body of the GFF file, including transcripts, regulatory regions, untranslated regions, exons, introns, and coding sequences. As with the VCF, it uses a header region with a “##” string to include metadata.
There are nine mandatory, tab-separated columns:13
- Reference sequence: An ID for the reference sequence being annotated
- Source: The source for the annotation of a feature (e.g., a database such as GenBank)
- Feature: The type of feature (e.g., exon, gene, coding sequence, etc.)
- Start: Start of feature using genomic coordinates
- End: End of feature using genomic coordinates
- Score: A confidence score for the feature (e.g., an E-value if the feature is based on sequence similarity)
- Strand: Defines if the feature is on the “+” or “-” strand
- Phase: Anchors the feature with respect to a reading frame and denotes the number of nucleotides – zero, one, or two – that would need to be removed to reach the next codon
- Attribute / Group: Can be used for alternative feature names, notes, or to group similar features
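As a rough illustration of the nine columns, the following Python sketch parses one feature line. It assumes the GFF3 convention of key=value attribute pairs separated by semicolons; GFF2 and GTF write the last column differently, so this would not apply to them unchanged. The column keys, the function name, and the example feature are assumptions made for illustration.

```python
from typing import Any, Dict

GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Parse one tab-delimited GFF3 feature line into a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GFF feature line must have exactly 9 columns")
    record: Dict[str, Any] = dict(zip(GFF_COLUMNS, columns))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # "." marks an undefined score or phase
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["phase"] = None if record["phase"] == "." else int(record["phase"])
    attributes: Dict[str, str] = {}
    if columns[8] != ".":
        for pair in columns[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value
    record["attributes"] = attributes
    return record


# Made-up feature line for illustration only
line = "chr1\texample_source\tCDS\t1300\t1500\t.\t+\t0\tID=cds1;Parent=mrna1"
feature = parse_gff3_line(line)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```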
The GTF (gene transfer format) file type shares the same format as GFF files, though it is used to define gene and transcript-related features exclusively.14 The attribute/group field, described above, is used in the GTF to include a gene_id or transcript_id value, a unique identifier for the genomic source or predicted transcript, respectively.
The BED (Browser Extensible Data) file format includes information about sequences that can be visualized in a genome browser as a feature called an annotation track.15 BED files are tab-delimited and include up to 12 fields (columns) of data. The number of columns must be consistent across every row of a file for it to be read correctly.
Only three of the fields are required by genome browsers such as the UCSC browser: chrom (the name of the chromosome or scaffold), chromStart (the starting position of the feature on the chromosome, with 0 being the first base), and chromEnd (the ending position of the feature).16 The other nine fields are optional and include details such as the name of the BED feature, the strand of the feature, and additional ancillary information.
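Because chromStart counts from 0, BED coordinates are easy to mix up with the 1-based positions used by VCF and GFF. The Python sketch below parses the three required fields plus two optional ones (name and strand) and converts a feature to 1-based coordinates. It relies on the UCSC convention that chromEnd is not included in the feature (a half-open interval); the class and function names and the example line are illustrative assumptions.

```python
from typing import NamedTuple, Optional, Tuple


class BedFeature(NamedTuple):
    chrom: str
    chrom_start: int  # 0-based, included in the feature
    chrom_end: int    # not included in the feature (half-open interval)
    name: Optional[str]
    strand: Optional[str]


def parse_bed_line(line: str) -> BedFeature:
    """Parse a tab-delimited BED line with at least the 3 required fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 3:
        raise ValueError("a BED line needs chrom, chromStart, and chromEnd")
    name = columns[3] if len(columns) > 3 else None
    strand = columns[5] if len(columns) > 5 else None
    return BedFeature(columns[0], int(columns[1]), int(columns[2]), name, strand)


def to_one_based(feature: BedFeature) -> Tuple[int, int]:
    """Return (start, end) as 1-based, inclusive coordinates."""
    return feature.chrom_start + 1, feature.chrom_end


def feature_length(feature: BedFeature) -> int:
    """Length in bases under the half-open convention."""
    return feature.chrom_end - feature.chrom_start


# Made-up line: the first 100 bases of chr1
bed = parse_bed_line("chr1\t0\t100\tfeatureA\t0\t+")
print(bed, to_one_based(bed), feature_length(bed))
```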
The Tar.gz format (also called a “Tarball”) is a compressed file type that can store bioinformatics software or raw data.
PDB files contain atomic coordinates and are used by the Protein Data Bank for storing 3D protein structures. For more information on the format, see the full PyMOL guide.17
PED (.ped file extension) is a file format for pedigree analysis that records the familial relationships between different samples.18 It’s used with the PLINK command-line bioinformatics program.
MAP (.map file extension) is a file format that accompanies the PED file format when using the PLINK program.19 It contains variant information.
CSV (.csv file extension) stands for comma-separated values. A CSV file is a text file in which each line is a row and columns are delimited with commas.20 It can store different types of sequencing data and can be opened using common spreadsheet programs like Microsoft Excel.
Why Are There So Many Different Types?
Today, the many different ways of generating and using sequencing data have given rise to the sequence file formats described above. Each file format has its own use cases, depending on:4
- Compatibility with specific software
- Data processing, parsing, and human readability needs
- Efficiency for storage
The file types share several similarities and differences:
- Generally, all of the different file formats are similarly structured: They contain a header with metadata and a body with lines or fields of data.
- FASTA and FASTQ are used to store raw sequencing data, yet the FASTQ file also holds quality data. FASTA also can store DNA, RNA, and protein sequences, while FASTQ usually only contains DNA sequences.
- SAM, BAM, and CRAM files are all similar in that they store sequence alignment data but differ in their compression status
- BED, VCF, and GFF/GTF don’t store raw sequence data and instead have various DNA feature annotations. These feature annotations differ between file types.
With this introduction to the standard file formats in bioinformatics, you should understand the structure of, biological data in, and purpose for each file type. The best way to get a handle on how to use these files in a practical, real-world setting is to start using them regularly and get acquainted with the vast array of bioinformatics tools and software available.
At Form.bio, you can use many of the common file types discussed here for data management, workflow implementation, visualization, and collaboration.
File Format FAQs
What are the common file formats in bioinformatics?
The FASTA file format is one of the most widely used bioinformatics file types. FASTQ is also used broadly due to the widespread adoption of next-generation sequencing. Other common file types include SAM, BAM, CRAM, BED, VCF, GFF, and GTF.
What is flat format in bioinformatics?
A flat file format is a table with a single record per line. FASTA and other file formats are examples of flat file formats in bioinformatics.
What are data types in bioinformatics?
Data types in bioinformatics can be DNA sequences, RNA sequences, amino acid sequences, methylation sequences, three-dimensional protein structures, and more.
Why do we have different sequence file formats?
Scientists use bioinformatic data for many different purposes, and different file types include different kinds of information about a DNA, RNA, or protein sequence. A particular file type may also be chosen for compatibility with specific bioinformatics software or for storage efficiency.