About Persephone

Persephone® was originally designed as a genome viewer for gene predictors. It allowed users to align various types of evidence with gene models and assess the quality of gene structure predictions.

The tool later evolved to visualize genetic maps, QTLs, SNPs, RNA-seq, synteny, etc., and to facilitate fast navigation in the ever-expanding world of genomic information.

Today, Persephone is a state-of-the-art, industrial-scale genome browser that can rapidly display large data sets thanks to unique compression algorithms, optimized data transfer, and a fast rendering engine that uses technologies borrowed from the gaming industry.

The entire Persephone framework consists of a database, an API server that handles communications, BLAST and text search servers, a loader utility to populate the database, and the main client application. Users can also quickly visualize their data without loading it into the database by dragging and dropping local files or reading them from remote locations via URL.

Persephone is used in large corporations and academic institutions on a daily basis and continues to uphold its reputation for stability, power, and versatility.

The Dataflow

A typical workflow for genomic data starts with loading the data into the database (Oracle, Postgres, or MySQL-compatible, such as MariaDB). A variety of files in common bioinformatics formats are parsed, and the data is indexed, compressed, and stored in a system specifically designed for fast retrieval. The typical data types and file formats are:

  • reference genomic sequence(s) (from FASTA or GenBank formats)
  • linkage groups (CSV or tab-delimited files) with marker positions in cM
  • gene annotation tracks (GFF3, GTF, BED, or GenBank files)
  • marker tracks – features with a name and a position on the genome, which can be large regions (CSV or tab-delimited files)
  • quantitative trait loci (QTLs) (CSV or tab-delimited files)
  • quantitative tracks (bedGraph, bigWig, wiggle, bedMethyl)
  • individual resequencing data (variants: SNPs and indels) (VCF files)
  • orthologs (paralogs) that help link syntenic maps (tab-delimited files)
  • synteny regions (could be based on BLASTN, minimap, or MUMmer output) (CSV, chain files)
  • whole-genome precomputed TBLASTN tracks (tab-delimited NCBI-BLAST output)
  • BAM/CRAM files

The main data-processing component that understands these formats and uploads the data into the database is called PersephoneShell. It reads the given data files following the instructions in an INI-formatted control file that specifies how to interpret the data. For example, an INI file may contain rules on how to parse FASTA headers or which GFF3 attributes to load. PersephoneShell can run in interactive mode, displaying prompts and offering auto-completion of commands. Alternatively, it can be called from scripts to execute commands in batch mode.
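To make the control-file idea more concrete, here is a minimal Python sketch of how INI rules could drive parsing. It is an illustration only: the section names, key names, and helper functions (`load_rules`, `parse_fasta_header`, `parse_gff3_attributes`) are hypothetical and do not represent PersephoneShell's actual INI schema or code.

```python
import configparser
import re

# Hypothetical control file: the section and key names are invented for
# illustration and are NOT PersephoneShell's real INI schema.
CONTROL_INI = r"""
[fasta]
header_regex = ^(?P<name>\S+)\s+len=(?P<length>\d+)

[gff3]
attributes = ID, Name, Note
"""


def load_rules(ini_text):
    """Read the parsing rules from INI text."""
    # Interpolation is disabled so regex characters are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    header_re = re.compile(parser["fasta"]["header_regex"])
    attributes = [a.strip() for a in parser["gff3"]["attributes"].split(",")]
    return header_re, attributes


def parse_fasta_header(line, header_re):
    """Apply the configured regex to a FASTA header line."""
    match = header_re.match(line.lstrip(">"))
    if match is None:
        raise ValueError(f"header does not match control-file rule: {line!r}")
    return {"name": match["name"], "length": int(match["length"])}


def parse_gff3_attributes(column9, keep):
    """Keep only the GFF3 attributes (column 9) listed in the control file."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs if key in keep}


header_re, attributes = load_rules(CONTROL_INI)
print(parse_fasta_header(">chr1 len=1000", header_re))
print(parse_gff3_attributes("ID=gene1;Name=abc;Dbxref=X:1", attributes))
```

Keeping the rules in a separate file means the same loader can handle differently formatted inputs without code changes, which is the benefit the control file is described as providing.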

The main Persephone client application runs in a web browser.

Storing the data in the database has several advantages, especially in the corporate environment, namely:
– consolidating a variety of files in a central company-wide repository;
– using common nomenclature, which helps with an inventory of bioinformatic assets. For instance, it is easy to see which primer or probe sequences are associated with a marker or which genomes the marker has been mapped onto;
– pre-processing large volumes and optimizing storage and retrieval of data. The system has been used with billions of SNPs and millions of maps and markers.

When browsing data from the database with Persephone, users can additionally drag and drop their own external files to visualize custom tracks. A powerful export engine also allows analyzing or exporting entire data sets, for example, promoter regions for a list of genes with common functions or all protein sequences generated from gene models predicted on a genome.
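As a rough illustration of what a promoter-region export involves, the Python sketch below computes the strand-aware region upstream of a gene and extracts its sequence. It is not Persephone's export engine; the `Gene` record, the function names, and the default upstream length of 1000 bp are assumptions made for this example. Coordinates are 1-based and inclusive, as in GFF3.

```python
from typing import NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive (GFF3 convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def promoter_interval(gene: Gene, chrom_length: int,
                      upstream: int = 1000) -> Optional[Tuple[int, int]]:
    """Return the 1-based inclusive interval upstream of the gene, or None."""
    if gene.strand == "+":
        # Upstream of a forward-strand gene lies before its start.
        start, end = max(1, gene.start - upstream), gene.start - 1
    elif gene.strand == "-":
        # Upstream of a reverse-strand gene lies after its end.
        start, end = gene.end + 1, min(chrom_length, gene.end + upstream)
    else:
        raise ValueError(f"unknown strand: {gene.strand!r}")
    return (start, end) if start <= end else None


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def promoter_sequence(gene: Gene, chrom_seq: str, upstream: int = 1000) -> str:
    """Extract the promoter sequence, oriented 5'->3' relative to the gene."""
    interval = promoter_interval(gene, len(chrom_seq), upstream)
    if interval is None:
        return ""  # gene sits at the chromosome edge; nothing upstream
    start, end = interval
    seq = chrom_seq[start - 1:end]  # convert to 0-based slicing
    return reverse_complement(seq) if gene.strand == "-" else seq


chrom = "AAAACCCCGGGGTTTT"
print(promoter_sequence(Gene("g1", 5, 8, "+"), chrom, upstream=3))
print(promoter_sequence(Gene("g2", 5, 8, "-"), chrom, upstream=3))
```

The interval is clipped at the chromosome ends, so genes near an edge yield a shorter region (or none) instead of raising an error.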

The Persephone software stack is typically installed using a Docker image that contains all necessary components.

Please see the complete documentation on the system setup, working with Persephone, and loading the data.

What makes Persephone unique?

Designed for very large data sets – Persephone has been designed to optimize storage, access, and visualization of genetic information for large genomes. Plants, for example, carry highly complex genetic information. Healthcare has lagged behind agriculture in the use of genetic information because large populations, which are commonplace in crops, have been difficult to access. With the amount of human genetic data exploding as genomics moves into the clinic, Persephone offers a scalable solution that can handle terabytes of data while keeping it easily accessible.

Optimized data compression and memory usage – Knowing the characteristics of genetic data enables Persephone to achieve compression rates better than standard ZIP compression. The same compression capability allows Persephone to optimize memory usage across the system and to handle large files in real time.
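A simple, well-known example of how knowledge of the data helps compression is 2-bit packing of nucleotides: because DNA has only four bases, each one fits in 2 bits instead of the 8 bits of a text character. The Python sketch below illustrates that general idea only. It is not Persephone's compression algorithm, which is not described here, and it assumes an input restricted to A, C, G, and T (real sequences also contain N and other ambiguity codes that would need separate handling).

```python
# Illustrative 2-bit nucleotide packing; NOT Persephone's actual algorithm.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Restore the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


sequence = "ACGTACGTAC"
packed = pack(sequence)
print(len(sequence), "bases stored in", len(packed), "bytes")
assert unpack(packed, len(sequence)) == sequence
```

The original length has to be stored alongside the packed bytes, because the last byte may be padded.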

Capacity to scan whole genomes – Currently, many genetic tests consist of a panel of one or a few genes. Whole genome sequencing is demonstrating that while there are a few mutations that are common to particular diseases, there are many mutations that are unique to individuals. Just as radiologists are necessary to interpret the subtle differences and indicators in X-ray or MRI data, it is likely that clinical geneticists will need tools to interpret whole genomes and their individual differences, not just panels of a few genes. As new genetic discoveries are made, Persephone enables the rapid identification of afflicted individuals without having to rescan whole genomes.

Intuitive user interface – Persephone’s easy-to-use, intuitive interface lets non-expert users explore and compare genetic information with familiar point-and-click, drag-and-drop, cut-and-paste, and zoom functions. To enable better interpretation, Persephone organizes genetic information for easy visualization, search, filtering, and comparison. Persephone’s high-performance graphics provide smooth, animated genome visualization.

If you would like to try the web version, please follow this link. A quick-start guide is available here.

Four chromosomes from rice, sorghum, brachypodium and corn are aligned using ortholog pairs