The Dataflow
A typical workflow for genomic data starts with loading the data into a database (Oracle, Postgres, or MySQL-compatible, such as MariaDB). Files in a variety of common bioinformatic formats are parsed, and the data is indexed, compressed, and stored in a system specifically designed for fast retrieval. The typical data types and file formats are:
- reference genomic sequence(s) (from FASTA or GenBank formats)
- linkage groups (CSV or tab-delimited files) with marker positions in cM
- gene annotation tracks (GFF3, GTF, BED, or GenBank files)
- marker tracks: features with a name and a position on the genome, which can span large regions (CSV or tab-delimited files)
- quantitative trait loci (QTLs) (CSV or tab-delimited files)
- quantitative tracks (bedGraph, bigWig, wiggle, bedMethyl)
- individual resequencing data (variants: SNPs and indels) (VCF files)
- orthologs (paralogs) that help link syntenic maps (tab-delimited files)
- synteny regions (could be based on BLASTN, minimap, or MUMmer output) (CSV, chain files)
- whole-genome precomputed TBLASTN tracks (tab-delimited NCBI-BLAST output)
- aligned sequencing reads (BAM/CRAM files)
The main data-processing component that understands these formats and uploads the data into the database is called PersephoneShell. It reads the given data files following the instructions in an INI-formatted control file that specifies how to interpret the data. For example, an INI file may contain rules for parsing FASTA headers or a list of the GFF3 attributes to load. PersephoneShell can run in an interactive mode, displaying prompts and helping with auto-completion of commands. Alternatively, it can be called from scripts to execute commands in batch mode.
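To illustrate the idea of a control file that carries parsing rules, here is a minimal Python sketch. It is not PersephoneShell's actual INI schema: the section name `fasta`, the key `header_regex`, and the sample header are all hypothetical, chosen only to show how a rule stored in an INI file can drive the interpretation of a FASTA header.

```python
import configparser
import re

# Hypothetical control file. The section and key names are illustrative
# assumptions, not the real PersephoneShell INI format.
CONTROL_INI = r"""
[fasta]
header_regex = ^>(?P<name>\S+)\s+length=(?P<length>\d+)
"""


def parse_header(header_line, pattern):
    """Apply the configured regex to one FASTA header line."""
    match = pattern.match(header_line)
    if match is None:
        raise ValueError(f"Header does not match the configured rule: {header_line!r}")
    return match.groupdict()


# Interpolation is disabled so that regex characters are read literally.
config = configparser.ConfigParser(interpolation=None)
config.read_string(CONTROL_INI)
pattern = re.compile(config["fasta"]["header_regex"])

# Sample header invented for this example.
record = parse_header(">Chr1 length=30427671", pattern)
print(record)
```

Keeping the rule in the control file rather than in code means a differently formatted header only requires editing the INI file, which is presumably the motivation for a control-file-driven loader.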
The main Persephone client application runs in a web browser.
Storing the data in the database has several advantages, especially in a corporate environment:
– consolidating a variety of files in a central company-wide repository;
– using common nomenclature, which helps with an inventory of bioinformatic assets. For instance, it is easy to see which primer or probe sequences are associated with a marker or which genomes the marker has been mapped onto;
– pre-processing large volumes and optimizing storage and retrieval of data. The system has been used with billions of SNPs and millions of maps and markers.
When browsing data from the database with Persephone, users can additionally drag and drop their own external files to visualize custom tracks. A powerful export engine also allows analyzing or exporting entire data sets, for example, promoter regions for a list of genes with common functions, or all protein sequences generated from gene models predicted on a genome.
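As a rough illustration of what exporting promoter regions for a list of genes involves, the sketch below computes an upstream window for each gene. It does not describe how Persephone's export engine works internally: the `Gene` type, the `promoter_region` function, the 1000 bp window, and the sample genes are all assumptions made for this example. Coordinates are 1-based and inclusive, as in GFF3.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def promoter_region(gene, chrom_length, upstream=1000):
    """Return (chrom, start, end) of the window upstream of the gene,
    or None if the gene sits at the very edge of the sequence."""
    if gene.strand == "+":
        # Upstream of a forward-strand gene lies before its start.
        start, end = max(1, gene.start - upstream), gene.start - 1
    elif gene.strand == "-":
        # Upstream of a reverse-strand gene lies after its end.
        start, end = gene.end + 1, min(chrom_length, gene.end + upstream)
    else:
        raise ValueError(f"Unknown strand: {gene.strand!r}")
    if start > end:
        return None
    return (gene.chrom, start, end)


# Sample genes invented for this example.
genes = [
    Gene("geneA", "Chr1", 5000, 7000, "+"),
    Gene("geneB", "Chr1", 9500, 9900, "-"),
]
regions = {g.name: promoter_region(g, chrom_length=10000) for g in genes}
print(regions)
```

The window is clipped at the sequence boundaries, so a gene near the end of a chromosome yields a shorter region rather than invalid coordinates.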
The Persephone software stack is typically installed using a Docker image that contains all necessary components.
Please see the complete documentation on the system setup, working with Persephone, and loading the data.