Fields in the Genome-Phenome Analyzer

Variant file format and field values

The SimulConsult Genome-Phenome Analyzer imports genomic data from an annotated variant file in plain text format.  The details of the file are specified on this page, and the sample file Trio.txt can be examined as a guide.  The file is most conveniently examined in a spreadsheet program, so it is best to use a file extension of .txt or .tsv.  Values in the file are tab-separated, which makes the data human-readable as columns and "cells".  Be careful when saving the file, however, because Excel can convert values such as the gene symbol SEPT9 or the zygosity 1/1 to dates.  The free spreadsheet LibreOffice Calc is used by many genomics groups because it avoids many of the "helpful" distortions that Excel performs.  Newline characters separate the lines of the file; any of the newline conventions (LF, CR+LF, or CR) is allowed if used consistently throughout the file.

The structure of the variant file is as follows:

Row 1: the first "cell" should be:
fileformat=generic
or some other fileformat assigned specifically to your group.  This is followed by newline character(s) as discussed above.

Row 2: for a trio, this row has the 43 column headers listed below.  The column headers are tab-separated, with newline character(s) at the end.  All headers must be included, even if the column is blank (e.g., when not using a particular conservation score) or an individual in the proband trio is absent (e.g., for a proband with no parents, the parental columns are blank except for the headers).  If there are any individuals beyond the proband trio, new columns are added after those for the proband's father.  For example, if the proband, the father and the proband's sister are included, there would be 4 zygosity column headers: zygProband (with data in rows below), zygMother (header without data), zygFather (with data in rows below), and zygSister (with data in rows below).  Additionally, for each individual beyond the trio, 3 column headers and the data in rows below would be added (total depth, variant depth and quality).

Rows of variant data follow the headers as row 3 and so on (tab-separated, newline at end).  Many of the fields are not required and can be left blank if you don't plan to use the related functionality, but even if a field is blank, you need to include the tabs in the variant rows and the headers in row 2.  The sample Trio.txt file illustrates how several columns can be blank while their headers are still present.
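The structure described above can be checked before upload with a short script.  The following Python sketch is only an illustration and is not part of the SimulConsult software: the function name check_variant_file is hypothetical, and it assumes that a group-specific first cell still begins with "fileformat=".  It checks that row 1 starts with a fileformat cell, that row 2 has 43 headers plus 4 for each individual beyond the trio (1 zygosity column and 3 depth/quality columns), and that every variant row has as many tab-separated cells as the header row.

```python
BASE_COLUMNS = 43        # trio header count stated in the text above
COLUMNS_PER_EXTRA = 4    # 1 zygosity + total depth, variant depth, quality


def check_variant_file(text):
    """Return a list of structural problems found in a variant file's text."""
    # splitlines() accepts LF, CR+LF and CR (and a few rarer separators).
    lines = text.splitlines()
    if len(lines) < 2:
        return ["file needs a fileformat row and a header row"]

    problems = []

    # Assumption: group-specific formats also start with "fileformat=".
    first_cell = lines[0].split("\t")[0]
    if not first_cell.startswith("fileformat="):
        problems.append("row 1: first cell should look like fileformat=generic")

    headers = lines[1].split("\t")
    extra = len(headers) - BASE_COLUMNS
    if extra < 0 or extra % COLUMNS_PER_EXTRA != 0:
        problems.append(
            f"row 2: {len(headers)} headers; expected 43 plus 4 per extra individual"
        )

    # Blank fields still need their tabs, so every row has the same cell count.
    for number, line in enumerate(lines[2:], start=3):
        cells = line.split("\t")
        if len(cells) != len(headers):
            problems.append(
                f"row {number}: {len(cells)} cells, header row has {len(headers)}"
            )
    return problems
```

An empty list means that no structural problem was found; it says nothing about whether the field values themselves are acceptable.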

If your data has values that we don't support, such as values for the "effect" field, let us know, and we can support your values.  In many instances blank, ".", "-", -9 and -99 are allowed and are interpreted as not used.  In the fields indicated, "NA" and "na" also count as blanks.
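When preparing a file with a script, it can help to treat these placeholder values uniformly.  The Python sketch below is a hypothetical helper (is_unused is not a name from the software), and stripping surrounding whitespace is an assumption.  Because the placeholders are allowed only "in many instances", the helper should be applied field by field: in the zygosity columns, for example, "." and "-" mean wt rather than not used.

```python
# Placeholders that the text above lists as "not used".
BLANK_TOKENS = {"", ".", "-", "-9", "-99"}
# Accepted as blanks only in the fields that say so.
NA_TOKENS = {"NA", "na"}


def is_unused(value, na_counts_as_blank=False):
    """Return True if a cell holds one of the 'not used' placeholders."""
    token = value.strip()  # assumption: surrounding whitespace is ignored
    if token in BLANK_TOKENS:
        return True
    return na_counts_as_blank and token in NA_TOKENS
```

For example, is_unused("-99") is True, while is_unused("NA") is True only when na_counts_as_blank=True is passed for a field that accepts "NA".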

The software uploads and analyzes variant files with ~20,000 variants in ~1 second.  The "Variant file processing" textarea on the File page of the software will report the results of file reading or problems encountered, for example stated gender not matching chromosomal gender.

Field (column header) Sample value Required Action Comments
hgncSymbol GBA
yes (all)
label Multiple symbols separated by commas are supported
geneNameLong glucosidase, beta, acid
no
report  
chrPos
(or specify the genome assembly using HG19:chrPos (the default) or HG38:chrPos)
chr1:155206167 or 1:155206167
yes (all)
compute

The chromosome number and position are displayed in the gene variants display, hyperlinked to the UCSC genome browser, using the genome assembly indicated in the header using the format shown at left.  If the Alamut genome browser is open, that is used, using its settings.  The chromosome number is also used in choosing the inheritance model used in the Gene Discovery display.  Also, unusual distributions of variants over the chromosomes are reported in the variant table processing text area.  Entries may start with chr or the number or letter for the chromosome.  For chromosome designations, 23 or X, 24 or Y, 25, M or MT are supported.  Characters past a colon are used to display position information. 

cSeqAnnotation NM_014208:ex5 NM_014208:ex5:c.A3447T:p.E1149D
yes for some of the 8 fields beginning with this one
link Text (if desired, can combine this + next 7 sequence annotations in this field, and multiple sequence annotations separated by commas are supported). If a DNA position and change is recognized, it is used to construct a ClinVar query.
cPosition 38
no
link If this and the next 2 fields are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene
cRef A
no
link If this and the fields before and after are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene
cAlt G
yes, if using zygosities with more than one non-wildtype form
link If this and the previous 2 fields are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene
pSeqAnnotation NP_00100574.1 or p.E1149D or E1149D
no
link If no DNA position and change is recognized, the protein change is used to construct a ClinVar query.
pPosition 13
no
report  
pRef K
no
report  
pAlt R
no
report  
rsid rs12345678
no
compute The percent of variants with rsID numbers is reported in the variant table processing text area.
zygProband (to use identifiers within the software, use terms such as zygPaula here and Paula will be used as the identifier within the software) Het
yes
compute Zygosity of proband: accepts integers from 0 to 100, or case-insensitive text (het, heterozygous, hap, hom, homozygous, hemi, wt).  Inputs treated as wt: unknown, ., -, none.  Inputs of the form x/x (or the phased equivalents using |) are treated as hom for nonzero x; otherwise wt.  Inputs of the form x/y can contribute to compound heterozygotes, even at the same locus.  If the / or | forms are used, 1 (with no pipe or slash) is interpreted as hap (hemi).
zygMother (to use identifiers within the software, use terms such as zygINDIV_35 here and INDIV_35 will be used as the identifier within the software) 50
no, if no genomic data is available from the proband's mother leave this blank but use a header
compute Zygosity of mother, as above
zygFather (to use identifiers within the software, use terms such as zygGeorge here and George will be used as the identifier within the software) 0
no, if no genomic data is available from the proband's father leave this blank but use a header
compute Zygosity of father, as above
(for beyond the trio, the next additional zygosity column goes here)
effect missense
yes (most)
compute

Effect terms that are recognized are listed at this link, though each group typically uses only a few of these, such as the core terms missense, frameshift, and synonymous.  The listing here is periodically updated to include all SnpEff effect prediction terms, but let us know if we are missing any.  The terms are case-insensitive.  If multiple terms are used (separated by "|" or "&" or "," or "/"), each effect term is considered and the one with the highest severity is chosen. 

freq1 (to use identifiers within the software, use terms such as freqExAC here and ExAC will be used as the identifier in the mini variant table) .0015
no
compute Real numbers between 0 and 1; use the main frequency metric of your choice (e.g., 1000genomes).  "NA" and "na" are interpreted as 0, in addition to the usual blank characters.
freq2 (to use identifiers within the software, use terms such as freqAfrica here and Africa will be used as the identifier in the mini variant table) .02
no
compute Real numbers between 0 and 1, as above.  On the Set Variant Parameters screen these can be chosen.
homoShares 0
no
compute Count of times this particular mutation has been seen in homozygous form in unaffected individuals (for example, at your lab); non-negative integer.
heteroShares 3
no
compute Count of times this particular mutation has been seen in heterozygous form in unaffected individuals (for example, at your lab); non-negative integer.
omimNumber 606463
no
report Six digit number corresponding to the gene
omimDiseaseNames Gaucher disease
no
report Multiple diseases can be strung together for display
variantAccession CM065215
no
report Variant accession number
variantPathogenicity DM or 5
no
compute Variant pathogenicity report.  For HGMD variant pathogenicity, DM is treated as severity 5, DM? as 4, and DP as 3.  ClinVar values from 2 to 5 and their verbal equivalents (capitalized or uncapitalized) are treated as follows: 2 or Benign is severity 1, 3 or Likely benign is severity 2, 4 or Likely pathogenic is severity 3, and 5 or Pathogenic is severity 5.  These values override other variant severity determinations, though the other scoring is described in the mini variant table that is displayed for non-benign variants.
polyPhen probably-damaging
no
compute Either verbal (probably-damaging, possibly-damaging, benign, DP, FP, DFP, D, P, B), or numerical (real number between 0 and 1, with damaging values near 1) but not both.  If two terms are present the first one is used.
mutationTaster 0.73
no
compute A, D, N, P (case-insensitive) or real number between 0 and 1, with damaging values near 1. 
sift 0.68
no
compute Either verbal (D, T (case-insensitive)) or numerical (real number between 0 and 1, with damaging values near 0 (can be configured for customers to use damaging values near 1)) but not both.  If two terms are present the first one is used.
gerp 5.34
no
compute Real number, with higher numbers more damaging.
grantham 29
no
compute 0-215, with higher numbers more damaging.
phat 0.82
no
compute Real number between 0 and 1, with higher numbers more damaging.  "NA" and "na" are interpreted as 0, in addition to the usual blank characters.
phast 0.95
no
compute Real number between 0 and 1, with higher numbers more damaging.  "NA" and "na" are interpreted as 0, in addition to the usual blank characters.
phyloP 0.75
no
compute Rankscore as a real number between 0 and 1, with higher numbers more damaging.  "NA" and "na" are interpreted as 0, in addition to the usual blank characters.
strandBias
no
 
knownSplice
no
compute Real number between -1 and 1 reflecting disruption of a known splice site, with values near +1 being more damaging.
totDepthP 105
no
compute Total read depth for proband; non-negative integer. Blank or low values result in exclusion of the variant, but blanks for all variants result in this metric being ignored.
varDepthP 74
no
report Variant read depth for proband; non-negative integer. 
qualP 162
no
compute Read quality score for proband; non-negative integer. Blank or low values result in exclusion of the variant, but blanks for all variants result in this metric being ignored.
totDepthM 99
no
report Total read depth for mother; non-negative integer. 
varDepthM 99
no
report Variant read depth for mother; non-negative integer. 
qualM 99
no
report Read quality score for mother; non-negative integer. 
totDepthF 250
no
report Total read depth for father; non-negative integer. 
varDepthF 99
no
report Variant read depth for father; non-negative integer. 
qualF 99
no
report Read quality score for father; non-negative integer. 
(for beyond the trio, the next additional total depth column goes here)
(for beyond the trio, the next additional variant depth column goes here)
(for beyond the trio, the next additional quality column goes here)
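
The zygosity rules given for zygProband (and reused for zygMother and zygFather) can be sketched in Python as follows.  This is an illustration under stated assumptions, not the software's implementation: the function name classify_zygosity and the slash_forms_in_use flag are hypothetical, hap and hemi are reported as a single category, an x/y input with different alleles is labeled het because it can contribute to compound heterozygotes, and the wt synonyms are matched case-insensitively.  Bare integers from 0 to 100 are deliberately left out because this page does not say how they map to zygosity.

```python
import re

# Text forms from the zygProband description, matched case-insensitively.
TEXT_FORMS = {
    "het": "het", "heterozygous": "het",
    "hom": "hom", "homozygous": "hom",
    "hap": "hemi", "hemi": "hemi",
    "wt": "wt", "unknown": "wt", ".": "wt", "-": "wt", "none": "wt",
}

# x/y (unphased) or x|y (phased), with non-negative integer alleles.
GENOTYPE = re.compile(r"^(\d+)[/|](\d+)$")


def classify_zygosity(value, slash_forms_in_use=False):
    """Map a zygosity cell to 'het', 'hom', 'hemi' or 'wt'."""
    token = value.strip().lower()
    if token in TEXT_FORMS:
        return TEXT_FORMS[token]

    match = GENOTYPE.match(token)
    if match:
        first, second = int(match.group(1)), int(match.group(2))
        if first == second:
            # x/x is hom for nonzero x; 0/0 is wt.
            return "hom" if first != 0 else "wt"
        # Assumption: differing alleles are labeled het.
        return "het"

    # A bare 1 means hap (hemi) only when the file uses / or | forms.
    if slash_forms_in_use and token == "1":
        return "hemi"

    raise ValueError(f"zygosity value not covered by this sketch: {value!r}")
```

For a file that uses the / or | forms, call classify_zygosity(cell, slash_forms_in_use=True) so that a bare 1 is read as hemi.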