Fields in the Genome-Phenome Analyzer
Variant file format and field values
The SimulConsult Genome-Phenome Analyzer imports genomic data from an annotated variant file in plain text format. The details of the file are specified on this page, and the sample file Trio.txt can be examined as a guide. It is most convenient to examine the file in a spreadsheet program; accordingly, it is best to use a file extension of .txt or .tsv. To facilitate examination in a spreadsheet program, values in the file are tab-separated, allowing the data to be human-readable as columns and "cells". Be careful when saving the file, however, because Excel will convert the gene SEPT9 or the zygosity 1/1 to a date. The free spreadsheet LibreOffice Calc is used by many genomics groups because it avoids many of the "helpful" distortions that Excel performs. Newline characters are used in the file to separate lines (any of the newline conventions, LF, CR+LF, or CR, is allowed if used consistently throughout the file).
The structure of the variant file is as follows:
Row 1: the first "cell" should be:
fileformat=generic
or some other fileformat assigned specifically to your group. This is followed by newline character(s) as discussed above.
Row 2: for a trio this has the 43 column headers listed below. The column headers are tab-separated, with newline character(s) at the end. All headers must be included, even if the column is left blank (e.g., when a particular conservation score is not used) or the individual is not part of the analysis (e.g., for a proband with no parental data, the parental columns are blank except for the headers). If there are any individuals beyond the proband trio, new columns are added after those for the proband's father. For example, if the proband, the father and the proband's sister are included, there would be 4 zygosity column headers: zygProband (with data in rows below), zygMother (header without data), zygFather (with data in rows below), and zygSister (with data in rows below). Additionally, for each individual beyond the trio, 3 column headers (total depth, variant depth and quality) and the data in the rows below are added.
Rows of variant data follow the headers as row 3 and so on (tab-separated, newline at end). Many of the fields are not required and can be left blank if you don't plan to use the related functionality, but even if a field is blank, you need to include the tabs in the variant rows and the headers in row 2. The sample Trio.txt file illustrates how several columns can be blank while their headers are still present.
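As a rough illustration of the three-part layout described above, the following Python sketch assembles a minimal proband-only file. The names `HEADERS` and `build_variant_file` are hypothetical and are not part of the software; the header order is assumed to follow the order of the field table on this page, and the sample values are taken from that table.

```python
# Hypothetical helper: assemble a variant file as text.
# Assumption: the 43 headers appear in the order of the field table on this page.
HEADERS = [
    "hgncSymbol", "geneNameLong", "chrPos",
    "cSeqAnnotation", "cPosition", "cRef", "cAlt",
    "pSeqAnnotation", "pPosition", "pRef", "pAlt",
    "rsid", "zygProband", "zygMother", "zygFather",
    "effect", "freq1", "freq2", "homoShares", "heteroShares",
    "omimNumber", "omimDiseaseNames", "variantAccession", "variantPathogenicity",
    "polyPhen", "mutationTaster", "sift", "gerp", "grantham",
    "phat", "phast", "phyloP", "strandBias", "knownSplice",
    "totDepthP", "varDepthP", "qualP",
    "totDepthM", "varDepthM", "qualM",
    "totDepthF", "varDepthF", "qualF",
]


def build_variant_file(variants, fileformat="generic"):
    """Return the file text: format line, header row, then one row per variant."""
    lines = ["fileformat=" + fileformat, "\t".join(HEADERS)]
    for variant in variants:
        unknown = set(variant) - set(HEADERS)
        if unknown:
            raise ValueError("unknown column(s): " + ", ".join(sorted(unknown)))
        # Every column gets a cell, so blank fields still contribute their tab.
        lines.append("\t".join(str(variant.get(h, "")) for h in HEADERS))
    # LF is used consistently as the newline convention.
    return "\n".join(lines) + "\n"


text = build_variant_file([
    {"hgncSymbol": "GBA", "chrPos": "chr1:155206167",
     "zygProband": "Het", "effect": "missense"},
])
```

Because every row is built from the full header list, the parental columns keep their headers and their tabs even though they hold no data.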
If your data has values that we don't support, such as values for the "effect" field, let us know, and we can add support for them. In many instances blank, ".", "-", -9 and -99 are allowed and are interpreted as not used. In the fields indicated, "NA" and "na" also count as blanks.
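When preparing or checking a file, it can help to apply the same blank conventions before parsing numbers. The sketch below is only an illustration of the rule stated above; `is_blank` and `parse_optional_float` are hypothetical names, and whether "NA" is accepted (the `allow_na` flag) has to be decided per field from the Comments column of the field table.

```python
# Placeholder values that are interpreted as "not used" in many fields.
BLANK_TOKENS = {"", ".", "-", "-9", "-99"}
# Additional placeholders accepted only in the fields that say so.
NA_TOKENS = {"NA", "na"}


def is_blank(value, allow_na=False):
    """True if the cell holds one of the blank placeholders."""
    text = value.strip()
    return text in BLANK_TOKENS or (allow_na and text in NA_TOKENS)


def parse_optional_float(value, allow_na=False):
    """Return None for a blank cell, otherwise the cell as a float."""
    if is_blank(value, allow_na):
        return None
    return float(value)


freq = parse_optional_float(".0015")
missing = parse_optional_float("-99")
```

Note that this sketch treats -9 and -99 as blank for every field it is applied to, so it should not be used on a field where those are legitimate data values.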
The software uploads and analyzes variant files with ~20,000 variants in ~1 second. The "Variant file processing" textarea on the File page of the software will report the results of file reading or problems encountered, for example stated gender not matching chromosomal gender.
Field (column header) | Sample value | Required | Action | Comments |
---|---|---|---|---|
hgncSymbol | GBA | yes (all) | label | Multiple symbols separated by commas are supported |
geneNameLong | glucosidase, beta, acid | no | report | |
chrPos (or specify the genome assembly using HG19:chrPos (the default) or HG38:chrPos) | chr1:155206167 or 1:155206167 | yes (all) | compute | The chromosome number and position are displayed in the gene variants display, hyperlinked to the UCSC genome browser, using the genome assembly indicated in the header using the format shown at left. If the Alamut genome browser is open, that is used, using its settings. The chromosome number is also used in choosing the inheritance model used in the Gene Discovery display. Also, unusual distributions of variants over the chromosomes are reported in the variant table processing text area. Entries may start with chr or with the number or letter for the chromosome. For chromosome designations, 23 or X, 24 or Y, and 25, M or MT are supported. Characters past a colon are used to display position information. |
cSeqAnnotation | NM_014208:ex5 NM_014208:ex5:c.A3447T:p.E1149D | yes for some of the 8 fields beginning with this one | link | Text (if desired, this field can combine this and the next 7 sequence annotations, and multiple sequence annotations separated by commas are supported). If a DNA position and change is recognized, it is used to construct a ClinVar query. |
cPosition | 38 | no | link | If this and the next 2 fields are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene |
cRef | A | no | link | If this and the fields before and after are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene |
cAlt | G | yes, if using zygosities with more than one non-wildtype form | link | If this and the previous 2 fields are included, ClinVar URLs for these variants are displayed instead of generic URLs for the gene |
pSeqAnnotation | NP_00100574.1 or p.E1149D or E1149D | no | link | If no DNA position and change is recognized, the protein change is used to construct a ClinVar query. |
pPosition | 13 | no | report | |
pRef | K | no | report | |
pAlt | R | no | report | |
rsid | rs12345678 | no | compute | The percent of variants with rsID numbers is reported in the variant table processing text area. |
zygProband (to use identifiers within the software, use terms such as zygPaula here and Paula will be used as the identifier within the software) | Het | yes | compute | Zygosity of proband: accepts integers from 0 to 100, or case-insensitive text (het, heterozygous, hap, hom, homozygous, hemi, wt). Inputs treated as wt: unknown, ., -, none. Inputs of the form x/x (or the phased equivalents using \|) are treated as hom for nonzero x; otherwise wt. Inputs of the form x/y can contribute to compound heterozygotes, even at the same locus. If the / or \| forms are used, 1 (with no pipe or slash) is interpreted as hap (hemi). |
zygMother (to use identifiers within the software, use terms such as zygINDIV_35 here and INDIV_35 will be used as the identifier within the software) | 50 | no, if no genomic data is available from the proband's mother leave this blank but use a header | compute | Zygosity of mother, as above |
zygFather (to use identifiers within the software, use terms such as zygGeorge here and George will be used as the identifier within the software) | 0 | no, if no genomic data is available from the proband's father leave this blank but use a header | compute | Zygosity of father, as above |
effect | missense | yes (most) | compute | Effect terms that are recognized are listed at this link, though each group typically uses only a few of these, such as the core terms missense, frameshift, and synonymous. The listing here is periodically updated to include all SnpEff effect prediction terms, but let us know if we are missing any. The terms are case-insensitive. If multiple terms are used (separated by "\|" or "&" or "," or "/"), each effect term is considered and the one with the highest severity is chosen. |
freq1 (to use identifiers within the software, use terms such as freqExAC here and ExAC will be used as the identifier in the mini variant table) | .0015 | no | compute | Real numbers between 0 and 1; use the main frequency metric of your choice (e.g., 1000genomes). "NA" and "na" are interpreted as 0, in addition to the usual blank characters. |
freq2 (to use identifiers within the software, use terms such as freqAfrica here and Africa will be used as the identifier in the mini variant table) | .02 | no | compute | Real numbers between 0 and 1, as above. On the Set Variant Parameters screen these can be chosen. |
homoShares | 0 | no | compute | Count of times this particular mutation has been seen in homozygous form in unaffected individuals (for example, at your lab) (non-negative integers) |
heteroShares | 3 | no | compute | Count of times this particular mutation has been seen in heterozygous form in unaffected individuals (for example, at your lab) (non-negative integers) |
omimNumber | 606463 | no | report | Six-digit number corresponding to the gene |
omimDiseaseNames | Gaucher disease | no | report | Multiple diseases can be strung together for display |
variantAccession | CM065215 | no | report | Variant accession number |
variantPathogenicity | DM or 5 | no | compute | Variant pathogenicity report. For HGMD variant pathogenicity, DM is treated as severity 5, DM? as 4, and DP as 3. ClinVar values from 2 to 5 and their verbal equivalents (capitalized or uncapitalized) are treated as follows: 2 or Benign are severity 1, 3 or Likely benign are severity 2, 4 or Likely pathogenic are severity 3, and 5 or Pathogenic are severity 5. These values override other variant severity determinations, though the other scoring is described in the mini variant table that is displayed for non-benign variants. |
polyPhen | probably-damaging | no | compute | Either verbal (probably-damaging, possibly-damaging, benign, DP, FP, DFP, D, P, B) or numerical (real number between 0 and 1, with damaging values near 1), but not both. If two terms are present, the first one is used. |
mutationTaster | 0.73 | no | compute | A, D, N, P (case-insensitive) or real number between 0 and 1, with damaging values near 1. |
sift | 0.68 | no | compute | Either verbal (D, T (case-insensitive)) or numerical (real number between 0 and 1, with damaging values near 0 (can be configured for customers to use damaging values near 1)), but not both. If two terms are present, the first one is used. |
gerp | 5.34 | no | compute | Real number, with higher numbers more damaging. |
grantham | 29 | no | compute | 0-215, with higher numbers more damaging. |
phat | 0.82 | no | compute | Real number between 0 and 1, with higher numbers more damaging. "NA" and "na" are interpreted as 0, in addition to the usual blank characters. |
phast | 0.95 | no | compute | Real number between 0 and 1, with higher numbers more damaging. "NA" and "na" are interpreted as 0, in addition to the usual blank characters. |
phyloP | 0.75 | no | compute | Rankscore real number between 0 and 1, with higher numbers more damaging. "NA" and "na" are interpreted as 0, in addition to the usual blank characters. |
strandBias | | no | | |
knownSplice | | no | compute | Real number between -1 and 1 reflecting disruption of a known splice site, with values near +1 being more damaging. |
totDepthP | 105 | no | compute | Total read depth for proband; non-negative integer. Blank or low values result in exclusion of the variant, but blanks for all variants result in this metric being ignored. |
varDepthP | 74 | no | report | Variant read depth for proband; non-negative integer. |
qualP | 162 | no | compute | Read quality score for proband; non-negative integer. Blank or low values result in exclusion of the variant, but blanks for all variants result in this metric being ignored. |
totDepthM | 99 | no | report | Total read depth for mother; non-negative integer. |
varDepthM | 99 | no | report | Variant read depth for mother; non-negative integer. |
qualM | 99 | no | report | Read quality score for mother; non-negative integer. |
totDepthF | 250 | no | report | Total read depth for father; non-negative integer. |
varDepthF | 99 | no | report | Variant read depth for father; non-negative integer. |
qualF | 99 | no | report | Read quality score for father; non-negative integer. |
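
The zygosity rules given for zygProband in the table above can be sketched in Python as follows. This is an illustration of the stated rules, not the software's own code. `classify_zygosity` and the `slash_forms_in_use` flag are hypothetical names. The table does not say how the integers from 0 to 100 are interpreted, so the sketch passes them through unchanged, and it assumes a completely blank cell means that no data is available for that individual.

```python
import re

# Case-insensitive text forms and the category each maps to.
TEXT_FORMS = {
    "het": "het", "heterozygous": "het",
    "hom": "hom", "homozygous": "hom",
    "hap": "hap", "hemi": "hap",
    "wt": "wt",
    # Inputs that the table says are treated as wt.
    "unknown": "wt", ".": "wt", "-": "wt", "none": "wt",
}
# x/y or the phased equivalent x|y.
GENOTYPE = re.compile(r"([0-9]+)[/|]([0-9]+)")


def classify_zygosity(value, slash_forms_in_use=False):
    """Map a zygosity cell to 'wt', 'het', 'hom', 'hap', 'int:<n>' or None."""
    text = value.strip().lower()
    if text == "":
        return None  # assumption: blank cell means no data for this individual
    if text in TEXT_FORMS:
        return TEXT_FORMS[text]
    match = GENOTYPE.fullmatch(text)
    if match:
        first, second = int(match.group(1)), int(match.group(2))
        if first == second:
            return "hom" if first != 0 else "wt"  # x/x: hom unless 0/0
        return "het"  # x/y, including two different non-wildtype alleles
    if re.fullmatch(r"[0-9]+", text):
        number = int(text)
        if slash_forms_in_use and number == 1:
            return "hap"  # a bare 1 in a file that uses / or | forms
        if number <= 100:
            return "int:" + str(number)  # meaning not specified here
    raise ValueError("unrecognized zygosity value: " + repr(value))


examples = [classify_zygosity(v) for v in ("Het", "1/1", "0|1", "0/0", "50")]
```

The `slash_forms_in_use` flag exists because the meaning of a bare 1 depends on the rest of the file, so a caller would need to scan the column for / or | forms before classifying individual cells.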