Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

VCF-VQSR-normalizedQC

Output files
  • vcf_vqsr_normalizedQC/
    • *.biallelic.leftnorm.ABCheck.vcf.gz: The normalized and QC’d VCF files (one file for each chromosome).
    • *.biallelic.leftnorm.ABCheck.vcf.gz.tbi: Index files for the normalized and QC’d VCF files.
    • *.biallelic.leftnorm.ABCheck.vcf.gz.gds: GDS format files for the normalized and QC’d VCF files.

The VCF-VQSR-normalizedQC step left-normalizes indels and splits multi-allelic variants into several biallelic variants. The QC step requires DP >= 10 and GQ >= 20 for each called genotype; genotypes failing these thresholds are set to missing. It also requires the allele fraction of each heterozygous genotype to be between 0.2 and 0.8.
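The genotype-level QC above can be sketched as follows. This is an illustrative sketch only, not the pipeline's implementation (which operates on VCF/GDS files); the function name and argument layout are hypothetical.

```python
def qc_genotype(gt, dp, gq, ad):
    """Apply the genotype-level QC described above.

    gt: genotype as a pair of allele indices, e.g. (0, 1) for a het call
    dp: read depth; gq: genotype quality
    ad: per-allele read depths, e.g. (ref_depth, alt_depth)
    Returns the genotype, or None (set to missing) if it fails QC.
    """
    # Called genotypes must have DP >= 10 and GQ >= 20
    if dp < 10 or gq < 20:
        return None
    # Heterozygous genotypes must have an allele fraction in [0.2, 0.8]
    if gt in ((0, 1), (1, 0)):
        af = ad[1] / (ad[0] + ad[1])
        if not 0.2 <= af <= 0.8:
            return None
    return gt
```

For example, a heterozygous call with DP = 30, GQ = 50 and balanced allele depths passes, while the same call with allele depths (28, 2) is set to missing because its allele fraction is below 0.2.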

Annotation

Output files
  • annotation/
    • *.annotated.vcf.gz: The annotated VCF files.
    • *.annotated.vcf.gz.tbi: Index files for the annotated VCF files.
    • *.annotated.vcf.gz.gds: GDS format files for the annotated VCF files.

We provide a script to annotate a VCF file using ANNOVAR. Here we only add the REVEL annotation in addition to the built-in gene annotations; many other annotations can be added, see the commented code in the script utilities/annotate.sh. It is also possible to use VEP for annotation and to define custom R functions that restrict the analysis to the variant set of interest. See the help of the CoCoRV R package for predefined variant sets based on ANNOVAR.

gnomADPosition

Output files
  • gnomADPosition/
    • *.extracted.vcf.gz: The variants extracted for each chromosome, used by the gnomAD ethnicity classifier.
    • all.extracted.vcf.gz: All extracted variants used by the gnomAD ethnicity classifier.
    • PC.population.output.gz: PCA scores generated by the gnomAD classifier for each case sample.
    • casePopulation.txt: The final ethnicity assigned to each case sample.

Here we use the PCA variant loadings and the random forest classifier model from gnomAD (v2 or v4) to predict the ethnicity of each sample.
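The projection step can be sketched as below. This is a toy illustration under stated assumptions: the function names are hypothetical, and a nearest-centroid rule stands in for gnomAD's actual random forest classifier.

```python
def project_onto_loadings(dosages, loadings):
    """Project sample genotype dosages onto precomputed PCA variant loadings.

    dosages: per-sample lists of alt-allele dosages, aligned to the
             variants in the *.extracted.vcf.gz files
    loadings: per-variant lists of PC loadings (n_variants x n_pcs)
    Returns per-sample PC scores (n_samples x n_pcs).
    """
    n_pcs = len(loadings[0])
    return [[sum(d * loadings[v][p] for v, d in enumerate(sample))
             for p in range(n_pcs)]
            for sample in dosages]

def assign_population(scores, centroids):
    """Stand-in classifier: assign each sample to the population whose
    centroid is nearest in PC space (gnomAD itself uses a random forest
    trained on labelled reference samples)."""
    def dist2(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return [min(centroids, key=lambda pop: dist2(s, centroids[pop]))
            for s in scores]
```

With identity loadings over two variants, a sample with dosages [2, 0] scores [2, 0] in PC space and is assigned to whichever population centroid lies closest.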

CoCoRV

Output files
  • CoCoRV/
    • association.tsv: The result file containing the association test results.
    • association.tsv.dominant.nRep1000.fdr.tsv: The result file containing FDR calculation.
    • association.tsv.dominant.nRep1000.pdf: The QQ plot file. If the option --nullBoxplot is used, a boxplot of the lambda values under the null of no association is added on the right side.
    • kept.variants.case.txt: The variants in each gene in cases after consistent QC and filtering.
    • kept.variants.control.txt: The variants in each gene in controls after consistent QC and filtering.
    • top*.association.tsv.case.variants.tsv: The variants and sample lists for the top K genes, along with the annotations for each variant.
  • association file: This is the main association test result file. DOM denotes the dominant model, REC the recessive model, and 2HETS the double heterozygous model. For each model, the columns are grouped by ethnicity, with the same number of columns per group. The following shows the result of the dominant model and the columns corresponding to NFE (non-Finnish European):
    • gene: the gene ID
    • P_DOM: the p-value of the dominant model
    • OR_DOM: the odds ratio of the dominant model
    • caseWMutation_NFE_DOM: the number of samples with the variants of interest in cases
    • caseWOMutation_NFE_DOM: the number of samples without the variants of interest in cases
    • controlWMutation_NFE_DOM: the number of samples with the variants of interest in controls
    • controlWOMutation_NFE_DOM: the number of samples without the variants of interest in controls
    • other columns: columns matching the name pattern “caseEstimated*” are used for debugging and comparison and can be ignored in practice
  • fdr file: It adds FDRs calculated using five different methods:
    • RBH_point_estimate: the resampling-based method using the point estimate
    • RBH_upper_limit: the resampling-based method using the upper-limit estimate
    • DBH: discrete count based FDR calculated using the R package discreteMTP
    • DBH.sd, ADBH.sd: discrete count based FDRs calculated using the R package DiscreteFDR. These five FDR estimates are often similar; DBH.sd and ADBH.sd seem to be good choices.
  • QQ plot and lambda estimate: The file containing the QQ plot. If the option --nullBoxplot is used, a boxplot of the lambda values under the null of no association is added on the right side.
  • variants after QC in cases: The variants in each gene in cases after consistent QC and filtering.
  • variants after QC in controls: The variants in each gene in controls after consistent QC and filtering.
  • variants-samples list for top K genes: If some genes look interesting, it is worth further checking whether the variants driving the association are high-quality and whether there is any obvious confounding due to ancestry/ethnicity. The script utilities/postCheckCoCoRV.sh can help extract the relevant QC and annotation information for this check.
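The dominant-model counts and odds ratio reported in the association file can be illustrated with a small sketch. This is a toy example under stated assumptions: the function names are hypothetical, a sample "with mutation" is taken to be any carrier of a qualifying alternate allele, and a plain two-sided Fisher's exact test stands in for CoCoRV's actual ethnicity-stratified testing.

```python
from math import comb

def dominant_counts(case_dosages, control_dosages):
    """Collapse per-sample qualifying-allele dosages into the dominant-model
    2x2 table (caseWMutation, caseWOMutation, controlWMutation,
    controlWOMutation): a sample counts as 'with mutation' if dosage > 0."""
    a = sum(1 for g in case_dosages if g > 0)
    b = len(case_dosages) - a
    c = sum(1 for g in control_dosages if g > 0)
    d = len(control_dosages) - c
    return a, b, c, d

def odds_ratio(a, b, c, d, haldane=True):
    """Odds ratio of the 2x2 table; the Haldane correction (add 0.5 to every
    cell) avoids division by zero when any cell is empty."""
    if haldane and 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table, summing the
    hypergeometric probabilities of all tables (with the same margins)
    that are no more likely than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def p_table(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)
```

For example, with dosages [1, 1, 1, 0] in cases and [1, 0, 0, 0] in controls the table is (3, 1, 1, 3), giving an odds ratio of 9 but, at this tiny sample size, an unremarkable p-value.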

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating reports relevant to the running and execution of the pipeline. These allow you to troubleshoot errors and also provide other information such as launch commands, run times and resource usage.