Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  • FastQC - Raw read QC
  • MultiQC - Aggregate report describing results and QC from the whole pipeline
  • Pipeline information - Report metrics generated during the workflow execution

FastQC

Output files
  • preprocessing/
    • fastqc_raw/
      • *_fastqc.html: FastQC report containing quality metrics of input FASTQs.
      • *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
    • fastqc_processed/
      • *_fastqc.html: FastQC report containing quality metrics of pipeline preprocessed FASTQs.
      • *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Reference Indexing

Output files
  • reference/
    • <reference_id>/
      • *.{fasta,fna,fa}: Uncompressed input FASTA file (if supplied to the pipeline gzipped).
      • *.fasta.fai: SAMtools index file.
      • *.dict: picard CreateSequenceDictionary index file.
      • bwa/: Only present if BWA aligner selected.
        • *.fasta.{amb,ann,bwt,pac,sa}: BWA aligner(s) reference index files from bwa index.
      • bowtie2/: Only present if Bowtie2 aligner selected.
        • *.bt2: Bowtie2 aligner reference index files from bowtie2-build.

Depending on what is supplied by the user, and if --save_reference is supplied, this directory will contain various reference index files based on the reference FASTA file given to --fasta.

It is highly recommended to move these files to a central location or cache directory on your machine to facilitate re-use of the indices across different pipeline runs. In many cases indexing the reference genome for alignment can be the longest step of a pipeline run, so re-using indices in future runs (supplied to the pipeline with flags such as --fasta_fai, --fasta_dict, etc. or added to the reference sheet provided to --fasta) can greatly speed up analyses on other samples.

Reference Elongation

Output files
  • reference/
    • <reference_id>_<elongation_factor>/
      • *.{fasta,fna,fa}: Uncompressed input FASTA file (if supplied to the pipeline gzipped).
      • bwa/:
        • *.fasta.{amb,ann,bwt,pac,sa}: BWA aligner(s) reference index files from bwa index.

Mapping with circularmapper requires an elongated reference built by CircularMapper/CircularGenerator. CircularGenerator elongates the contig specified with --fasta_circular_target of a supplied reference genome FASTA by the number of base pairs specified in --fasta_circularmapper_elongationfactor.

Depending on what is supplied by the user, and if --save_reference is supplied, this directory will contain the elongated reference fasta, as well as its corresponding bwa reference index files.

It is highly recommended to move these files to a central location or cache directory on your machine to facilitate re-use of the indices across different pipeline runs. In many cases indexing the reference genome for alignment can be the longest step of a pipeline run, so re-using indices in future runs (supplied to the pipeline with flags such as --fasta_circularmapper_elongatedfasta, --fasta_circularmapper_elongatedindex, etc. or added to the reference sheet provided to --fasta) can greatly speed up analyses on other samples.

Preprocessing

Falco

Output files
  • preprocessing/
    • falco_raw/
      • *_falco.{html,txt}: Falco report containing quality metrics of input FASTQs.
    • falco_processed/
      • *_falco.{html,txt}: Falco report containing quality metrics of pipeline preprocessed FASTQs.

Falco is an emulation and drop-in replacement of the popular FastQC software to check large sequencing reads for common problems. It is technically more performant, and can be useful in cases where you have large data or issues with FastQC’s Java memory allocation. For further reading and documentation see the Falco GitHub repository.

Falco results in the falco_raw/ directory contain the results of running Falco on the ‘raw’ input files (with no processing by the pipeline). The falco_processed/ directory is for reads run through either fastp or AdapterRemoval preprocessing, and will therefore only be present if preprocessing is executed. Falco results are provided in both .html (the report) and .txt (raw data) formats, and should provide the same information as FastQC.

AdapterRemoval

Output files
  • preprocessing/
    • *.settings: Log file of the AdapterRemoval execution containing trimming, merging, and other read summary statistics.
    • *.truncated.fastq.gz: Final FASTQ file of single-end data that have undergone quality trimming. These are the final reads used downstream for single-end data.
    • *.merged.fastq.gz: Final FASTQ file of paired-end data that have undergone quality trimming and merging. Typically a concatenated file of the *collapsed* files. May or may not include unmerged reads (singletons), depending on your parameters. These are the final reads used downstream for paired-end data.
    • *.pair{1,2}.truncated.fastq.gz: Final FASTQ file(s) containing paired-end reads that have undergone additional quality trimming but not merging. These are the final reads used downstream for paired-end data when merging is skipped. Only present for paired-end libraries.
    • *_collapsed.fastq.gz: FASTQ file containing merged reads of paired-end libraries that have not had trimming performed.
    • *_collapsed.truncated.fastq.gz: FASTQ file containing merged reads that have had additional quality trimming performed.
    • *_discarded.fastq.gz: FASTQ file containing merged reads that did not pass other quality thresholds (e.g. minimum length).
    • *.truncated.fastq.gz: FASTQ file(s) containing unmerged reads that have undergone additional quality trimming. Only present for paired-end libraries.

AdapterRemoval removes residual adapter sequences from single-end (SE) or paired-end (PE) FASTQ reads, optionally trimming Ns and low-quality bases and/or collapsing overlapping paired-end mates into one read.

The resulting FASTQ files will only be present in your results directory if you run with --preprocessing_savepreprocessedreads. The .settings log files should always be present if AdapterRemoval has been run.

fastp

Output files
  • preprocessing/

    • *.{log,html,json}: Log files of the fastp execution containing trimming, merging, and other read summary statistics. The three files contain the same information stored in different formats.
    • *.merged.fastq.gz: Final FASTQ file of paired-end data that have undergone quality trimming and merging. May or may not include unmerged reads (singletons), depending on your parameters. These are the final reads used downstream for paired-end data.
    • *.fastp.fastq.gz: Final FASTQ file of single-end data that have undergone quality trimming. These are the final reads used downstream for single-end data.
    • *_{1,2}.fastp.fastq.gz: Final FASTQ file(s) containing paired-end reads that have undergone additional quality trimming but not merging. These are the final reads used downstream for paired-end data when merging is skipped. Only present for paired-end libraries.

fastp is an all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging etc.).

The resulting FASTQ files will only be present in your results directory if you run with --preprocessing_savepreprocessedreads. The various log files should always be present if fastp has been run.

Mapping

BWA

Output files
  • mapping/bwa{aln,mem}/

    • *.bam: Sorted reads aligned against a reference genome in BAM format with no additional filtering.
    • *.{bai,csi}: Index file corresponding to a BAM file, used for faster access in downstream steps (e.g. by SAMtools).
    • *.flagstat: Statistics of aligned reads from SAMtools flagstat.

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. BWA-backtrack (a.k.a. bwa aln) is designed for Illumina sequence reads up to 100bp, while BWA-MEM (bwa mem) is optimised for longer sequences ranging from 70bp to 1Mbp and supports split alignment.

Bowtie 2

Output files
  • mapping/bowtie2/

    • *.bam: Sorted reads aligned against a reference genome in BAM format with no additional filtering.
    • *.{bai,csi}: Index file corresponding to a BAM file, used for faster access in downstream steps (e.g. by SAMtools).
    • *.flagstat: Statistics of aligned reads from SAMtools flagstat.

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small and supports gapped, local, and paired-end alignment modes.

CircularMapper

Output files
  • mapping/circularmapper/

    • *.bam: Sorted reads aligned against an elongated reference genome in BAM format with no additional filtering.
    • *.{bai,csi}: Index file corresponding to a BAM file, used for faster access in downstream steps (e.g. by SAMtools).
    • *.flagstat: Statistics of aligned reads from SAMtools flagstat.

CircularMapper RealignSAMFile is an extension to bwa aln for realigning reads mapped to circularised contigs. First, an elongated/circularised reference is built using CircularGenerator, then reads are mapped to this reference using bwa aln. The resulting BAM file is then realigned using CircularMapper RealignSAMFile. The reference coordinates of this BAM file have been adjusted to those of the original reference genome (prior to elongation).

Host Removal

Output files
  • host_removal/

    • *.fq.gz: FASTQ files containing only reads that did not map to the reference genome.

The nf-core/eager host removal step removes any reads from the original input FASTQ files that mapped to the provided reference. This step is performed by a custom python script extract_map_reads.py.

The resulting FASTQ files can be used for data sharing purposes where the host DNA should not be made publicly available, either for ethical reasons (for example, when you do not have permission to analyse the host DNA and only analysis of the metagenomic content was agreed) or for identifiability reasons (for example, for individuals who died in recent times and could be identified based on their DNA alone).

Alternatively, you could use the resulting files for manual metagenomic screening outside of nf-core/eager (e.g. when your preferred taxonomic classifier isn’t supported by nf-core/eager).

Mapping statistics

EndorSpy

Output files
  • endorspy/

    • *.json: JSON file per sample containing the calculated percent on target, clonality, and percent duplicates

endorS.py calculates percent on target (aka Endogenous DNA) from samtools flagstat files and prints to screen. The ‘Percent on target (%)’ reported will be different depending on the combination of samtools flagstat inputs provided. This program also calculates clonality (aka ‘cluster factor’) and percent duplicates when the flagstat file after duplicate removal is provided.

Definitions and calculations:

We provide up to 3 different estimates of percent on target:

Percent on target (%): the percentage of reads mapping to a specific reference. It is calculated from the number of reads mapping to that reference (Mapped_Raw) relative to the total number of reads (Total_Reads):

$$PercentOnTarget = \frac{MappedRaw}{TotalReads} \times 100$$

where MappedRaw is the number of reads mapping to the reference during the mapping step, and TotalReads is the total number of reads that went into the mapping step.

Percent on target modified (%): the percentage of reads that map to a specific reference after filtering the raw BAM by quality, read length, or any other applied filter:

$$PercentOnTargetModified = \frac{MappedPostFiltering}{TotalReads} \times 100$$

where MappedPostFiltering is the number of reads mapping to the reference after applying the filters.

Percent on target postdedup (%): percent of deduplicated reads (either raw mapped reads or filtered mapped reads) that mapped to a specific reference:

$$PercentOnTargetPostdedup = \frac{MappedPostDedup}{TotalReads} \times 100$$

Additionally, this script provides two different ways of estimating library complexity:

Clonality (Cluster Factor in eager1): the ratio of total mapped reads before deduplication to unique mapped reads after deduplication:

$$Clonality = \frac{TotalReadsPreDedup}{MappedReadsDedup}$$

Percent Duplicates (%): percent of mapping reads that have at least 1 duplicate. It is calculated as follows:

$$PercentDuplicates = \frac{TotalReadsPreDedup - MappedReadsDedup}{TotalReadsPreDedup} \times 100$$

The combination of statistics calculated varies depending on the steps taken:

  • Mapping/BAM input + filtering + deduplication (all): Percent on target (%), Percent on target modified (%), Percent on target postdedup (%), Clonality, Percent Duplicates (%)
  • Mapping/BAM input: Percent on target (%)
  • Mapping/BAM input + filtering: Percent on target (%), Percent on target modified (%)
  • Mapping/BAM input + deduplication: Percent on target (%), Percent on target postdedup (%), Clonality, Percent Duplicates (%)

⚠️ Warning: When BAM input is provided, please keep in mind it is assumed to be an unfiltered BAM. If you provide an already-filtered BAM, the percent on target calculations will be wrong, since the original total number of reads in the initial FASTQ/BAM cannot be inferred.
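As a concrete illustration, the sketch below recomputes these metrics in Python from flagstat-derived counts. The counts and function names are invented for the example; this is not the endorS.py implementation itself.

```python
# Illustrative recomputation of the endorS.py-style metrics from
# samtools flagstat counts. All counts below are made up.

def percent_on_target(mapped: int, total_reads: int) -> float:
    """Percent of all input reads that mapped to the reference."""
    return mapped / total_reads * 100

def clonality(total_pre_dedup: int, mapped_dedup: int) -> float:
    """Ratio of mapped reads before deduplication to unique reads after."""
    return total_pre_dedup / mapped_dedup

def percent_duplicates(total_pre_dedup: int, mapped_dedup: int) -> float:
    """Percent of mapped reads removed as duplicates."""
    return (total_pre_dedup - mapped_dedup) / total_pre_dedup * 100

total_reads = 1_000_000  # reads that went into the mapping step
mapped_raw = 150_000     # mapped reads in the raw BAM
mapped_dedup = 120_000   # mapped reads after deduplication

print(f"Percent on target: {percent_on_target(mapped_raw, total_reads):.2f}%")     # 15.00%
print(f"Clonality: {clonality(mapped_raw, mapped_dedup):.2f}")                     # 1.25
print(f"Percent duplicates: {percent_duplicates(mapped_raw, mapped_dedup):.2f}%")  # 20.00%
```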

BAM Filtering

Output files
  • bam_filtering/

    • *length.filtered.bam: BAM file containing length-filtered mapped and unmapped reads.
    • *.filtered.bam: BAM file containing mapped quality filtered reads (and optionally length filtering and unmapped reads, if specified by the user with the corresponding parameters).
    • *.{bai,csi}: Corresponding index files of any generated BAM files.
    • *.unmapped_other_{1,2,other,singleton}.fastq.gz: FASTQ file containing only unmapped reads, if specified for generation, without length filtering.
      • If you do read merging or have single-end data you will only get one file (*_other*) containing all reads (the others will be empty).
      • If you skip merging for paired-end data you’ll get reads separately (_1, _2, _singleton), with other being empty.
    • *mapped_{1,2,other,singleton}.fastq.gz: FASTQ file containing only mapped reads, if specified for generation, without length or quality filtering. If you do read merging you will only get one file (*_other*) containing all reads. If you skip merging you’ll get them separately (_1, _2 etc.).
    • *filtered.flagstat: Statistics of aligned reads from SAMtools flagstat.

nf-core/eager can perform a range of different length, quality, and/or mapped-unmapped filtering of BAM files, and generate converted FASTQ files for manual analysis outside the pipeline.

If BAM filtering is turned on, by default only mapped reads are retained for downstream genomic analysis. Additionally, please note that removal of PCR duplicates in nf-core/eager will also filter for mapped reads only.

Please be aware that intermediate length- and mapping-quality-filtered genomic BAM files are not saved in the results directory automatically, under the assumption that downstream deduplicated BAM files are preferred for further analyses - see the parameters page for more information if you wish to save these.

You may also receive the files above if metagenomic screening is turned on.

Metagenomics Complexity Filtering

BBDuk

BBDuk “Duk” stands for Decontamination Using Kmers. BBDuk was developed to combine most common data-quality-related trimming, filtering, and masking operations into a single high-performance tool.

Output files
  • metagenomics/complexity_filter/bbduk/

    • *_complexity.fastq.gz: FASTQ file containing the complexity-filtered reads
    • *.log: Log file containing the filtering statistics

The entropy filter of BBDuk is used to remove reads from the FASTQ files for metagenomics screening that don’t pass the --metagenomics_complexity_entropy threshold. When omitting the flag, a default value of 0.3 is applied. To cite the docs:

A homopolymer such as AAAAAAAAAAAAAA would have entropy of zero; completely random sequence would have entropy approaching 1.

Using complexity-filtered fastq-files as input for metagenomic classifiers can reduce the number of false positive classifications, resulting in more precise taxonomic assignments of the sample through removal of reads that can align equally well to multiple reference genomes. Save the complexity-filtered fastq-files to the output directory to perform additional downstream analyses, such as testing multiple metagenomic profilers (see the nf-core/taxprofiler pipeline).
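To build intuition for what the entropy threshold measures, here is a minimal Python sketch of a normalised k-mer entropy calculation. The k-mer size and normalisation are assumptions for illustration only; BBDuk's actual windowed implementation may differ.

```python
import math
from collections import Counter

def sequence_entropy(seq: str, k: int = 5) -> float:
    """Shannon entropy of k-mer frequencies, normalised to [0, 1]."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    n = len(kmers)
    counts = Counter(kmers)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    max_entropy = math.log2(n) if n > 1 else 1.0  # all k-mers distinct
    return entropy / max_entropy

print(sequence_entropy("A" * 50))  # homopolymer -> 0.0
print(sequence_entropy("ATCGGCTAAGCTCGTATAGCGGATCCATTGCAGACTCGGAACGTATCGAT"))  # close to 1
```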

Note: To save output files, set the --metagenomics_complexity_savefastq flag

PRINSEQ++

PRINSEQ++ is a C++ implementation of the prinseq-lite.pl program. It can be used to filter, reformat or trim genomic and metagenomic sequence data.

Output files
  • metagenomics/complexity_filter/prinseq/

    • *_complexity_good_out.fastq.gz: FASTQ file containing the complexity-filtered reads
    • *_complexity.log: Log file containing the filtering statistics

PRINSEQ++ is an alternative to BBDuk for filtering the FASTQ files before metagenomic classification. From PRINSEQ++, nf-core/eager implements filtering by the DUST algorithm or by entropy, as explained above in the BBDuk section.

Using complexity-filtered fastq-files as input for metagenomic classifiers can reduce the number of false positive classifications, resulting in more precise taxonomic assignments of the sample. Save the complexity-filtered fastq-files to the output directory to perform additional downstream analyses, such as testing multiple metagenomic profilers (see the nf-core/taxprofiler pipeline).

The saved files are the ‘good’ files, i.e. those passing the specified DUST or entropy filter threshold. The logs contain information about the number of reads filtered.

Note: To save output files, set the --metagenomics_complexity_savefastq flag

Metagenomics Profiling

MALT

MALT is a fast replacement for BLASTX, BLASTP and BLASTN, and provides both local and semi-global alignment capabilities.

Output files
  • metagenomics/profiling/malt/
    • <sample_id>.rma6: binary file containing all alignments and taxonomic information of hits that can be loaded into the MEGAN6 interactive viewer
    • <sample_id>.blastn.sam: sparse SAM file containing alignments of each hit (if --metagenomics_malt_savereads)
    • *.log: Log files of the MALT execution. NOTE: If you are running parallel MALT runs with --metagenomics_malt_group_size set above 0, you will obtain a log file named <database>_<strandedness>_<group_id>-run-malt-run.log for each group of the parallel executions. The <database>_runtime_log_concatenated.log file contains the concatenated logs of all the groups.

MALT is a metagenomic aligner (equivalent to BLAST, but much faster). It produces direct alignments of sequencing reads against a reference database. It is often used for metagenomic profiling or pathogen screening, and in nf-core/eager specifically for off-target reads from genome mapping. It is popular among palaeogenomicists, as the alignment information can be used for damage-pattern and other authentication-criteria analyses.

You will receive output for each library. This means that if you use TSV input and have one library sequenced over multiple lanes and sequencing types, these are merged and you will get mapping statistics of all lanes and sequencing configurations in one value.

The main output of MALT is the .rma6 file format, which can only be loaded into MEGAN and its related tools. The RMA6 file is further processed by the taxpasta module to provide a standardised tabular output for the MEGAN classifications.

You will only receive the .sam files if you supply the --metagenomics_malt_savereads parameter to the pipeline.

MetaPhlAn

MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution via marker genes.

Output files
  • metagenomics/profiling/metaphlan/
    • <sample_id>.biom: taxonomic profile in BIOM format
    • <sample_id>.bowtie2out.txt: Bowtie2 alignment information (can be re-used for skipping alignment when re-running MetaPhlAn with different parameters)
    • <sample_id>_profile.txt: MetaPhlAn taxonomic profile including abundance estimates

The main taxonomic profiling file from MetaPhlAn is the *_profile.txt file. This provides the abundance estimates from MetaPhlAn; however, it does not include raw counts by default. The profile is further processed by the taxpasta module to provide a standardised tabular output of the MetaPhlAn abundance estimates.

Raw counts can be inferred from the Bowtie2 output .bowtie2out.txt, which presents a condensed representation of the mapping results of the sequencing reads to MetaPhlAn’s marker gene sequences. The alignments are listed in tab-separated columns, including Read ID and Marker Gene ID, with each alignment represented on a separate line. No hits to the marker genes result in an empty file.
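For example, a short Python sketch like the following could tally reads per marker gene from a .bowtie2out.txt file (file name illustrative; column layout as described above):

```python
from collections import Counter

# Tally reads per MetaPhlAn marker gene from a .bowtie2out.txt file.
# Assumes tab-separated lines of (read ID, marker gene ID).
marker_counts = Counter()
with open("sample.bowtie2out.txt") as handle:
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            marker_counts[fields[1]] += 1

for marker, count in marker_counts.most_common(10):
    print(f"{marker}\t{count}")
```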

Kraken2

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

Output files
  • metagenomics/profiling/kraken2/
    • <sample_id>.classified.fastq.gz: FASTQ file containing all reads that had a hit against a reference in the database for a given sample
    • <sample_id>.unclassified.fastq.gz: FASTQ file containing all reads that did not have a hit in the database for a given sample
    • <sample_id>.report.txt: A Kraken2 report that summarises the fractional abundance, taxonomic ID, number of k-mers, and taxonomic path of all the hits in the Kraken2 run for a given sample. Will be 8 columns rather than 6 if --metagenomics_kraken2_saveminimizers is specified.
    • <sample_id>.classifiedreads.txt: A list of read IDs and the hits each read had against each database for a given sample

The main taxonomic classification file from Kraken2 is the _combined_reports.txt or *report.txt file. The former provides the broadest overview of the taxonomic classification results across all samples against a single database: you get two columns for each sample (e.g. 2_all and 2_lvl), as well as summary columns summing up across all samples (tot_all and tot_lvl). The latter gives you the most information for a single sample. The report file is also used for the taxpasta step.

You will only receive the .fastq.gz and *.classifiedreads.txt files if you supply the --metagenomics_kraken2_savereads and/or --metagenomics_kraken2_savereadclassifications parameters to the pipeline.
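If you want to work with the per-sample report programmatically, a sketch like the following parses the standard 6-column layout (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name); the file name is illustrative:

```python
import csv

# Print species-level hits from a standard 6-column Kraken2 report.
with open("sample.report.txt") as handle:
    for row in csv.reader(handle, delimiter="\t"):
        percent, clade_reads, direct_reads, rank, taxid, name = row[:6]
        if rank == "S" and int(clade_reads) > 0:
            print(f"{name.strip()}\ttaxid={taxid}\treads={clade_reads}")
```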

KrakenUniq

KrakenUniq (formerly KrakenHLL) is an extension to the fast k-mer-based classifier Kraken, with an efficient algorithm for additionally assessing the coverage of unique k-mers found in each species in a dataset.

Output files
  • metagenomics/profiling/krakenuniq/
    • <sample_id>.classified.fastq.gz: FASTQ file containing all reads that had a hit against a reference in the database for a given sample
    • <sample_id>.unclassified.fastq.gz: FASTQ file containing all reads that did not have a hit in the database for a given sample
    • <sample_id>.report.txt: A Kraken2-style report that summarises the fractional abundance, taxonomic ID, number of k-mers, and taxonomic path of all the hits, with an additional column for k-mer coverage that allows more accurate distinguishing between false-positive and true-positive hits
    • <sample_id>.classifiedreads.txt: A list of read IDs and the hits each read had against each database for a given sample

The main taxonomic classification file from KrakenUniq is the *report.txt file. This is an extension of the Kraken2 report with the additional k-mer coverage information that provides more information about the accuracy of hits.

You will only receive the *.fastq.gz and *.classifiedreads.txt file if you supply --metagenomics_kraken2_savereads and/or --metagenomics_kraken2_savereadclassifications parameters to the pipeline.

Info

The output system of KrakenUniq can result in other stdout or stderr logging information being saved in the report file, therefore you must check your report files before downstream use!

Metagenomics Postprocessing

taxpasta

The output created by the taxpasta merge or taxpasta standardise command. It combines the results of all the samples analyzed with a given metagenomic classifier by nf-core/eager in a standardised tabular taxon-table format. The file provides an overview of the classification results for all samples combined.

Output files
  • metagenomics/postprocessing/taxpasta/
    • {metaphlan,krakenuniq,kraken2,megan6}_taxpasta_table.tsv

maltExtract

The output directory for MaltExtract, as implemented in HOPS, which applies various heuristics of ancient authenticity and presence to MEGAN read assignments across a given set of candidate taxa.

Output files
  • metagenomics/postprocessing/maltextract/
    • results: Results output by maltextract
      • default: Directory containing summary TSV tables for all reads
      • ancient: Directory containing summary TSV tables for reads with evidence of aDNA damage
      • pdf_candidate_profiles: Directory containing directories for each candidate taxon with varying levels of support for presence of genetic material for a given sample
      • count_table.tsv: TSV containing reads assigned to each node in candidate taxon list across all samples

The main files of interest are within the pdf_candidate_profiles directory. The file prefixes indicate increasing levels of confidence in a given sample, with stp1 being less confidently ancient and present than stp2, and stp2 less than stp3. Results are highly dependent upon the taxon being analyzed, as different microbial genera are more liable to cross-mapping and contamination than others.

Deduplication

Output files
  • deduplication/

    • *.dedupped.bam: Unique reads aligned to a reference genome in BAM format.
    • *.dedupped.bam.{bai,csi}: Index file corresponding to the BAM file.
    • *.dedupped.flagstat: Statistics of aligned reads from SAMtools flagstat, after removal of PCR duplicates.

Deduplication is carried out by one of two possible tools, as described below. However, the expected output files should be the same.

picard MarkDuplicates

Picard is a toolkit for general BAM file manipulation with many different functions. nf-core/eager most visibly uses the MarkDuplicates tool for the removal of exact PCR duplicates, which can occur during library amplification and result in falsely inflated coverages (and overly confident genotyping calls).

DeDup

DeDup is a deduplication tool designed for merged reads from paired-end sequencing data in BAM files.

Preseq

Output files
  • mapping/

    • *.{c_curve,lc_extrap}.txt: A two column text file with the first column representing sequencing depth and the second an estimate of unique reads.

Preseq is a collection of tools for assessing library complexity, where complexity means the number of unique molecules in your library (i.e. excluding additional copies of molecules with the exact same length and sequence). We use two of its algorithms: c_curve and lc_extrap. The former gives you the expected number of unique reads if you were to repeat sequencing with fewer reads than your first sequencing run. The latter extrapolates the number of unique reads you would get if you were to re-sequence with more reads than your initial sequencing run.

The resulting file contains estimated deduplication statistics at different theoretical sequencing depths, and can be used to generate a complexity curve estimating the number of unique reads that would be yielded if the library were re-sequenced.

These curves will be displayed in the pipeline run’s MultiQC report, but you can also use this file to plot the curve yourself for further exploration, e.g. in R or Python (see the sketch below), to find your target sequencing depth.
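As a starting point, here is a minimal Python sketch for plotting an lc_extrap output. The file name is illustrative; lc_extrap output has a header line plus confidence-interval columns, of which we use the first two.

```python
import matplotlib.pyplot as plt

# Plot a preseq lc_extrap complexity curve: total reads sequenced vs.
# expected number of distinct reads.
depths, unique = [], []
with open("sample.lc_extrap.txt") as handle:
    next(handle)  # skip the header line
    for line in handle:
        fields = line.split()
        depths.append(float(fields[0]))
        unique.append(float(fields[1]))

plt.plot(depths, unique)
plt.xlabel("Total reads sequenced")
plt.ylabel("Expected distinct reads")
plt.title("Library complexity (preseq lc_extrap)")
plt.savefig("complexity_curve.png", dpi=150)
```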

Mapping Statistics

QualiMap

Output files
  • qualimap/

    • <sample_id>/
      • *.html: in-depth report including percent coverage, depth coverage, GC content, etc. of mapped reads
      • genome_results.txt: plain-text version of the summary statistics from the HTML report
    • css/: HTML CSS styling used for the report
    • images_qualimapReport/: PNG version of images from the HTML report.
    • raw_data_qualimapReport/: The raw data used to render the HTML report as TXT files.

QualiMap is a tool which provides statistics on the quality of the mapping of your reads to your reference genome. It allows you to assess how well covered your reference genome is by your data, both in ‘fold’ depth (average number of times a given base on the reference is covered by a read) and ‘percentage’ (the percentage of all bases on the reference genome that is covered at a given fold depth). These outputs allow you to decide whether you have enough quality data for downstream applications like genotyping, and how to adjust the parameters for those tools accordingly.

NB: Neither fold coverage nor percent coverage on its own is sufficient to assess whether you have a high-quality mapping. Abnormally high fold coverage of a smaller region, such as highly conserved genes or reads with un-removed adapters mapping to the reference, can artificially inflate the mean coverage, yet a high percent coverage is not useful if all bases of the genome are covered at just 1x. The toy example below illustrates the two measures.
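The following Python snippet computes both measures on a made-up per-base depth list (e.g. as produced by samtools depth); values are invented for illustration.

```python
# Fold coverage (mean depth) vs. percent coverage (breadth) on a toy
# 10 bp "genome".
depths = [0, 0, 3, 5, 2, 0, 1, 4, 0, 2]  # depth at each reference position

genome_length = len(depths)
fold_coverage = sum(depths) / genome_length                              # 1.70x
percent_coverage_1x = sum(d >= 1 for d in depths) / genome_length * 100  # 60.0%

print(f"Fold coverage: {fold_coverage:.2f}x")
print(f"Percent coverage (>=1x): {percent_coverage_1x:.1f}%")
```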

Note that many of the statistics from this module are displayed in the General Stats table, as they represent single values that are not plottable.

You will receive output for each sample. This means you will get statistics of deduplicated reads from all types of libraries combined in a single value (i.e. non-UDG treated, full-UDG, paired-end, and single-end all together).

⚠️ Warning: If your library has no reads mapping to the reference, this will result in an empty BAM file. Qualimap will therefore not produce any output even if a BAM exists!

Bedtools

Output files
  • mapstats/bedtools/

    • *.breadth.gz: This file contains the contents of your annotation file (e.g. BED/GFF), with the following additional columns: number of reads on the feature, number of bases at depth, length of the feature, and fraction of the feature covered.
    • *.depth.gz: This file contains the contents of your annotation file (e.g. BED/GFF), with an additional column giving the mean depth of coverage (i.e. the average number of reads covering each position).

bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. Bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.

The bedtools coverage tool computes both the depth and breadth of coverage of features in file B (the alignment file) on the features in file A (provided by --mapstats_bedtools_featurefile when running the eager workflow). One advantage of bedtools coverage is that it not only counts the number of features that overlap an interval in file A, but also computes the fraction of bases in the interval of A that were overlapped by one or more features. Thus, bedtools coverage also computes the breadth of coverage observed for each interval in A.

The output from this module can be useful for things such as checking for the presence/absence of virulence factors in ancient pathogen genomes, or getting statistics on SNP capture positions.
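For downstream inspection, a Python sketch along these lines could flag poorly covered features in a *.breadth.gz file. It assumes a 3-column BED annotation, with the last four columns being the bedtools coverage columns described above; adjust the offsets for GFF or wider BED files.

```python
import gzip

# Report features where less than half of the bases are covered.
with gzip.open("sample.breadth.gz", "rt") as handle:
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], fields[1], fields[2]
        n_reads, bases_covered, feat_len, fraction = fields[-4:]
        if float(fraction) < 0.5:
            print(f"{chrom}:{start}-{end}\treads={n_reads}\tbreadth={fraction}")
```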

Damage Manipulation

There are three different options for manipulation of ancient DNA damage.

Damage Rescaling

Output files
  • damage_manipulation/

    • *_rescaled.bam: Reads with their base qualities rescaled according to the Bayesian aDNA damage model, in BAM format.
    • *_rescaled.bam.{bai,csi}: Index file corresponding to the BAM file.
    • results_*/Stats_out_MCMC_*{.pdf,.csv}: CSV and PDF files containing information about the damage model used in rescaling.

mapDamage2 can apply a probabilistic Bayesian model to rescale quality scores of likely-damaged positions in the reads. A new BAM file is constructed by downscaling quality values for misincorporations likely due to ancient DNA damage, according to their initial qualities, position in the reads, and damage patterns. Rescaling a BAM file in this way can help reduce or remove the effects of DNA damage in downstream analyses.

Be advised that this process introduces reference bias in the resulting rescaled BAMs, because only mismatches to the reference get corrected, while true mismatches that become reference alleles due to damage are not rescaled.

This functionality does not produce any MultiQC output.

⚠️ Warning: rescaled libraries will not be merged with non-scaled libraries of the same sample for downstream genotyping, as the underlying damage model may be different for each library. If you wish to merge these, please do this manually and re-run nf-core/eager using the merged BAMs as input.

Post-Mortem-Damage (PMD) Filtering

Output files
  • damage_manipulation/

    • *_pmdfiltered.bam: Reads aligned to a reference genome that pass the post-mortem-damage threshold, in BAM format.
    • *_pmdfiltered.bam.{bai,csi}: Index file corresponding to the BAM file.
    • *_pmdfiltered.flagstat: Statistics of aligned reads from SAMtools flagstat, after filtering for post-mortem damage.

pmdtools implements a likelihood framework incorporating a postmortem damage (PMD) score, base quality scores and biological polymorphism to identify degraded DNA sequences that are unlikely to originate from modern contamination. Using the model, each sequence is assigned a PMD score, for which positive values indicate support for the sequence being genuinely ancient. By filtering a BAM file for damaged reads in this way, it is possible to ameliorate the effects of present-day contamination, and isolate sequences of likely ancient origin to be used downstream. By default, all positions are assessed for damage, but it is possible to provide a FASTA file masked for specific references (--damage_manipulation_pmdtools_masked_reference) or a BED file to mask the reference FASTA within nf-core/eager (--damage_manipulation_pmdtools_reference_mask). This can alleviate reference bias when running PMD filtering on capture data, where you might not want the allele of a SNP to be counted as damage when it is a transition.

Base Trimming

Output files
  • damage_manipulation/

    • *{_pmdfiltered,}_trimmed.bam: Reads whose ends have been trimmed to mitigate the effects of aDNA damage, in BAM format.
    • *{_pmdfiltered,}_trimmed.bam.{bai,csi}: Index file corresponding to the BAM file.

bamUtil trimBam trims the ends of reads in a SAM/BAM file, either changing the read ends to N and their qualities to !, or by soft-clipping (if the command-line option --clip is specified). By default, reverse strands are reversed before the left and right trims are applied. This means that --left actually trims from the right of the read as stored in the SAM/BAM for reverse-strand reads.

Within nf-core/eager, when BAM trimming is activated alongside PMD-filtering, trimming is performed on the PMD-filtered reads only.

Damage Estimation

DamageProfiler

Output files
  • damage_estimation/damageprofiler/: this contains sample specific directories containing raw statistics and damage plots from DamageProfiler.

    • *.pdf: can be used to visualise C to T miscoding lesions or read length distributions of your mapped reads. All raw statistics used for the PDF plots are contained in the .txt files.

DamageProfiler is a tool which calculates a variety of standard ‘aDNA’ metrics from a BAM file. The primary plots here are the misincorporation and length distribution plots. Ancient DNA undergoes depurination and hydrolysis, causing fragmentation of molecules into gradually shorter fragments, and cytosine to thymine deamination damage, that occur on the subsequent single-stranded overhangs at the ends of molecules.

mapDamage2

Output files
  • damage_estimation/mapDamage2/: this contains sample specific directories containing raw statistics and damage plots from mapDamage2.

    • Fragmisincorporation_plot.pdf: a PDF file that displays both fragmentation and misincorporation patterns.
    • Length_plot.pdf: a PDF file that displays the length distribution of singleton reads per strand, as well as cumulative frequencies of C->T at the 5’-end and G->A at the 3’-end per strand.
    • misincorporation.txt: contains a table with occurrences of each type of mutation and relative positions from the read ends.
    • 5pCtoT_freq.txt: contains frequencies of Cytosine to Thymine mutations per position from the 5’-ends.
    • 3pGtoA_freq.txt: contains frequencies of Guanine to Adenine mutations per position from the 3’-ends.
    • dnacomp.txt: contains a table of the reference genome base composition per position, inside reads and adjacent regions.
    • lgdistribution.txt: contains a table with read length distributions per strand.
    • Runtime_log.txt: log file with a summary of command lines used and timestamps.

mapDamage2 is a computational framework written in Python and R, which tracks and quantifies DNA damage patterns among ancient DNA sequencing reads generated by Next-Generation Sequencing platforms. Based on a sample bam file and a fasta reference file, the software calculates the frequency of 5’ C to T and 3’ G to A misincorporations, as well as the read length distribution per strand.
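To re-plot the damage signal yourself, here is a minimal Python sketch reading a 5pCtoT_freq.txt file, assuming one header line followed by position and frequency columns (the output file name is illustrative):

```python
import matplotlib.pyplot as plt

# Plot the 5' C->T misincorporation frequency by read position.
positions, freqs = [], []
with open("5pCtoT_freq.txt") as handle:
    next(handle)  # skip the header line
    for line in handle:
        pos, freq = line.split()[:2]
        positions.append(int(pos))
        freqs.append(float(freq))

plt.plot(positions, freqs, marker="o", color="red")
plt.xlabel("Position from 5' end of read")
plt.ylabel("C->T frequency")
plt.title("5' C->T misincorporation (aDNA damage)")
plt.savefig("ctot_damage.png", dpi=150)
```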

Contamination estimation for Nuclear DNA

ANGSD nuclear contamination

Output files
  • contamination_estimation/angsd/

    • *.txt: Text file containing the results of nuclear contamination estimation with ANGSD for each library.
    • nuclear_contamination.txt: Text file containing a summary table of contamination estimates for all libraries.
    • nuclear_contamination_mqc.json: JSON file containing a summary table of contamination estimates for all libraries.

ANGSD is software for analyzing next-generation sequencing data. Among other functions, ANGSD can estimate contamination for chromosomes present in a single copy, e.g. the X chromosome in humans with karyotype XY. To do this, we first generate a binary count file for the X chromosome (angsd) and then perform a Fisher’s exact test to obtain a p-value, plus a jackknife procedure to get an estimate of contamination (contamination). Contamination is estimated with Method of Moments (MoM) and Maximum Likelihood (ML) for both Method1 and Method2. Method1 compares the total number of minor and major reads at SNP sites with the number of minor and major reads at adjacent sites, assuming independent errors between reads and sites, while Method2 samples only one read at each site to remove that assumption. The results of all methods for each library, as well as their respective standard errors, are summarised in nuclear_contamination.txt and nuclear_contamination_mqc.json.

Sex Determination

Output files
  • sex_determination/: this contains the output for the sex determination run. This is a single .tsv file that includes a table with the sample name, the number of autosomal SNPs, the number of SNPs on the X/Y chromosome, the number of reads mapping to the autosomes, the number of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each BAM file, one row per file. If the --sexdeterrmine_bedfile option has not been provided, the error bars cannot be trusted!

Sex.DetERRmine

Sex.DetERRmine calculates the coverage of your mapped reads on the X and Y chromosomes relative to the coverage on the autosomes (X-/Y-rate). This metric can be thought of as the number of copies of chromosomes X and Y that is found within each cell, relative to the autosomal copies. The number of autosomal copies is assumed to be two, meaning that an X-rate of 1.0 means there are two X chromosomes in each cell, while 0.5 means there is a single copy of the X chromosome per cell. Human females have two copies of the X chromosome and no Y chromosome (XX), while human males have one copy of each of the X and Y chromosomes (XY).

When a bedfile of specific sites is provided, Sex.DetERRmine runs much faster and additionally calculates error bars around each relative coverage estimate. For this estimate to be trustworthy, the sites included in the bedfile should be spaced apart enough that a single sequencing read cannot overlap multiple sites. Hence, when a bedfile has not been provided, this error should be ignored. When a suitable bedfile is provided, each observation of a covered site is independent, and the error around the coverage is equal to the binomial error estimate. This error is then propagated during the calculation of relative coverage for the X and Y chromosomes.
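A toy Python illustration of a binomial error on a relative-coverage estimate follows. All counts are invented, and the simplified propagation shown here is not the Sex.DetERRmine implementation.

```python
import math

# Relative coverage on X: (reads per X SNP) / (reads per autosomal SNP).
n_autosomal_snps = 10_000
n_x_snps = 500
reads_autosomes = 25_000
reads_x = 600

x_rate = (reads_x / n_x_snps) / (reads_autosomes / n_autosomal_snps)  # 0.48

# Binomial standard error on the proportion of reads hitting X sites
# (simplified: treats the autosomal coverage as exact).
n = reads_x + reads_autosomes
p = reads_x / n
se_p = math.sqrt(p * (1 - p) / n)

print(f"X-rate: {x_rate:.3f}")
print(f"Proportion of reads on X: {p:.4f} +/- {se_p:.4f}")
```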

Note that in nf-core/eager this will be run on single- and double-stranded variants of the same library separately. This can also help assess for differential contamination between libraries.

Genotyping

pileupCaller

Output files
  • genotyping/

    • *.geno: Eigenstrat-formatted file containing the table of genotype calls.
    • *.snp: Eigenstrat-formatted file containing the SNP annotation of the genotype table.
    • *.ind: Eigenstrat-formatted file containing the individual annotation of the genotype table.
    • *_coverage.tsv: Tab-separated file containing the number of covered SNPs and total number of SNPs for each individual in the eigenstrat dataset.

sequencetools pileupCaller is a tool for creating genotype calls from BAM files using read-sampling methods. It can call genotypes from samtools mpileup output, and is often used for ancient DNA. The output files are in Eigenstrat format, in which the genotype dataset is split across three different files (similar to PLINK). The .geno file contains the genotype table: each number represents the genotype call of an individual at a specific position, with each column representing an individual and each row a SNP. The .snp file contains the SNP annotation of the genotype table, and the .ind file contains the individual annotation. We provide an additional file named *_coverage.tsv, which contains the number of covered SNPs and the total number of SNPs for each individual in the Eigenstrat dataset. One dataset is generated per reference genome, which includes genotypes from both single- and double-stranded libraries.

When using pileupCaller for genotyping, single-stranded and double-stranded libraries are genotyped separately. Single-stranded libraries are genotyped with the additional option --singleStrandMode, which ensures that deamination damage artefacts cannot affect the genotype calls, by only using the forward- or reverse-mapping reads when genotyping transitions (depending on the alleles of the transition).
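For example, the per-individual SNP coverage in *_coverage.tsv can be reproduced from the .geno file with a short Python sketch. File and sample names are illustrative; in Eigenstrat .geno files, 9 denotes missing data.

```python
# Count covered SNPs per individual from an Eigenstrat .geno file.
individuals = ["sample1", "sample2"]  # order as listed in the .ind file
covered = [0] * len(individuals)
total_snps = 0

with open("dataset.geno") as handle:
    for line in handle:
        row = line.strip()
        total_snps += 1
        for i, call in enumerate(row):
            if call != "9":  # 9 = missing genotype
                covered[i] += 1

for name, n_covered in zip(individuals, covered):
    print(f"{name}\t{n_covered}\t{total_snps}")
```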

GATK UnifiedGenotyper

Output files
  • genotyping/

    • *.vcf.gz: VCF file containing the genotype calls for each sample.
    • *.vcf.gz.tbi: Tabix index file for the VCF file.
    • *.stats.txt: Statistics of the VCF file from bcftools stats.

GATK’s UnifiedGenotyper uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting a genotype for each sample. The system can either emit just the variant sites or complete genotypes (which include homozygous reference calls) satisfying some phred-scaled confidence value. This tool has been deprecated by the GATK developers in favour of HaplotypeCaller, but is still considered a preferable genotyper for ancient DNA given its ability to handle low-coverage data. The output provided is a bgzipped VCF file containing the genotype calls of each sample, its index file, and the statistics of the VCF file generated by bcftools stats.

GATK HaplotypeCaller

Output files
  • genotyping/

    • *.vcf.gz: VCF file containing the genotype calls for each sample.
    • *.vcf.gz.tbi: Tabix index file for the VCF file.
    • *.stats.txt: Statistics of the VCF file from bcftools stats.

GATK’s HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. This is the preferred genotyper for modern DNA. The output provided is a bgzipped VCF file containing the genotype calls of each sample, its index file, and the statistics of the VCF file generated by bcftools stats.

FreeBayes

Output files
  • genotyping/

    • *.vcf.gz: VCF file containing the genotype calls for each sample.
    • *.vcf.gz.tbi: Tabix index file for the VCF file.
    • *.stats.txt: Statistics of the VCF file from bcftools stats.

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. It calls variants based on the literal sequences of reads aligned to a particular target, not their precise alignment. This model is a straightforward generalisation of previous ones (e.g. PolyBayes, samtools, GATK) which detect or report variants based on alignments. This method avoids one of the core problems with alignment-based variant detection - that identical sequences may have multiple possible alignments. The output provided is a bgzipped VCF file containing the genotype calls of each sample, its index file, and the statistics of the VCF file generated by bcftools stats.

ANGSD

Output files
  • genotyping/

    • *.{glf,beagle}.gz: Genotype likelihood file, containing likelihoods across all samples per reference.

ANGSD is software for analyzing next-generation sequencing data. It can estimate genotype likelihoods and allele frequencies from next-generation sequencing data. The output provided is a bgzipped genotype likelihood file, containing likelihoods across all samples per reference. Users can specify the model used for genotype likelihood estimation, as well as the output format. For more information on the available options, see the ANGSD documentation.