Introduction

Viralgenie is a bioinformatics best-practice analysis pipeline for reconstructing consensus genomes and identifying intra-host variants from metagenomic sequencing data or enrichment-based sequencing data such as hybrid capture.

Pipeline summary

[Figure: viralgenie workflow diagram]

  1. Read QC (FastQC)
  2. [Optional] Read pre-processing
  3. Metagenomic diversity mapping
    • Taxonomic classification and/or profiling, for example with Kraken2 and/or Kaiju
    • Plot Kraken2 and Kaiju results (Krona)
  4. De novo assembly (SPAdes, Trinity, MEGAHIT), then combine the contigs
  5. [Optional] Extend the contigs with sspace_basic and filter with prinseq++
  6. [Optional] Map reads to contigs for coverage estimation (Bowtie2, BWA-MEM2 or BWA)
  7. Contig reference identification (blastn)
    • Identify top 5 blast hits
    • Merge the blast hits with all contigs of a sample
  8. [Optional] Precluster contigs based on taxonomy
    • Identify taxonomy with Kraken2 and/or Kaiju
    • Resolve potential inconsistencies in taxonomy, and filter or simplify taxa (bin/extract_precluster.py)
  9. Cluster contigs (or every taxonomic bin) of samples, options are:
  10. [Optional] Remove clusters with low read coverage (bin/extract_clusters.py)
  11. Scaffolding of contigs to centroid (Minimap2, iVar-consensus)
  12. [Optional] Annotate 0-depth regions with an external reference (bin/nocov_to_reference.py)
  13. [Optional] Select best reference from --mapping_constraints:
  14. Map filtered reads to the supercontig and mapping constraints (Bowtie2, BWA-MEM2 or BWA)
  15. [Optional] Deduplicate reads (Picard, or UMI-tools if UMIs are used)
  16. Variant calling and filtering (BCFtools, iVar)
  17. Create consensus genome (BCFtools, iVar)
  18. Repeat steps 12-15 multiple times for the de novo contig route
  19. Consensus evaluation and annotation (QUAST, CheckV, blastn, Prokka, mmseqs-search, MAFFT: alignment of contigs vs. iterations and consensus)
  20. Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results (MultiQC)

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,fastq_1,fastq_2
sample1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
sample2,AEG588A5_S5_L003_R1_001.fastq.gz,
sample3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz

Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end). For single-end data, leave the fastq_2 column empty, as in the sample2 row.
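
Before launching a run, it can help to sanity-check the samplesheet. The snippet below is a minimal sketch and not part of viralgenie: check_samplesheet is a hypothetical helper name, and the rules it enforces (exact header, three columns per row, non-empty sample and fastq_1) are assumptions inferred from the example above. The pipeline's own validation may be stricter or may accept additional columns.

```shell
# Hypothetical helper (not shipped with viralgenie): basic samplesheet checks.
# Returns 0 if the file looks like the example above, non-zero otherwise.
check_samplesheet() {
    awk -F',' '
        # The first line must be the header shown in the example (assumption).
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        # Ignore blank lines.
        $0 == "" { next }
        # Each data row needs three columns; fastq_2 may be empty (single-end).
        NF != 3 {
            print "line " NR ": expected 3 columns, got " NF
            bad = 1
            next
        }
        # sample and fastq_1 must always be filled in.
        $1 == "" || $2 == "" {
            print "line " NR ": sample and fastq_1 are required"
            bad = 1
        }
        END {
            # An empty file is also treated as invalid.
            if (NR == 0) bad = 1
            exit bad + 0
        }
    ' "$1"
}
```

For example, check_samplesheet samplesheet.csv && echo "samplesheet looks OK" prints the message only when every row passes. The check does not verify that the FASTQ files exist on disk.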

Now, you can run the pipeline using:

nextflow run Joon-Klaps/viralgenie \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>

Warning

Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
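
As an illustration of the -params-file route, the sketch below writes the two parameters from the command above into a YAML file. The file name params.yaml and the output directory name results are placeholders chosen for this example, not values prescribed by the pipeline. The nextflow line is left commented out so the snippet can be pasted without starting a run.

```shell
# Write the pipeline parameters to a YAML params file.
# "params.yaml" and "results" are illustrative names (assumptions).
cat > params.yaml <<'EOF'
input: samplesheet.csv
outdir: results
EOF

# Launch with the params file instead of --input/--outdir on the command line:
# nextflow run Joon-Klaps/viralgenie \
#    -profile <docker/singularity/.../institute> \
#    -params-file params.yaml
```

Keys in the file correspond to the double-dash parameter names without the leading dashes, so --input becomes input.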

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Credits

Viralgenie was originally written by Joon-Klaps.

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

Warning

Viralgenie has not yet been published. Please cite it as: GitHub, https://github.com/Joon-Klaps/viralgenie

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.