Introduction

This document describes the output produced by the pipeline. The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data according to the type of ids provided; each supported type is described in its own section below.

Please see the usage documentation for a list of supported public repository identifiers and how to provide them to the pipeline.

SRA / ENA / DDBJ / GEO ids

Output files
  • fastq/
    • *.fastq.gz: Paired-end/single-end reads downloaded from the SRA / ENA / DDBJ / GEO.
  • fastq/md5/
    • *.md5: Files containing the md5 sums of FastQ files downloaded from the ENA.
  • samplesheet/
    • samplesheet.csv: Auto-created samplesheet with collated metadata and paths to downloaded FastQ files.
    • id_mappings.csv: File with selected fields that can be used to rename samples to more informative names; see --sample_mapping_fields parameter to customise this behaviour.
    • multiqc_config.yml: MultiQC config file that can be passed to most nf-core pipelines via the --multiqc_config parameter for bulk renaming of sample names from database ids; see --sample_mapping_fields parameter to customise this behaviour.
  • metadata/
    • *.runinfo_ftp.tsv: Re-formatted metadata file downloaded from the ENA.
    • *.runinfo.tsv: Original metadata file downloaded from the ENA.
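
As an illustration of how the auto-created samplesheet might be checked before it is passed to a downstream pipeline, the sketch below reports rows whose FastQ paths are missing on disk. The column names fastq_1 and fastq_2 and the helper missing_fastqs are assumptions made for this example; check the header of your own samplesheet.csv before relying on them.

```python
import csv
import os


def missing_fastqs(samplesheet_path, columns=("fastq_1", "fastq_2")):
    """Return (row_number, column, path) for every listed FastQ file that does not exist.

    The column names are assumptions; empty cells (e.g. single-end reads) are skipped.
    """
    missing = []
    with open(samplesheet_path, newline="") as handle:
        for row_number, row in enumerate(csv.DictReader(handle), start=1):
            for column in columns:
                path = (row.get(column) or "").strip()
                if path and not os.path.exists(path):
                    missing.append((row_number, column, path))
    return missing
```

An empty list means every path listed in the assumed columns was found.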

The final sample information for all identifiers is obtained from the ENA which provides direct download links for FastQ files as well as their associated md5 sums. If download links exist, the files will be downloaded in parallel by FTP. Otherwise they are downloaded using sra-tools.
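
To confirm that a download completed intact, the md5 sums stored under fastq/md5/ can be compared against the FastQ files. The sketch below is a minimal example, assuming each .md5 file begins with the hex digest in the common md5sum layout (digest first, then the file name). The function names are hypothetical and not part of the pipeline.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(fastq_path, md5_path):
    """Compare a FastQ file against the digest in the first field of an .md5 file.

    Assumption: the .md5 file starts with the hex digest, md5sum-style.
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return file_md5(fastq_path) == expected
```

Reading in chunks keeps memory use flat even for large FastQ files.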

Synapse ids

Output files
  • fastq/
    • *.fastq.gz: Paired-end/single-end reads downloaded from Synapse.
  • fastq/md5/
    • *.md5: Files containing the md5 sums of FastQ files downloaded from the Synapse platform.
  • samplesheet/
    • samplesheet.csv: Auto-created samplesheet with collated metadata and paths to downloaded FastQ files.
  • metadata/
    • *.metadata.txt: Original metadata file generated using the synapse show command.
    • *.list.txt: Original output of the synapse list command, containing the Synapse ids, file version numbers, file names, and other file-specific data for the Synapse directory ID provided.

FastQ files and corresponding sample information for Synapse identifiers are downloaded in parallel directly from the Synapse platform. A configuration file containing valid login credentials is required for Synapse downloads.

The final sample information for the FastQ files downloaded from Synapse is obtained from the file names themselves. The file names are parsed according to the glob pattern *{1,2}*. The sample name is taken to be the longest possible string that matches the glob pattern with the fewest wildcard insertions. Further information on sample name parsing can be found in the usage documentation.
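
One way to read the rule above is as a greedy match: take the longest prefix that is still followed by a 1 or a 2, then drop trailing separators. The sketch below illustrates that interpretation only. The regular expression, the separator stripping and the function name sample_name are assumptions, and the pipeline's own parsing may differ in edge cases.

```python
import re

# Regex rendering of the glob *{1,2}*: a greedy prefix, a read number, any suffix.
READ_PATTERN = re.compile(r"^(.*)[12](.*)$")


def sample_name(file_name):
    """Return the longest prefix followed by a 1 or 2, without trailing separators.

    Returns None when the file name contains neither digit.
    """
    match = READ_PATTERN.match(file_name)
    if match is None:
        return None
    return match.group(1).rstrip("_.-")
```

Under this reading, both files of a pair such as sampleA_1.fastq.gz and sampleA_2.fastq.gz map to the same sample name, which is what allows them to be grouped.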

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.tsv.

Nextflow can generate several reports on the running and execution of the pipeline. These reports help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.
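
As a sketch of how the trace report can support troubleshooting, the example below lists tasks that did not complete. It assumes execution_trace.txt is tab-separated with a header row containing name, status and exit columns, as in Nextflow's default trace layout; adjust the column names if your trace is configured differently. The function name is hypothetical.

```python
import csv


def unfinished_tasks(trace_path):
    """Return (name, status, exit) for each task whose status is not COMPLETED or CACHED."""
    problems = []
    with open(trace_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            status = row.get("status", "")
            if status not in ("COMPLETED", "CACHED"):
                problems.append((row.get("name", ""), status, row.get("exit", "")))
    return problems
```

For example, calling unfinished_tasks("pipeline_info/execution_trace.txt") from the results directory returns an empty list when every recorded task completed or was cached.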