Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

CCS

Output files
  • 01_PBCCS/
    • <sample>.chunk<X>.bam: The CCS sequences
    • <sample>.chunk<X>.bam.pbi: The PacBio index of the CCS file
    • <sample>.chunk<X>.metrics.json.gz: Statistics for each ZMW
    • <sample>.chunk<X>.report.json: General statistics about the generated CCS sequences in JSON format
    • <sample>.chunk<X>.report.txt: General statistics about the generated CCS sequences in text format

CCS generates Circular Consensus Sequences from the subreads. It reports the number of selected and discarded ZMWs and the reasons for discarding.
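
As an illustration, here is a minimal Python sketch for inspecting the per-chunk CCS reports. The paths are hypothetical (substitute your own sample and chunk names), and the exact JSON keys depend on the ccs version, so the sketch dumps whatever attributes are present rather than assuming specific field names.

```python
import gzip
import json

# Hypothetical per-chunk paths; adjust to your sample and chunk names.
report_path = "01_PBCCS/sample.chunk1.report.json"
metrics_path = "01_PBCCS/sample.chunk1.metrics.json.gz"

# The report is plain JSON; dump it as-is since its keys vary between ccs versions.
with open(report_path) as handle:
    report = json.load(handle)
print(json.dumps(report, indent=2))

# The per-ZMW metrics are gzip-compressed JSON; peek at the top-level structure.
with gzip.open(metrics_path, "rt") as handle:
    metrics = json.load(handle)
print(list(metrics) if isinstance(metrics, dict) else f"{len(metrics)} records")
```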

LIMA

Output files
  • 02_LIMA/
    • <sample>.chunk<X>_flnc.json: Metadata about the generated XML file
    • <sample>.chunk<X>_flnc.lima.clips: Clipped sequences
    • <sample>.chunk<X>_flnc.lima.counts: Statistics about detected primer pairs
    • <sample>.chunk<X>_flnc.lima.guess: Statistics about detected primer pairs
    • <sample>.chunk<X>_flnc.lima.report: Detailed statistics on primer pairs for each sequence
    • <sample>.chunk<X>_flnc.lima.summary: General statistics about selected and rejected sequences
    • <sample>.chunk<X>_flnc.primer_5p--primer_3p.bam: Selected sequences
    • <sample>.chunk<X>_flnc.primer_5p--primer_3p.bam.pbi: PacBio index of the selected sequences
    • <sample>.chunk<X>_flnc.primer_5p--primer_3p.consensusreadset.xml: Metadata for the selected sequences

LIMA cleans the generated CCS reads: it selects sequences containing a valid pair of primers and removes the primers from them.
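
For a quick QC check, the counts and summary files can be read directly. A minimal sketch, assuming illustrative per-chunk paths; the column names in the counts file depend on the lima version, so the header row is used as-is instead of hard-coding them.

```python
import csv

# Per-primer-pair counts: a tab-separated file with a header row.
with open("02_LIMA/sample.chunk1_flnc.lima.counts") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        print(row)

# The summary is plain text; printing it is usually enough for a quick look.
with open("02_LIMA/sample.chunk1_flnc.lima.summary") as handle:
    print(handle.read())
```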

ISOSEQ REFINE

Output files
  • 03_ISOSEQ3_REFINE/
    • <sample>.chunk<X>.bam: The selected sequences
    • <sample>.chunk<X>.bam.pbi: PacBio index of the selected sequences
    • <sample>.chunk<X>.consensusreadset.xml: Metadata
    • <sample>.chunk<X>.filter_summary.json: Numbers of Full-Length (FL), Full-Length Non-Chimeric (FLNC) and Full-Length Non-Chimeric PolyA reads
    • <sample>.chunk<X>.report.csv: Primers and insert length for each read

ISOSEQ REFINE discards chimeric reads.
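
A minimal sketch for reading the filter summary. The keys "num_reads_fl", "num_reads_flnc" and "num_reads_flnc_polya" are assumptions based on common isoseq refine output and may differ between versions, so the sketch falls back to listing the available keys; the path is illustrative.

```python
import json

with open("03_ISOSEQ3_REFINE/sample.chunk1.filter_summary.json") as handle:
    summary = json.load(handle)

# Assumed key names; if they are absent, show what the file actually contains.
for key in ("num_reads_fl", "num_reads_flnc", "num_reads_flnc_polya"):
    if key in summary:
        print(f"{key}: {summary[key]}")
    else:
        print(f"{key} not found; available keys: {sorted(summary)}")
        break
```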

BAMTOOLS CONVERT

Output files
  • 04_BAMTOOLS_CONVERT/
    • <sample>.chunk<X>.fasta: The reads in FASTA format.

BAMTOOLS CONVERT converts the reads from BAM format to FASTA format.
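
A quick sanity check on the converted file is to count reads and bases. A minimal sketch with an illustrative path:

```python
# Count reads and total bases in the converted FASTA.
n_reads = 0
n_bases = 0
with open("04_BAMTOOLS_CONVERT/sample.chunk1.fasta") as handle:
    for line in handle:
        if line.startswith(">"):
            n_reads += 1
        else:
            n_bases += len(line.strip())
print(f"{n_reads} reads, {n_bases} bases")
```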

TAMA POLYA CLEAN UP

Output files
  • 05_GSTAMA_POLYACLEANUP/
    • <sample>.chunk<X>_tama.fa.gz: The polyA-tail-free reads.
    • <sample>.chunk<X>_polya_flnc_report.txt.gz: Lengths of the removed tails.
    • <sample>.chunk<X>_tama_tails.fa.gz: Sequences of the removed tails.

TAMA polyA cleanup (GSTAMA_POLYACLEANUP) removes polyA tails from the selected reads.
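
For example, the distribution of removed tail lengths can be computed directly from the tails FASTA. A minimal sketch with an illustrative path:

```python
import gzip
from collections import Counter

# Length distribution of the removed polyA tails.
lengths = Counter()
current_len = None
with gzip.open("05_GSTAMA_POLYACLEANUP/sample.chunk1_tama_tails.fa.gz", "rt") as handle:
    for line in handle:
        if line.startswith(">"):
            if current_len is not None:
                lengths[current_len] += 1
            current_len = 0
        else:
            current_len += len(line.strip())
if current_len is not None:
    lengths[current_len] += 1

for length, count in sorted(lengths.items()):
    print(f"tail length {length}: {count} reads")
```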

GUNZIP

Output files
  • 06.1_GUNZIP/
    • <sample>.chunk<X>_tama.fa: The uncompressed polyA-tail-free reads.

GUNZIP uncompresses the FLNC reads for alignment with uLTRA (gzipped input is not yet handled by uLTRA).
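
The equivalent operation in Python is a straightforward stream decompression; paths are illustrative and the output directory is assumed to exist.

```python
import gzip
import shutil

# Stream-decompress the polyA-cleaned FASTA (equivalent of the GUNZIP step).
with gzip.open("05_GSTAMA_POLYACLEANUP/sample.chunk1_tama.fa.gz", "rb") as src, \
        open("06.1_GUNZIP/sample.chunk1_tama.fa", "wb") as dst:
    shutil.copyfileobj(src, dst)
```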

ULTRA or MINIMAP2

Output files
  • 06.2_ULTRA/ or 06_MINIMAP2/
    • <sample>.chunk<X>.sam: The aligned reads.

MINIMAP2 or uLTRA aligns the reads onto the genome.
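
Since the output is plain SAM, a quick mapped/unmapped tally needs no external library. A minimal sketch with an illustrative path (flag bit 0x4 marks an unmapped record):

```python
# Count mapped and unmapped records in the SAM output.
mapped, unmapped = 0, 0
with open("06_MINIMAP2/sample.chunk1.sam") as handle:
    for line in handle:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
print(f"mapped: {mapped}, unmapped: {unmapped}")
```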

BIOPERL

Output files
  • 06.3_PERL_BIOPERL/
    • <sample>.chunk<X>_filtered.sam: The aligned reads with spurious alignments removed.

BIOPERL removes spurious alignments: some alignments have a CIGAR string containing an unexpected gap (N), which can happen when using a GFF file converted to a GTF file. See issue #11 in the uLTRA repository.
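
For illustration only, one plausible interpretation of this filter is to drop records whose CIGAR string starts or ends with a reference skip (N); the actual criterion used by the BioPerl script may differ, and the paths are hypothetical.

```python
import re

# Assumed filtering rule (sketch): discard alignments whose CIGAR starts or ends
# with an N operation. Header lines and the remaining alignments are kept.
cigar_edge_gap = re.compile(r"^\d+N|\d+N$")

with open("06.2_ULTRA/sample.chunk1.sam") as src, \
        open("06.3_PERL_BIOPERL/sample.chunk1_filtered.sam", "w") as dst:
    for line in src:
        if line.startswith("@"):
            dst.write(line)   # keep the header
            continue
        cigar = line.split("\t")[5]
        if cigar_edge_gap.search(cigar):
            continue          # drop the spurious alignment
        dst.write(line)
```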

SAMTOOLS SORT

Output files
  • 07_SAMTOOLS_SORT/
    • <sample>.chunk<X>_sorted.bam: The sorted aligned reads.

SAMTOOLS SORT sorts the aligned reads and converts the SAM file into a BAM file.
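
The equivalent standalone command can be invoked from Python via subprocess; a minimal sketch, assuming samtools is on PATH and using illustrative file names.

```python
import subprocess

# Coordinate-sort the filtered alignments and emit BAM.
subprocess.run(
    [
        "samtools", "sort",
        "-O", "bam",
        "-o", "07_SAMTOOLS_SORT/sample.chunk1_sorted.bam",
        "06.3_PERL_BIOPERL/sample.chunk1_filtered.sam",
    ],
    check=True,
)
```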

TAMA COLLAPSE

Output files
  • 08_GSTAMA_COLLAPSE/
    • <sample>.chunk<X>_collapsed.bed: This is a bed12 format file containing the final collapsed version of your transcriptome
    • <sample>.chunk<X>_local_density_error.txt: This file contains the log of filtering for local density error around the splice junctions
    • <sample>.chunk<X>_polya.txt: This file contains the reads with potential poly A truncation
    • <sample>.chunk<X>_read.txt: This file contains information for all mapped reads from the input SAM/BAM file.
    • <sample>.chunk<X>_strand_check.txt: This file shows instances where the SAM flag strand information contradicts the GMAP strand information.
    • <sample>.chunk<X>_trans_read.bed: This file uses bed12 format to show the transcript model for each read based on the mapping prior to collapsing.
    • <sample>.chunk<X>_trans_report.txt: This file contains collapsing information for each transcript
    • <sample>.chunk<X>_varcov.txt: This file contains the coverage information for each variant detected.
    • <sample>.chunk<X>_variants.txt: This file contains the variants called

TAMA Collapse is a tool that allows you to collapse redundant transcript models in your Iso-Seq data.
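
Because the main output is standard BED12, it can be summarised with a few lines of Python. A minimal sketch (illustrative path) that counts transcripts per chromosome and derives each transcript's spliced length from the block sizes:

```python
from collections import Counter

per_chrom = Counter()
with open("08_GSTAMA_COLLAPSE/sample.chunk1_collapsed.bed") as handle:
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        chrom, name = fields[0], fields[3]
        # Column 11 of BED12 holds the comma-separated exon block sizes.
        block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
        per_chrom[chrom] += 1
        print(f"{name}\t{chrom}\tspliced_length={sum(block_sizes)}")

for chrom, count in per_chrom.most_common():
    print(f"{chrom}: {count} transcripts")
```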

TAMA FILE LIST

Output files
  • 09_GSTAMA_FILELIST/
    • <sample>.tsv: A TSV file listing the BED files to merge with TAMA merge

TAMA FILELIST is an in-house script that generates the input file list for TAMA merge.
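
A minimal sketch of how such a list could be built. The four-column layout (file name, cap flag, merge priority, source name) follows TAMA merge's documented file-list format, but the column values shown ("no_cap", "1,1,1", chunk labels) are only illustrative defaults, and the paths are hypothetical.

```python
import glob

# Build one row per per-chunk collapsed BED (output directory assumed to exist).
beds = sorted(glob.glob("08_GSTAMA_COLLAPSE/sample.chunk*_collapsed.bed"))
with open("09_GSTAMA_FILELIST/sample.tsv", "w") as out:
    for i, bed in enumerate(beds, start=1):
        out.write(f"{bed}\tno_cap\t1,1,1\tchunk{i}\n")
```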

TAMA MERGE

Output files
  • 10_GSTAMA_MERGE/
    • <sample>.bed: This is the main merged annotation file.
    • <sample>_gene_report.txt: This contains a report of the genes from the merged file.
    • <sample>_merge.txt: This is a bed12 format file which shows the coordinates of each input transcript matched to the merged transcript ID.
    • <sample>_trans_report.txt: This contains the source information for each merged transcript.

TAMA Merge is a tool that allows you to merge multiple transcriptomes while maintaining source information.
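
As an example, transcripts per gene can be tallied from the merged BED. The sketch assumes TAMA's usual "gene_id;transcript_id" encoding of the BED name field, which is an assumption; records without a ";" are counted under their full name. The path is illustrative.

```python
from collections import defaultdict

transcripts_per_gene = defaultdict(int)
with open("10_GSTAMA_MERGE/sample.bed") as handle:
    for line in handle:
        name = line.rstrip("\n").split("\t")[3]
        gene_id = name.split(";")[0]  # assumed "gene_id;transcript_id" naming
        transcripts_per_gene[gene_id] += 1

print(f"{len(transcripts_per_gene)} genes, "
      f"{sum(transcripts_per_gene.values())} transcripts")
```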

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.