This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes data using the following steps:
- CCS - Generate CCS sequences
- LIMA - Remove primer sequences from CCS
- ISOSEQ REFINE - Detect and remove chimeric reads
- BAMTOOLS CONVERT - Convert bam file into fasta file
- TAMA POLYA CLEAN UP - Detect and trim polyA tails from reads
- GUNZIP - Decompress FLNC fastas (uLTRA path only)
- ULTRA or MINIMAP2 - Map FLNCs on genome
- BIOPERL - Remove spurious alignments (uLTRA path only, Issue #11)
- SAMTOOLS SORT - Sort alignment and convert sam file into bam file
- TAMA FILE LIST - Prepare list file for TAMA collapse
- TAMA COLLAPSE - Clean gene models
- TAMA MERGE - Merge all annotations into one for each sample with TAMA merge
- Pipeline information - Report metrics generated during the workflow execution
<sample>.chunk<X>.bam: The CCS sequences
<sample>.chunk<X>.bam.pbi: The PacBio index of CCS files
<sample>.chunk<X>.metrics.json.gz: Statistics for each ZMW
<sample>.chunk<X>.report.json: General statistics about generated CCS sequences in json format
<sample>.chunk<X>.report.txt: General statistics about generated CCS sequences in txt format
CCS generates Circular Consensus Sequences from subreads. It reports the number of selected and discarded ZMWs and the reasons for discarding them.
<sample>.chunk<X>_flnc.json: Metadata about generated xml file
<sample>.chunk<X>_flnc.lima.clips: Clipped sequences
<sample>.chunk<X>_flnc.lima.counts: Statistics about detected primer pairs
<sample>.chunk<X>_flnc.lima.guess: Statistics about detected primer pairs
<sample>.chunk<X>_flnc.lima.report: Detailed statistics on primer pairs for each sequence
<sample>.chunk<X>_flnc.lima.summary: General statistics about selected and rejected sequences
<sample>.chunk<X>_flnc.primer_5p--primer_3p.bam: Selected sequences
<sample>.chunk<X>_flnc.primer_5p--primer_3p.bam.pbi: PacBio index of selected sequences
<sample>.chunk<X>_flnc.primer_5p--primer_3p.consensusreadset.xml: Selected sequences metadata
LIMA cleans the generated CCS reads. It selects sequences containing a valid pair of primers and removes the primers from them.
<sample>.chunk<X>.bam: Selected sequences
<sample>.chunk<X>.bam.pbi: PacBio index of selected sequences
<sample>.chunk<X>.filter_summary.json: Number of Full Length, Full Length Non Chimeric, Full Length Non Chimeric PolyA
<sample>.chunk<X>.report.csv: Primers and insert length of each read
ISOSEQ REFINE discards chimeric reads.
<sample>.chunk<X>.fasta: The reads in fasta format.
BAMTOOLS CONVERT converts reads from BAM format into fasta format.
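To sanity-check a converted fasta file, it can help to count the reads and their lengths. The following is a minimal sketch that relies only on the standard FASTA layout (a `>` header line followed by sequence lines); the function name `fasta_lengths` and the example reads are hypothetical and are not part of the pipeline.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {read_id: sequence_length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the read ID is the first whitespace-delimited token.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence line: sequences may be wrapped over several lines.
            lengths[current] += len(line)
    return lengths


# Hypothetical two-read example, not real pipeline output.
example = [">read1 extra", "ACGT", "AC", ">read2", "GGG"]
print(fasta_lengths(example))  # {'read1': 6, 'read2': 3}
```

Because the function accepts any iterable of lines, an open file handle can be passed directly instead of a list.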
TAMA POLYA CLEAN UP
<sample>.chunk<X>_tama.fa.gz: The polyA tail free reads.
<sample>.chunk<X>_polya_flnc_report.txt.gz: Length of removed tails.
<sample>.chunk<X>_tama_tails.fa.gz: Sequence of removed tails.
GSTAMA_POLYACLEANUP TAMA cleanup removes polyA tails from the selected reads.
<sample>.chunk<X>_tama.fa: The polyA tail free reads uncompressed.
GUNZIP decompresses the FLNC fastas so they can be aligned with uLTRA (uLTRA does not yet handle gzip-compressed input).
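The same decompression can be reproduced outside the pipeline when a gzipped fasta needs to be inspected by a tool that cannot read gzip. This is a minimal sketch using only the Python standard library; the helper name `gunzip` is hypothetical and this is not the pipeline's own implementation.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: Path, dest: Path) -> Path:
    """Stream-decompress a gzip file to dest without loading it into memory."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # copyfileobj copies in fixed-size chunks, so large fastas are safe.
        shutil.copyfileobj(fin, fout)
    return dest
```

A call would look like `gunzip(Path("reads.fa.gz"), Path("reads.fa"))`, where both file names are placeholders.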
ULTRA or MINIMAP2
<sample>.chunk<X>.sam: The aligned reads.
<sample>.chunk<X>_filtered.sam: The aligned reads with spurious alignments removed.
<sample>.chunk<X>_sorted.bam: The sorted aligned reads.
SAMTOOLS SORT sorts the aligned reads and converts the SAM file into a BAM file.
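A quick way to check an alignment file in SAM text format is to count mapped and unmapped records. The sketch below relies only on the SAM specification (header lines start with `@`, the second column is the FLAG, and bit `0x4` marks an unmapped segment). The function name `count_mapped` and the example records are hypothetical, and the counts are per alignment record, so a read with secondary or supplementary alignments is counted more than once.

```python
from typing import Iterable, Tuple

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def count_mapped(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (mapped_records, unmapped_records) for SAM text lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines such as @HD, @SQ and @PG.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Hypothetical records, not real pipeline output.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped(example))  # (2, 1)
```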
<sample>.chunk<X>_collapsed.bed: This is a bed12 format file containing the final collapsed version of your transcriptome
<sample>.chunk<X>_local_density_error.txt: This file contains the log of filtering for local density error around the splice junctions
<sample>.chunk<X>_polya.txt: This file contains the reads with potential poly A truncation
<sample>.chunk<X>_read.txt: This file contains information for all mapped reads from the input SAM/BAM file.
<sample>.chunk<X>_strand_check.txt: This file shows instances where the sam flag strand information contrasted the GMAP strand information.
<sample>.chunk<X>_trans_read.bed: This file uses bed12 format to show the transcript model for each read based on the mapping prior to collapsing.
<sample>.chunk<X>_trans_report.txt: This file contains collapsing information for each transcript
<sample>.chunk<X>_varcov.txt: This file contains the coverage information for each variant detected.
<sample>.chunk<X>_variants.txt: This file contains the variants called
TAMA COLLAPSE TAMA Collapse is a tool that allows you to collapse redundant transcript models in your Iso-Seq data.
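The collapsed and per-read transcript models are written in bed12 format, where the exon structure is encoded in the last two columns (block sizes and block starts relative to the feature start). The following minimal sketch decodes one bed12 line into absolute exon coordinates. It assumes the standard twelve tab-separated BED columns; the `Bed12` type, the `parse_bed12` function and the example line are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Bed12(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    strand: str
    exons: List[Tuple[int, int]]  # absolute, 0-based, half-open intervals


def parse_bed12(line: str) -> Bed12:
    """Parse one bed12 line and expand its blocks into exon coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    start = int(fields[1])
    # Columns 11 and 12 are comma-separated and may end with a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    exons = [(start + off, start + off + size) for off, size in zip(offsets, sizes)]
    return Bed12(fields[0], start, int(fields[2]), fields[3], fields[5], exons)


# Hypothetical two-exon transcript, not real pipeline output.
example = "chr1\t100\t500\ttx1\t40\t+\t100\t500\t255,0,0\t2\t50,100\t0,300"
print(parse_bed12(example).exons)  # [(100, 150), (400, 500)]
```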
TAMA FILE LIST
<sample>.tsv: A tsv listing bed files to merge with TAMA merge
TAMA FILELIST is an in-house script that generates the input file list for TAMA merge.
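To illustrate what such a file list can look like, here is a minimal sketch that builds the rows for a set of bed files. It assumes the four tab-separated columns described in the TAMA merge documentation (bed file, cap flag, merge priority, source name). The function name `filelist_rows` and the default values `capped` and `1,1,1` are placeholders chosen for illustration; the values written by the pipeline's own script are not stated here and may differ.

```python
from pathlib import Path
from typing import Iterable, List


def filelist_rows(
    bed_files: Iterable[str],
    cap_flag: str = "capped",   # assumption: placeholder cap flag
    priority: str = "1,1,1",    # assumption: placeholder start,junctions,end priority
) -> List[str]:
    """Build one tab-separated row per bed file, in sorted order."""
    rows = []
    for bed in sorted(bed_files):
        # Use the file name without its .bed extension as the source name.
        source = Path(bed).stem
        rows.append("\t".join([bed, cap_flag, priority, source]))
    return rows


# Hypothetical chunk files for one sample.
beds = ["sampleA.chunk2_collapsed.bed", "sampleA.chunk1_collapsed.bed"]
print("\n".join(filelist_rows(beds)))
```

Sorting the inputs keeps the list deterministic across runs, which makes the merged output easier to compare between executions.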
<sample>.bed: This is the main merged annotation file.
<sample>_gene_report.txt: This contains a report of the genes from the merged file.
<sample>_merge.txt: This contains a bed12 format file which shows the coordinates of each input transcript matched to the merged transcript ID.
<sample>_trans_report.txt: This contains the source information for each merged transcript.
TAMA MERGE TAMA Merge is a tool that allows you to merge multiple transcriptomes while maintaining source information.
multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
multiqc_plots/: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
- Reports generated by Nextflow:
- Reports generated by the pipeline:
pipeline_report* files will only be present if the --email_on_fail parameter is used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline:
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.