nf-core/isoseq

Genome annotation with PacBio Iso-Seq. Takes raw subreads as input, generates Full-Length Non-Chimeric (FLNC) sequences and produces a BED annotation.


These pages are for an old version of the pipeline (1.1.0). The latest stable release is 2.0.0.

Launch version 1.1.0 https://github.com/nf-core/isoseq

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

CCS - Generate CCS sequences
LIMA - Remove primer sequences from CCS
ISOSEQ REFINE - Detect and remove chimeric reads
BAMTOOLS CONVERT - Convert bam file into fasta file
TAMA POLYA CLEAN UP - Detect and trim polyA tails from reads
GUNZIP - Decompress FLNC fastas (uLTRA path only)
ULTRA or MINIMAP2 - Map FLNCs to the genome
BIOPERL - Remove spurious alignments (uLTRA path only, Issue #11)
SAMTOOLS SORT - Sort alignments and convert the SAM file into a BAM file
TAMA COLLAPSE - Clean gene models
TAMA FILE LIST - Prepare list file for TAMA merge
TAMA MERGE - Merge all annotations into one for each sample with TAMA merge
Pipeline information - Report metrics generated during the workflow execution
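
As a quick sanity check after a run, the steps above can be matched against the output directories described on this page. The following is a minimal sketch, not part of the pipeline: the function name `missing_output_dirs` and the `aligner` keys `"ultra"` and `"minimap2"` are hypothetical, and the directory names are simply copied from the sections below.

```python
from pathlib import Path

# Directory names are copied from this page. Which alignment directories
# exist depends on the aligner that was used for the run.
COMMON_DIRS = [
    "01_PBCCS", "02_LIMA", "03_ISOSEQ3_REFINE", "04_BAMTOOLS_CONVERT",
    "05_GSTAMA_POLYACLEANUP", "07_SAMTOOLS_SORT", "08_GSTAMA_COLLAPSE",
    "09_GSTAMA_FILELIST", "10_GSTAMA_MERGE", "multiqc", "pipeline_info",
]
ALIGNER_DIRS = {
    # Hypothetical keys, used only by this sketch.
    "ultra": ["06.1_GUNZIP", "06.2_ULTRA", "06.3_PERL_BIOPERL"],
    "minimap2": ["06_MINIMAP2"],
}


def missing_output_dirs(results_dir, aligner):
    """Return the expected output directories that are absent from results_dir."""
    expected = COMMON_DIRS + ALIGNER_DIRS[aligner]
    root = Path(results_dir)
    return sorted(name for name in expected if not (root / name).is_dir())
```

An empty list means every directory expected for that aligner is present; it says nothing about whether the files inside are complete.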

CCS

Output files

01_PBCCS/
- <sample>.chunk<X>.bam: The CCS sequences
- <sample>.chunk<X>.bam.pbi: The PacBio index of CCS files
- <sample>.chunk<X>.metrics.json.gz: Statistics for each ZMW
- <sample>.chunk<X>.report.json: General statistics about generated CCS sequences in json format
- <sample>.chunk<X>.report.txt: General statistics about generated CCS sequences in txt format

CCS generates a Circular Consensus Sequence from the subreads of each ZMW. It reports the number of selected and discarded ZMWs and the reasons for discarding them.
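
Most outputs on this page follow the `<sample>.chunk<X>` naming scheme. The sketch below groups CCS BAM files by sample so that missing chunks are easy to spot. It assumes that `<X>` is an integer, which this page does not state explicitly, and `group_chunks` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumption: <X> is a non-negative integer chunk number.
CHUNK_BAM = re.compile(r"^(?P<sample>.+)\.chunk(?P<chunk>\d+)\.bam$")


def group_chunks(filenames):
    """Map each sample name to the sorted list of chunk numbers found for it."""
    groups = defaultdict(list)
    for name in filenames:
        match = CHUNK_BAM.match(name)
        if match:  # index files (.bam.pbi) and reports are skipped
            groups[match.group("sample")].append(int(match.group("chunk")))
    return {sample: sorted(chunks) for sample, chunks in groups.items()}
```

For example, passing the file names found in `01_PBCCS/` returns one entry per sample with the chunk numbers that produced a BAM file.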

LIMA

Output files

02_LIMA/
- <sample>.chunk<X>_flnc.json: Metadata about the generated XML file
- <sample>.chunk<X>_flnc.lima.clips: Clipped sequences
- <sample>.chunk<X>_flnc.lima.counts: Statistics about detected primer pairs
- <sample>.chunk<X>_flnc.lima.guess: Statistics about detected primer pairs
- <sample>.chunk<X>_flnc.lima.report: Detailed statistics on primer pairs for each sequence
- <sample>.chunk<X>_flnc.lima.summary: General statistics about selected and rejected sequences
- <sample>.chunk<X>_flnc.primer_5p--primer_3p.bam: Selected sequences
- <sample>.chunk<X>_flnc.primer_5p--primer_3p.bam.pbi: PacBio index of selected sequences
- <sample>.chunk<X>_flnc.primer_5p--primer_3p.consensusreadset.xml: Selected sequences metadata

LIMA cleans the generated CCS reads. It selects sequences containing a valid pair of primers and removes the primers.

ISOSEQ REFINE

Output files

03_ISOSEQ3_REFINE/
- <sample>.chunk<X>.bam: Selected sequences
- <sample>.chunk<X>.bam.pbi: PacBio index of selected sequences
- <sample>.chunk<X>.consensusreadset.xml: Metadata
- <sample>.chunk<X>.filter_summary.json: Number of Full Length, Full Length Non Chimeric, Full Length Non Chimeric PolyA
- <sample>.chunk<X>.report.csv: Primers and insert length of each read

ISOSEQ REFINE discards chimeric reads.

BAMTOOLS CONVERT

Output files

04_BAMTOOLS_CONVERT/
- <sample>.chunk<X>.fasta: The reads in fasta format.

BAMTOOLS CONVERT converts reads from BAM format to FASTA format.
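
The FASTA files produced here are plain text, so they can be inspected without extra tooling. The following sketch summarises read lengths; `read_fasta` and `length_summary` are hypothetical helper names, not functions provided by the pipeline or by bamtools.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_summary(handle):
    """Return the read count and the min, max and mean read length."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    if not lengths:
        return {"reads": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "reads": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }
```

Calling `length_summary(open(path))` on one of the `<sample>.chunk<X>.fasta` files gives a rough picture of the FLNC length distribution for that chunk.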

TAMA POLYA CLEAN UP

Output files

05_GSTAMA_POLYACLEANUP/
- <sample>.chunk<X>_tama.fa.gz: The polyA tail free reads.
- <sample>.chunk<X>_polya_flnc_report.txt.gz: Length of removed tails.
- <sample>.chunk<X>_tama_tails.fa.gz: Sequence of removed tails.

GSTAMA_POLYACLEANUP (TAMA polyA cleanup) removes polyA tails from the selected reads.
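
To illustrate how the three outputs relate (trimmed read, tail length, tail sequence), here is a deliberately simple suffix-trimming sketch. It is not TAMA's algorithm: it only strips an uninterrupted trailing run of A's, and the `min_tail` threshold of 10 is an arbitrary assumption chosen for the example.

```python
def trim_polya(seq, min_tail=10):
    """Split seq into (body, tail), where tail is a trailing run of A's.

    Runs shorter than min_tail are left in place and an empty tail is returned.
    This is a simplification for illustration, not TAMA's method.
    """
    body = seq.rstrip("Aa")
    tail = seq[len(body):]
    if len(tail) < min_tail:
        return seq, ""
    return body, tail
```

In terms of the files above, `body` corresponds to what ends up in the trimmed FASTA, `len(tail)` to the reported tail length, and `tail` to the tail sequence.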

GUNZIP

Output files

06.1_GUNZIP/
- <sample>.chunk<X>_tama.fa: The polyA tail free reads uncompressed.

GUNZIP decompresses the FLNCs so they can be aligned with uLTRA (uLTRA does not handle gzip input yet).
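
The same decompression can be reproduced outside the pipeline with the standard library, for example to inspect a single chunk. This is a minimal sketch; the `gunzip` function below is a hypothetical helper and not the module the pipeline runs.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress src (a .gz file) to dest and return the destination path.

    When dest is omitted, the trailing .gz suffix is dropped from src.
    """
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams the data, so large files are fine
    return dest
```

For instance, `gunzip("s1.chunk1_tama.fa.gz")` would write `s1.chunk1_tama.fa` next to the compressed file (the sample name here is made up).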

ULTRA or MINIMAP2

Output files

06.2_ULTRA/ or 06_MINIMAP2/
- <sample>.chunk<X>.sam: The aligned reads.

MINIMAP2 or uLTRA aligns the reads to the genome.

BIOPERL

Output files

06.3_PERL_BIOPERL/
- <sample>.chunk<X>_filtered.sam: The aligned reads with spurious alignments removed.

BIOPERL removes spurious alignments produced by uLTRA. In some alignments the CIGAR string has a spurious gap (N). This can happen when the annotation is a GFF file converted to a GTF file. See Issue #11 in the uLTRA repository.
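
The kind of filtering described here can be sketched as a pass over SAM lines that inspects the CIGAR field (column 6). The exact criterion the pipeline applies is not given on this page, so the predicate below, which flags a gap (N) at either end of the alignment, is purely a hypothetical stand-in; all function names are likewise illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples, or [] for an unset CIGAR ('*')."""
    if cigar == "*":
        return []
    ops = CIGAR_OP.findall(cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def has_terminal_gap(cigar):
    """Hypothetical criterion: first or last non-clipping operation is a gap (N)."""
    ops = [op for _, op in parse_cigar(cigar) if op not in "SH"]
    return bool(ops) and (ops[0] == "N" or ops[-1] == "N")


def filter_sam_lines(lines, is_spurious=has_terminal_gap):
    """Yield header lines and every alignment line not flagged by is_spurious."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and is_spurious(fields[5]):
            continue
        yield line
```

Passing a different `is_spurious` callable makes it possible to try another criterion without touching the line handling.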

SAMTOOLS SORT

Output files

07_SAMTOOLS_SORT/
- <sample>.chunk<X>_sorted.bam: The sorted aligned reads.

SAMTOOLS SORT sorts the aligned reads and converts the SAM file to a BAM file.

TAMA COLLAPSE

Output files

08_GSTAMA_COLLAPSE/
- <sample>.chunk<X>_collapsed.bed: This is a bed12 format file containing the final collapsed version of your transcriptome
- <sample>.chunk<X>_local_density_error.txt: This file contains the log of filtering for local density error around the splice junctions
- <sample>.chunk<X>_polya.txt: This file contains the reads with potential poly A truncation
- <sample>.chunk<X>_read.txt: This file contains information for all mapped reads from the input SAM/BAM file.
- <sample>.chunk<X>_strand_check.txt: This file shows instances where the SAM flag strand information conflicts with the GMAP strand information.
- <sample>.chunk<X>_trans_read.bed: This file uses bed12 format to show the transcript model for each read based on the mapping prior to collapsing.
- <sample>.chunk<X>_trans_report.txt: This file contains collapsing information for each transcript
- <sample>.chunk<X>_varcov.txt: This file contains the coverage information for each variant detected.
- <sample>.chunk<X>_variants.txt: This file contains the variants called

TAMA COLLAPSE collapses redundant transcript models in your Iso-Seq data.
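
The collapsed and per-read BED files use the standard bed12 layout, in which exons are stored as block sizes and block starts relative to the transcript start. The sketch below turns one bed12 line into a transcript model with genomic exon coordinates. `TranscriptModel` and `parse_bed12` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptModel:
    chrom: str
    start: int
    end: int
    name: str
    strand: str
    exons: list  # (start, end) pairs, 0-based half-open genomic coordinates


def parse_bed12(line):
    """Parse one tab-separated bed12 line into a TranscriptModel."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated fields")
    start, end = int(fields[1]), int(fields[2])
    block_count = int(fields[9])
    # Block lists may carry a trailing comma, so empty items are dropped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(offsets) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    exons = [(start + off, start + off + size) for size, off in zip(sizes, offsets)]
    return TranscriptModel(fields[0], start, end, fields[3], fields[5], exons)
```

Counting `len(model.exons)` per line is a simple way to compare exon numbers before and after collapsing.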

TAMA FILE LIST

Output files

09_GSTAMA_FILELIST/
- <sample>.tsv: A tsv listing bed files to merge with TAMA merge

TAMA FILELIST is an in-house script that generates the input file list for TAMA merge.
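
A file list of this kind can be sketched in a few lines. The column layout used below (BED path, cap status, merge priority, source name) and the default values `capped` and `1,1,1` are assumptions based on the usual TAMA merge input format; the pipeline's own script may choose different values, and `build_filelist` is a hypothetical name.

```python
from pathlib import Path

SUFFIX = "_collapsed.bed"


def build_filelist(bed_paths, cap="capped", priority="1,1,1"):
    """Return tab-separated file list text with one line per BED file.

    Assumed columns: path, cap status, merge priority, source name.
    """
    rows = []
    for bed in sorted(str(p) for p in bed_paths):
        name = Path(bed).name
        # Use the file name without the collapse suffix as the source label.
        source = name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name
        rows.append("\t".join([bed, cap, priority, source]))
    return "\n".join(rows) + "\n" if rows else ""
```

Writing the returned text to `<sample>.tsv` would give one line per collapsed chunk of that sample.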

TAMA MERGE

Output files

10_GSTAMA_MERGE/
- <sample>.bed: This is the main merged annotation file.
- <sample>_gene_report.txt: This contains a report of the genes from the merged file.
- <sample>_merge.txt: This is a bed12 format file which shows the coordinates of each input transcript matched to the merged transcript ID.
- <sample>_trans_report.txt: This contains the source information for each merged transcript.

TAMA MERGE merges multiple transcriptomes while maintaining source information.

MultiQC

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
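
For troubleshooting, the trace file can also be summarised programmatically. The sketch below counts tasks per status; it assumes the trace is tab-separated with a header row containing a `status` column, as in Nextflow's default trace layout, and `status_counts` is a hypothetical helper name.

```python
import csv
from collections import Counter


def status_counts(handle):
    """Count tasks per status in a tab-separated Nextflow trace file.

    Assumes a header row with a 'status' column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["status"] for row in reader)
```

Opening `pipeline_info/execution_trace.txt` and passing the handle to `status_counts` gives a quick overview of how many tasks completed, failed or were cached.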
