nf-core/airrflow
B-cell and T-cell Adaptive Immune Receptor Repertoire (AIRR) sequencing analysis pipeline using the Immcantation framework
Version 1.2.0. The latest stable release is 4.1.0.
nf-core/bcellmagic: Output
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- FastQC - read quality control
- Preprocessing
- Filter sequence quality - Filter sequences by quality.
- Mask primers - Mask primers.
- Pair mates - Pair sequence mates.
- Cluster sets - Cluster sequences according to similarity.
- Build consensus - Build UMI consensus.
- Re-pair mates - Re-pair sequence mates.
- Assemble mates - Assemble sequence mates.
- Remove duplicates - Remove read duplicates.
- Filter sequences for at least 2 representative - Filter out sequences that do not have at least 2 reads assigned.
- IgBlast
- Genotyping
- Defining clones - Defining clonal B-cell populations
- Reconstructing germlines - Reconstruct gene calls of germline sequences
- Clonal analysis - Clonal analysis.
- Repertoire comparison - Repertoire comparison.
- Log parsing - Log parsing.
- MultiQC - MultiQC
FastQC
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the trim_galore directory.
Output directory: results/fastqc
sample_fastqc.html
- FastQC report, containing quality metrics for your untrimmed raw fastq files
zips/sample_fastqc.zip
- zip file containing the FastQC report, tab-delimited data file and plot images
Preprocessing
Filter sequence quality
Removes reads that fall below a quality threshold, using the tool FilterSeq from the Presto Immcantation toolset. The default quality threshold is 20.
Output directory: results/preprocessing/filter_by_sequence_quality
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
*.tab
- table containing read ID and quality.
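As a rough illustration of what a quality filter with a threshold of 20 does, the sketch below keeps reads whose mean Phred score is at least the threshold. It assumes Phred+33 encoded quality strings and a mean-based criterion; the function names and example reads are hypothetical and this is not the FilterSeq implementation.

```python
from statistics import mean


def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return mean(ord(ch) - offset for ch in quality_string)


def filter_by_quality(reads, threshold=20):
    """Keep (read_id, sequence, quality_string) tuples with mean quality >= threshold."""
    return [read for read in reads if mean_phred(read[2]) >= threshold]


# Hypothetical example reads, not pipeline output.
reads = [
    ("read1", "ACGT", "IIII"),  # 'I' encodes Phred 40
    ("read2", "ACGT", "####"),  # '#' encodes Phred 2
]
kept = filter_by_quality(reads)
print([read[0] for read in kept])
```

Only `read1` passes here, since its mean quality (40) is above the default threshold while `read2` (mean 2) is not.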
Mask primers
Masks primers that are provided in the C-primers and V-primers input files. It uses the tool MaskPrimers of the Presto Immcantation toolset.
Output directory: results/preprocessing/mask_primers
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
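The idea of primer masking can be sketched as follows, assuming an exact match at the start of the read and replacement of the primer bases with `N`. The real MaskPrimers tool supports mismatches and several modes; the primer names, sequences and function name here are made up for illustration.

```python
def mask_primer(sequence, primers):
    """Mask the first primer found at the start of `sequence` with N characters.

    `primers` maps a primer name to its sequence. Returns (primer_name, masked_sequence),
    or (None, sequence) when no primer matches. Exact prefix matching only.
    """
    for name, primer in primers.items():
        if sequence.startswith(primer):
            return name, "N" * len(primer) + sequence[len(primer):]
    return None, sequence


# Hypothetical primers and read.
primers = {"primer_A": "GGAAG", "primer_B": "CCTTC"}
name, masked = mask_primer("CCTTCACGTACGT", primers)
print(name, masked)
```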
Pair mates
Pair read mates using PairSeq from the Presto Immcantation toolset.
Output directory: results/preprocessing/pair_sequences
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
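Conceptually, pairing mates means matching the two reads of a pair by a shared identifier and dropping reads whose mate is missing. A minimal sketch, assuming reads are held in dictionaries keyed by read ID (the data and function name are hypothetical):

```python
def pair_mates(reads_1, reads_2):
    """Return (read_id, seq_1, seq_2) for IDs present in both inputs, in reads_1 order."""
    return [
        (read_id, seq, reads_2[read_id])
        for read_id, seq in reads_1.items()
        if read_id in reads_2
    ]


# Hypothetical reads: "r2" has no mate in the second file and is dropped.
reads_1 = {"r1": "ACGT", "r2": "GGCC", "r3": "TTAA"}
reads_2 = {"r3": "CCGG", "r1": "TGCA"}
pairs = pair_mates(reads_1, reads_2)
print(pairs)
```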
Cluster sets
Cluster sequences according to similarity, using ClusterSets. This step is introduced to deal with insufficient UMI diversity.
Output directory: results/preprocessing/cluster_sets
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
Build UMI consensus
Build a consensus sequence from all sequences that were annotated with the same UMI, using BuildConsensus.
Output directory: results/preprocessing/build_consensus
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
*.tab
- Parsed log containing the sequence barcodes and primers info
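The consensus step can be pictured as grouping reads by UMI and taking the most common base at each position. The sketch below assumes reads within a UMI group are already aligned and of similar length (positions beyond the shortest read are ignored); it is a simplification and not the BuildConsensus algorithm, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict


def build_umi_consensus(reads):
    """Build a majority-vote consensus per UMI.

    `reads` is an iterable of (umi, sequence) tuples. Returns {umi: consensus}.
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        # zip(*seqs) yields one column of bases per position,
        # truncated to the shortest sequence in the group.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


# Hypothetical reads: one read in the first UMI group carries an error at position 3.
reads = [
    ("AAAC", "ACGT"),
    ("AAAC", "ACGT"),
    ("AAAC", "ACTT"),
    ("GGGT", "TTTT"),
]
result = build_umi_consensus(reads)
print(result)
```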
Re-pair mates
Re-pair read mates using PairSeq from the Presto Immcantation toolset.
Output directory: results/preprocessing/repair_mates
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
Assemble mates
Assemble read mates using AssemblePairs from the Presto Immcantation toolset.
Output directory: results/preprocessing/assemble_pairs
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
sample_assemble_pairs_logs.tab
- Parsed log containing the sequence barcodes and assemble pairs.
Remove duplicates
Remove duplicates using CollapseSeq from the Presto Immcantation toolset.
Output directory: results/preprocessing/deduplicate
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
sample_deduplicate_logs.tab
- Parsed log containing the sequence barcodes and deduplicated pairs.
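Duplicate removal amounts to collapsing identical sequences into one record while remembering how many reads supported it. A minimal sketch, assuming exact string identity defines a duplicate (CollapseSeq itself has additional options not shown here; the function name and data are hypothetical):

```python
from collections import Counter


def collapse_duplicates(sequences):
    """Collapse identical sequences, returning (sequence, count) in first-seen order."""
    return list(Counter(sequences).items())


# Hypothetical input sequences.
collapsed = collapse_duplicates(["ACGT", "ACGT", "TTGA", "ACGT"])
print(collapsed)
```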
Filter sequences for at least 2 representative
Remove sequences that do not have at least 2 representative reads, using SplitSeq from the Presto Immcantation toolset.
Output directory: results/preprocessing/filter_representative_2
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
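This filter can be thought of as a simple threshold on the number of reads supporting each sequence. In the sketch below, the `read_count` field name is a placeholder chosen for illustration, not the annotation name used by the pipeline.

```python
def filter_min_reads(records, min_reads=2):
    """Keep records supported by at least `min_reads` reads.

    Each record is a dict with a hypothetical "read_count" field.
    """
    return [record for record in records if record["read_count"] >= min_reads]


# Hypothetical records.
records = [
    {"id": "seq1", "read_count": 5},
    {"id": "seq2", "read_count": 1},
    {"id": "seq3", "read_count": 2},
]
passed = filter_min_reads(records)
print([record["id"] for record in passed])
```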
IgBlast
Assign genes from the IgBlast database using AssignGenes and generate a table with MakeDB. Non-functional sequences are removed with ParseDb. Sequences are additionally converted to a fasta file with the ConvertDb tool.
Output directory: results/preprocessing/igblast
For each analyzed sample there is a subfolder containing:
sample_command_log.txt
- Log of the process that will be parsed to generate a report.
fasta/*.fasta
- Blast results converted to fasta format with genotype V-call annotated in the header.
table/*.tab
- Table in ChangeO format containing the assigned gene information and metadata provided in the starting metadata sheet.
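Because these tables are tab-delimited, they can be inspected with the Python standard library. The sketch below counts V gene calls from an in-memory example; the rows are made up, and the `SEQUENCE_ID` and `V_CALL` column names are assumptions about the table header that should be checked against your own files.

```python
import csv
import io
from collections import Counter

# Made-up example content standing in for a table/*.tab file.
example_table = (
    "SEQUENCE_ID\tV_CALL\n"
    "seq1\tIGHV1-2*02\n"
    "seq2\tIGHV3-23*01\n"
    "seq3\tIGHV1-2*02\n"
)


def count_calls(handle, column="V_CALL"):
    """Count the values of one column in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


counts = count_calls(io.StringIO(example_table))
print(counts.most_common())
```

For a file on disk, the same function can be called with `open(path, newline="")` instead of the `io.StringIO` object.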
Determining genotype and hamming distance threshold
Determining the genotype and the Hamming distance threshold of the junction regions for clonal determination using TIgGER and Shazam.
Output directory: results/genotyping
For each subject (patient) there is a subfolder containing:
threshold.txt
- Hamming distance threshold of the Junction regions as determined by Shazam.
Hamming_distance_threshold.pdf
- Plot of the Hamming distance distribution between junction regions displaying the threshold for clonal assignment as determined by Shazam.
genotype.pdf
- Plot representing the patient genotype assessed by TIgGER.
igh_genotyped.tab
- Table in ChangeO format additionally containing the assigned genotype in V_CALL_GENOTYPED.
v_genotype.fasta
- Fasta file containing the full sequences for all V genes assigned to the patient.
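To make the role of the Hamming distance threshold concrete, the sketch below computes a length-normalized Hamming distance between two junction sequences and compares it with a threshold. The junction sequences and the threshold value are invented for illustration, and normalizing by length is an assumption of this sketch; the actual value for a subject is the one reported in threshold.txt.

```python
def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical junction sequences and threshold.
junction_1 = "TGTGCGAGA"
junction_2 = "TGTGCGAAA"
threshold = 0.15

distance = normalized_hamming(junction_1, junction_2)
print(round(distance, 3), distance <= threshold)
```

Here the two junctions differ at one of nine positions, so they fall below the illustrative threshold and would be candidates for the same clone.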
Clone assignment
Assigning clones to the sequences obtained from IgBlast with the DefineClones Immcantation tool.
Output directory: results/clone_assignment
For each subject (patient) there is a subfolder containing:
command_log.txt
- Log of the process that will be parsed to generate a report.
igh_genotyped_clone-pass.tab
- Table in ChangeO format containing the assigned gene information and an additional field with the clone number.
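One way to picture clone assignment is as single-linkage clustering of junction sequences: two sequences end up in the same clone if they are connected by a chain of pairs whose normalized Hamming distance is at or below the threshold. The sketch below is a deliberate simplification that only compares junctions of equal length and ignores gene calls; it is not the DefineClones implementation, and the data and names are hypothetical.

```python
def assign_clones(junctions, threshold):
    """Return a clone number (starting at 1) for each junction, by single linkage."""
    n = len(junctions)
    parent = list(range(n))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            a, b = junctions[i], junctions[j]
            if a and len(a) == len(b):
                distance = sum(x != y for x, y in zip(a, b)) / len(a)
                if distance <= threshold:
                    parent[find(j)] = find(i)

    # Relabel cluster roots as consecutive clone numbers in first-seen order.
    labels = {}
    clones = []
    for i in range(n):
        root = find(i)
        if root not in labels:
            labels[root] = len(labels) + 1
        clones.append(labels[root])
    return clones


# Hypothetical junctions and threshold.
junctions = ["TGTGCGAGA", "TGTGCGAAA", "TGTACCTTT", "TGTGCG"]
clones = assign_clones(junctions, threshold=0.15)
print(clones)
```

The first two junctions differ at a single position and share a clone number; the third is too distant and the fourth has a different length, so each gets its own clone.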
Reconstructing germlines
Reconstructing the germline sequences with the CreateGermlines Immcantation tool.
Output directory: results/germlines
For each subject (patient) there is a subfolder containing:
command_log.txt
- Log of the process that will be parsed to generate a report.
table/igh_genotyped_clone-pass_germ-pass.tab
- Table in ChangeO format containing the assigned gene information and an additional field with the germline reconstructed gene calls.
Clonal analysis
Reconstructing clonal lineages with the Alakazam R package from the Immcantation toolset. Calculating and plotting several clone statistics.
Output directory: results/clonal_analysis
For each subject (patient) there is a subfolder containing the processed sequence information (ChangeO format) for all sequences of that subject, and a clonal_analysis.zip file which, when uncompressed, contains:
Clone_lineage/
- Clones_table_patient.tsv: contains a summary of the clones found for the patient, and the number of unique and total sequences identified in each clone.
- Clone_tree_plots: contains a rooted graphical representation of each of the clones.
- Clone_lineage: contains the plots exported in GraphML format.
- All_graphs_patient.graphml: contains all graphs for that patient.
Clone_numbers/
- Number of clones and number of sequences per clone, patient-wise and cell population wise.
Clone_overlap/
- Plots for representing the clone overlap in number of clones and number of sequences between different time-points and cell populations of one patient.
Repertoire comparison
Calculation of several repertoire characteristics (diversity, abundance) for comparison between patients, time points and cell populations.
Output directory: results/repertoire_comparison
patient_tables/
- ChangeO format table containing the processed sequence information for all subjects.
repertoire_comparison.zip
- Contains the repertoire comparison results: Abundance, Diversity, Isotype, Mutational_load and V-family tables and plots. Comparison between treatments and subjects.
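As an illustration of what a diversity measure captures, the sketch below computes the Hill diversity of order q from a list of clone sizes. Hill numbers are one common family of diversity indices; whether the tables in repertoire_comparison.zip use exactly this formulation is an assumption, and the clone sizes are invented.

```python
import math


def hill_diversity(clone_sizes, q):
    """Hill diversity of order q for a list of positive clone sizes.

    q = 0 gives richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson index.
    """
    sizes = [size for size in clone_sizes if size > 0]
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one clone with a positive size is required")
    proportions = [size / total for size in sizes]
    if q == 1:
        # Limit of the general formula as q approaches 1.
        return math.exp(-sum(p * math.log(p) for p in proportions))
    return sum(p ** q for p in proportions) ** (1 / (1 - q))


# Hypothetical repertoires: one even, one dominated by a single clone.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
for q in (0, 1, 2):
    print(q, round(hill_diversity(even, q), 3), round(hill_diversity(skewed, q), 3))
```

Both repertoires contain four clones (q = 0), but at higher orders the skewed repertoire scores much lower because a single clone dominates it.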
Log parsing
Parsing the logs from the previous processes.
Output directory: results/parsing_logs
- A table summarizing the number of sequences after the most important steps.
MultiQC
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available within the report data directory.
The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
Output directory: results/multiqc
Project_multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser.
multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline.
multiqc_plots/
- Directory containing plots shown in the MultiQC report.
For more information about how to use MultiQC reports, see http://multiqc.info