Introduction
ComputationalRegulatoryGenomicsICL/customcageq is a Nextflow pipeline that processes CAGE sequencing data from raw reads to the identification of consensus clusters and enhancers. The pipeline is designed for assessing the quality of CAGE data and for calling promoters and enhancers with sensible default values that can later be fine-tuned as needed. It also supports re-analysis of the reads with updated parameters.
Input
Parameter file
Customizing the analysis requires about 40 different parameters, all of which can be defined in a single params.yaml file.
An example is shown below.
# pipeline parameters
fullpipeline: true
maponly: false
cageronly: false
gtf: "testdata/sacCer3_genome/sacCer3.ensGene.gtf"
# preprocessing parameters
input: "docs/examples/samplesheet_sacer_pe.csv"
infolder:
sample_name_fields:
# mapping parameters
genome_name: "sacCer3"
fasta: "testdata/sacCer3_genome/sacCer3.fa"
index: "testdata/sacCer3_genome/sacCer3_star_index/"
seq_platform: "illumina"
seq_center: false
# whether to take the uniquely mapped only
unique_only: true
# whether to remove reads that do not start with G
remove_non_g: false
# CAGEr parameters
cager_sample_file: "docs/examples/sample_list.csv" # with sorted list of bigwigs, if mapping is run elsewhere
# BSgenome
forgeseed:
sourcedir:
bsgenome: "BSgenome.Scerevisiae.UCSC.sacCer3"
# parameter for correlation calculation
corrplot_tagCountThreshold: 1
# parameters for normalization
norm_range_min: 5
norm_range_max: 10000
norm_method: "powerLaw"
alpha:
t_norm: 1000000
# parameters for tag clustering
sample_num_thr: 1
ctss_thr: 1
distclu_maxDist: 20
keepSingletonsAbove: 5
iq_low: 0.1
iq_high: 0.9
# plotting for tagclusters QC
iqw_tpm_threshold: 3
tssregion_up: -3000
tssregion_down: 3000
tsslogo_upstream: 35
# parameters for consensus clusters
consensus_thr: 2
consensus_dist: 100
# parameters for enhancer calling
cfBalanceThreshold: 0.95
unexpressed: 0
minSamples: 0

The pipeline parameters that should be provided in all runs are:
- fullpipeline, maponly, and cageronly define the mode of the pipeline run. With the fullpipeline option, every step will run. maponly stops at creating bigwig and/or bam files, as well as a MultiQC report. cageronly starts from bigwig or bam files and finishes with a CAGEr report.
- gtf specifies a GTF file with a whole-genome annotation, used to create a STAR index (if a FASTA file is provided) and/or to automatically create a TxDb annotation object to annotate tag clusters.
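As an illustrative sketch (not taken from the pipeline documentation), the mode flags can also be overridden on the command line, since command-line parameters take precedence over values in the params file. For example, to reuse an existing params file but stop after mapping:

# hedged example: switch the mode flags from the command line to stop after mapping
nextflow run customcageq/main.nf \
-params-file params.yaml \
--fullpipeline false \
--maponly true \
-profile docker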
The parameters specific to mapping can be left empty when running in cageronly mode:
- input specifies the input CSV samplesheet. This option is mutually exclusive with infolder.
- infolder specifies the input directory with FASTQ files (stored together for all samples or located in per-sample subdirectories). This option is mutually exclusive with input, and may be used together with sample_name_fields.
- sample_name_fields is a supporting parameter for infolder in case your sample name contains underscore(s). By default, only the part of the string before the first underscore is taken as the sample name. If the sample name has more parts, as in my_sample_name_S1_L001_R1_001.fastq.gz, this parameter specifies how many underscore-separated fields the sample name spans in the filename. In the my_sample_name_S1_L001_R1_001.fastq.gz example, this parameter should be set to 3.
- genome_name specifies the name of the reference genome. It is used as meta information.
- fasta specifies a FASTA file containing the reference genome. This option is mandatory unless index is set.
- index specifies a directory with a genome index (bowtie2 or STAR). This option is mandatory unless fasta is set.
- seq_platform specifies the sequencing platform used. Required for mapping with STAR.
- seq_center specifies the name of the sequencing center. Required for mapping with STAR.
- unique_only specifies whether only uniquely mapped reads are considered for downstream analysis. Required for mapping with STAR; not considered when using bowtie2.
- remove_non_g specifies whether to keep only those reads that start with a G base, as expected after the CAGE protocol. This step is expected to remove about 15-20% of the reads, which are likely non-canonical initiators but might take part in other biological processes.
- params_trimgalore specifies additional parameters that can be passed to TrimGalore!
The parameters specific to CAGEr and CAGEfightR analysis can be left empty when running in maponly mode (a minimal cageronly example follows this list):
- cager_sample_file specifies the input CSV samplesheet including the name of the samples, their pairedness status, and the location of bigwig or bam files. Optionally, a fourth column specifies which samples should be removed, merged, or kept as is. This is achieved by checking whether the field is empty (remove), its content is unique (keep as is), or shared with another sample (merge).
- datatype specifies the type of mapped input in the CSV samplesheet: bam or bigwig is accepted. Default is bigwig.
- forgeseed specifies a seed file for BSgenome forging (see the Advanced BSgenomeForge usage vignette for details). The seed file should not contain the seqs_srcdir field (instead, the absolute or relative path to the source directory is set with the sourcedir option, see below). This option requires sourcedir and is mutually exclusive with bsgenome.
- sourcedir specifies a directory containing either a set of FASTA files, one per reference chromosome, or a 2bit file for the whole reference genome. See the Advanced BSgenomeForge usage vignette for details. The seed file should be written according to the contents of this directory. This option requires forgeseed and is mutually exclusive with bsgenome.
- bsgenome specifies the BSgenome R package to use. If it is a file name (which should have a full path and the .tar.gz extension), the package will be taken from the specified location; otherwise, the pipeline will try to install a BSgenome R package with that name on the fly (see examples below). This option is mutually exclusive with forgeseed and sourcedir.
- corrplot_tagCountThreshold is a threshold above which (raw and normalized) CTSSs are considered for the correlation plot.
- norm_method is the method used for normalizing the samples. Options are simpleTpm to convert tag counts to tags per million, powerLaw to normalize to a reference power-law distribution, or none to keep using the raw tag counts in downstream analyses. Case sensitive.
- norm_range_min and norm_range_max define the lower and upper thresholds for fitting the power-law distribution and calculating the slope for normalization. Only used when norm_method is powerLaw.
- alpha is a user-specified alpha, the -1 * fitted slope in the log-log representation of the power-law distribution. If not provided, the average across samples is calculated and used. Considered for powerLaw normalization only. Note: this value cannot be less than 1.05; if the user provides a lower value, the code automatically assumes 1.05.
- t_norm is the total number of CAGE tags in the reference power-law distribution. Setting T = 10^6 results in normalized values that correspond to tags per million. Considered for powerLaw normalization only.
- sample_num_thr and ctss_thr are parameters for filtering lowly expressed CTSSs before clustering. ctss_thr specifies the lower threshold above which CTSSs are considered, and sample_num_thr specifies the number of samples in which this threshold should be passed.
- distclu_maxDist specifies the maximum distance for distance-based clustering (distclu).
- keepSingletonsAbove defines the TPM threshold above which even a single CTSS is kept during clustering.
- iq_low and iq_high define the lower and upper quantile boundaries of the interquartile range within which the majority of the signal lies.
- iqw_tpm_threshold is a threshold above which tag clusters are considered for the interquartile width distribution plot.
- tssregion_up and tssregion_down are used for annotation with ChIPseeker. They correspond to the upstream and downstream distances considered part of the TSS region for ChIPseeker annotation. tssregion_up should be negative, tssregion_down should be positive.
- tsslogo_upstream is used for plotting the TSS logos. This parameter specifies the number of bases to include upstream of the TSS.
- consensus_thr and consensus_dist are used for defining the consensus clusters. consensus_thr specifies the TPM threshold above which tag clusters are considered for consensus clusters, and consensus_dist defines the maximum distance between the interquartile ranges of tag clusters to be joined together into consensus clusters.
- cfBalanceThreshold is used for enhancer calling. It defines the balance threshold above which bidirectionality is considered balanced.
- unexpressed and minSamples are used for selecting only supported enhancers. unexpressed is a non-inclusive lower TPM boundary for expression when calculating the support of enhancers. minSamples is a non-inclusive lower boundary for the number of samples in which the clusters should show bidirectionality.
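Putting the CAGEr-specific inputs together, a minimal cageronly params file might look like the sketch below. The paths are placeholders, and whether the remaining CAGEr parameters may be omitted entirely or should be kept with the defaults from the full template above is an assumption; the full template is the safe starting point.

# a minimal, hypothetical params file for a cageronly run
cat > params_cageronly.yaml <<'EOF'
fullpipeline: false
maponly: false
cageronly: true
gtf: "/path/to/annotation.gtf"
cager_sample_file: "/path/to/sample_list.csv"
datatype: "bigwig"
bsgenome: "BSgenome.Scerevisiae.UCSC.sacCer3"
EOF

nextflow run customcageq/main.nf -params-file params_cageronly.yaml -profile docker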
Additional arguments can be:
- params-trimgalore specifies any options that can be passed to TrimGalore!. This option is useful for any non-standard read processing (for example, for CAGEscan reads that require the removal of a fixed number of nucleotides from the 5’-ends of the forward and reverse reads (Bertin et al., 2017)). The string with the parameters for TrimGalore! must be surrounded by single quotes.
- nogtrim makes the pipeline skip the G-trimming step. This option is useful for processing non-CAGE data (for example, CAGEscan reads, which do not seem to require trimming of a 5’-G (Bertin et al., 2017)). This option can be used together with params-trimgalore (see an example below).
- bowtie2 switches the aligner from STAR to bowtie2. This option is compatible with either index or fasta.
- dedup switches on PCR duplicate removal (not shown in the pipeline map above; switched off by default).
- dist L sets an optical duplicate distance L to remove optical duplicates, in addition to PCR duplicates (see samtools markdup, option -d). This option requires dedup.
Note: All pipeline options may be provided to the nextflow command starting with a double dash (--). All Nextflow options start with a single dash (-).
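For example, a re-analysis that only changes a clustering parameter could look like the hedged sketch below, where --distclu_maxDist is a pipeline option (double dash) and -resume is a Nextflow option (single dash) that reuses cached results where possible:

# hedged example: change one pipeline parameter and resume the cached run
nextflow run customcageq/main.nf \
-params-file params.yaml \
--distclu_maxDist 40 \
-profile docker \
-resume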
Complete pipeline
To run the complete pipeline, starting from raw reads, the input is either single-end (SE) or paired-end (PE) raw CAGE reads. Only one type of reads (either SE or PE) can be used in one run of the pipeline. The user can list read files in a samplesheet or provide a path to a directory containing the files (stored all together or in subdirectories within the provided folder).
Samplesheet
The samplesheet has to be a comma-separated file with 4 columns (sample, fastq_1, fastq_2, single_end) and a header row. A PE example is shown below (this file and a SE version can be found at docs/examples/samplesheet_sacer_[pe/se].csv). It is recommended to use absolute paths for the input files.
If this option is used, the input is defined with the --input parameter.
sample,fastq_1,fastq_2,single_end
S10,testdata/sacCer_fastq/pe/S10_S6_L001_R1_001.fastq.gz,testdata/sacCer_fastq/pe/S10_S6_L001_R2_001.fastq.gz,False
S14,testdata/sacCer_fastq/pe/S14_S8_L001_R1_001.fastq.gz,testdata/sacCer_fastq/pe/S14_S8_L001_R2_001.fastq.gz,False
S14,testdata/sacCer_fastq/pe/S14_S8_L002_R1_001.fastq.gz,testdata/sacCer_fastq/pe/S14_S8_L002_R2_001.fastq.gz,False

where:
- sample is a unique identifier of a sample;
- fastq_1 (and fastq_2 in the case of paired-end reads) is a full path to the read libraries. For paired-end reads, fastq_1 contains the full path to forward reads, while fastq_2 contains the full path to reverse reads. One sample can be represented by more than one library if lanes are stored separately;
- single_end should be set to True for single-end reads and to False for paired-end reads.
See the Samplesheet documentation for additional information on the samplesheet.
Starting from a folder path
If a folder containing FASTQ files is provided, the --infolder parameter should be used.
In this case, the software assumes that the file name follows Illumina naming conventions and has the following structure: <sample name>_<sample number>_L00<lane number>_<R1/R2/1/2>_001.fastq.gz.
From the existence of R2 (or 2) values in the expected position, the software assigns the input to be paired end or single end.
By default, the first field before the first underscore is taken as the sample name.
However, this can be overridden if the sample name itself contains underscores.
In that case, the additional parameter sample_name_fields should be set to the number of underscore-separated parts the sample name has.
Note that if the sample name contains any dash (-), it is converted to an underscore (_) to ensure compatibility with subsequent processes.
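As a hedged sketch (assuming input is left unset, since input and infolder are mutually exclusive), a run starting from a folder of FASTQ files named like my_sample_name_S1_L001_R1_001.fastq.gz could look like this:

# hedged example: sample names span three underscore-separated fields
nextflow run customcageq/main.nf \
-params-file params.yaml \
--infolder /path/to/fastq_folder/ \
--sample_name_fields 3 \
-profile docker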
CAGEr subpipeline
To run the CAGEr analysis from already existing bigwig or bam files, the input is another samplesheet with 4 columns and a header row.
The 4th column, new_name, defines which samples should be merged (provide the same name for those samples), dropped (leave the field empty), or kept as is (match it with the id field).
An example is shown below (this file can be found at docs/examples/samplesheet_cager.csv). It is recommended to use absolute paths for the input files.
id,single_end,path,new_name
S10,false,[testdata/bigwig/S10.Signal.UniqueMultiple.str1.out.wig.bw testdata/bigwig/S10.Signal.UniqueMultiple.str2.out.wig.bw],S10
S14,false,[testdata/bigwig/S14.Signal.UniqueMultiple.str1.out.wig.bw testdata/bigwig/S14.Signal.UniqueMultiple.str2.out.wig.bw],S14

This file may be created in two steps as shown below.
ls /path/to/bigwig_folder/*bw > bigwigs.txt
./bin/sample_list_from_bigwig_list.py --filepath bigwigs.txt --singleend ["true" or "false"]

First, a simple txt file listing all bigwigs is created. Next, the custom script is run to convert the list, which will create a file called sample_list.csv.
This script accepts two additional parameters, --delimiter and --field, specifying whether any part of the filename should be excluded. This might be necessary because a sample name should not start with a number: the filename is propagated as the sample name, and R is sensitive to names starting with a number.
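To illustrate the new_name column, the hypothetical samplesheet below merges S10 and S11 under a shared name, keeps S12 as is, and drops S13. Sample names and paths are placeholders, not part of the test data.

# hypothetical sample_list.csv illustrating merge / keep / drop via new_name
cat > sample_list.csv <<'EOF'
id,single_end,path,new_name
S10,false,[/abs/path/S10.str1.bw /abs/path/S10.str2.bw],S10_S11_merged
S11,false,[/abs/path/S11.str1.bw /abs/path/S11.str2.bw],S10_S11_merged
S12,false,[/abs/path/S12.str1.bw /abs/path/S12.str2.bw],S12
S13,false,[/abs/path/S13.str1.bw /abs/path/S13.str2.bw],
EOF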
Running the pipeline
After you have cloned this repository, the typical command for running the pipeline is as follows:
nextflow run customcageq/main.nf -params-file params.yaml -profile docker

This will launch the pipeline with the docker configuration profile. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
work # Directory containing the nextflow working files
<OUTDIR> # Finished results in specified location (defined with --outdir, default is results/)
.nextflow_log # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.

Additional information
You can change the maximum number of instances of the same process that can run in parallel (by default, 2).
To do this, change the value of the maxForks parameter in conf/base.config.
Limiting this number ensures that the pipeline does not try to obtain all available resources.
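As an alternative sketch to editing conf/base.config directly, a small custom config can be layered on top of the pipeline configuration with Nextflow's -c option (the value 4 is arbitrary); this should take precedence over the value in conf/base.config, although editing that file, as described above, is the documented route:

# hedged example: limit parallel process instances via an extra config file
cat > limit_forks.config <<'EOF'
process {
    maxForks = 4
}
EOF

nextflow run customcageq/main.nf -params-file params.yaml -c limit_forks.config -profile docker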
By default, up to 10 different mapping loci are allowed for each read but only uniquely mapping reads are selected to create bigWig files.
If you would like to work with multimappers (or change mapping parameters in any other way), please amend the STAR mapping options in conf/modules.config and/or set the unique_only pipeline option to false instead of the default true (in which case bigWig files will be created using the full set of alignments, not only unique ones).
However, if multimapping reads are allowed, then bigWig files will contain non-integer counts for positions where multimappers align.
This is due to the fact that STAR splits the count of 1 between all alignments of the same read.
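For example, to keep multimapper-derived signal in the bigWig files (a sketch; all other settings come from the params file), the default can be overridden on the command line:

nextflow run customcageq/main.nf -params-file params.yaml --unique_only false -profile docker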
See the nextflow_usage documentation for Nextflow-specific information.
Examples
Paired-end reads with locally stored STAR index and BSgenome
Call TSSs from the test yeast paired-end CAGE reads using the locally stored test STAR index and the BSgenome.Scerevisiae.UCSC.sacCer3 R package. The package will be automatically downloaded and installed within the CAGEr container on the fly and will be used there with CAGEr. To run this example, the user needs to provide full paths to the test FASTQ files in samplesheet_sacer_pe_template.csv and the path to a “scratch” (or any other convenient) storage space for the Nextflow work directory:
nextflow run customcageq/main.nf \
--bsgenome BSgenome.Scerevisiae.UCSC.sacCer3 \
--gtf customcageq/assets/sacCer3_genome/sacCer3.ensGene.gtf \
--index customcageq/assets/sacCer3_genome/sacCer3_star_index/ \
--input customcageq/assets/samplesheet_sacer_pe_template.csv \
-profile singularity \
-w /path/to/scratch/work

This example represents a typical use case for processing CAGE data from an organism with an available, locally stored, STAR genome index and a corresponding BSgenome package available in Bioconductor. For example, this is a use case for human CAGE data processing with the hg38 or T2T-CHM13 assembly. Instead of the singularity profile, one may use their institution’s Nextflow profile (see publicly available institutional Nextflow profiles).
Single-end reads with locally stored FASTA and GTF files
Call TSSs from the test yeast single-end CAGE reads (stored in per-sample subdirectories of an input directory) using locally stored FASTA and GTF files (and an optional file with splice junctions) for STAR index generation on the fly, as well as a locally stored seed file and a source directory to build a BSgenome R package. The package will be automatically installed within the CAGEr container from the autogenerated .tar.gz archive and used with CAGEr. To run this example, the user needs to provide full paths to the test FASTQ files in samplesheet_sacer_se_template.csv, as well as paths to a locally stored seed file and a source directory:
nextflow run customcageq/main.nf \
--forgeseed /path/to/bsgenome_forging.seed \
--sourcedir /path/to/seqs_srcdir \
--gtf customcageq/assets/sacCer3_genome/sacCer3.ensGene.gtf \
--fasta customcageq/assets/sacCer3_genome/sacCer3.fa \
--infolder customcageq/assets/sacCer_fastq/pe_per_sample/ \
-profile singularity

where the pe_per_sample input directory has the following structure:
customcageq/assets/sacCer_fastq/pe_per_sample/
├── S10
│ ├── S10_S6_L001_R1_001.fastq.gz
│ └── S10_S6_L001_R2_001.fastq.gz
└── S14
├── S14_S8_L001_R1_001.fastq.gz
├── S14_S8_L001_R2_001.fastq.gz
├── S14_S8_L002_R1_001.fastq.gz
└── S14_S8_L002_R2_001.fastq.gz

This example may suit the processing of CAGE data from a new species for which the user has to build (“forge”) a BSgenome package themselves. After forging the BSgenome once, the user can copy the resulting .tar.gz file from results/bsgenome and reuse it in subsequent runs of the pipeline by setting the --bsgenome option, for example: --bsgenome /path/to/bsgenome/BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0.tar.gz.
CAGEscan libraries
Call TSSs from FANTOM5 CAGEscan libraries (see, for example, CAGEscan datasets from human primary cells by FANTOM5). These libraries require trimming of 9 nt from the 5’-ends of the forward reads and of 6 nt from the 5’-ends of the reverse reads and do not seem to require separate G-trimming (Bertin et al., 2017). To run this example, the user needs to generate the fantom5_cagescan_pe.csv input table (see above) and provide a path to it and to the STAR index of the T2T-CHM13 v2.0 human genome assembly:
nextflow run customcageq/main.nf \
--bsgenome BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0 \
--gtf /path/to/chm13_t2t.gtf \
--index /path/to/chm13_t2t_v2.0_star_index \
--params-trimgalore '--clip_R1 9 --clip_R2 6' \
--nogtrim \
--samplesheet /path/to/fantom5_cagescan_pe.csv \
-profile singularity

Not recommended. Paired-end reads with bowtie2 and duplicate removal
Call TSSs from the test yeast paired-end CAGE reads using bowtie2 for read mapping. Additionally, remove PCR duplicates and optical duplicates at a maximum distance 100 (see samtools markdup) before doing alignment QC and TSS calling:
nextflow run customcageq/main.nf \
--bsgenome BSgenome.Scerevisiae.UCSC.sacCer3 \
--gtf customcageq/assets/sacCer3_genome/sacCer3.ensGene.gtf \
--bowtie2 \
--index customcageq/assets/sacCer3_genome/sacCer3_bowtie2_index \
--dedup \
--dist 100 \
--input customcageq/assets/samplesheet_sacer_pe_template.csv \
-profile singularity

This example collects options that are not recommended but retained just in case. Using bowtie2 does not allow accounting for splicing and makes CAGEr use BAM files, which slows the creation of the CAGEexp object considerably. Also, read deduplication is not recommended because CAGE reads, by design, come only from transcripts, with one read coming from the 5’-end of the transcript, which increases the probability of true duplicates, in comparison to whole-genome libraries, like ChIP-seq or ATAC-seq. The --bowtie2 option can also be used with the --fasta option to build a bowtie2 genome index on the fly, while the --dedup and --dist options can be used with the default STAR mapping.