nf-core/pangenome
Renders a collection of sequences into a pangenome graph. https://doi.org/10.1093/bioinformatics/btae609.
Define where the pipeline should find input data and save output data.
Path to BGZIPPED input FASTA to build the pangenome graph from.
string
^\S+\.fn?a(sta)?(\.gz)?$
A FASTA file containing the sequences to build the pangenome graph from. Each sequence can be a full chromosome, a contig, or a very long read. The FASTA file must be BGZIPPED or WFMASH won't be able to process it. If you have your sequences in FASTA format, you can run:
bgzip -@ <THREADS> -c <SEQUENCES.fa> > <SEQUENCES.fa.gz>
samtools faidx <SEQUENCES.fa.gz>
To ensure the best compatibility, please format your sequence identifiers so that they follow the PanSN-spec (https://github.com/pangenome/PanSN-spec).
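As a rough illustration, the file name pattern listed for the input FASTA can be checked before launching a run. This is a minimal sketch: the helper name `is_valid_input_name` is hypothetical and not part of the pipeline, and the check only looks at the file name, so it cannot tell whether a `.gz` file was really compressed with bgzip.

```python
import re

# Pattern copied from the input parameter definition above.
INPUT_PATTERN = re.compile(r"^\S+\.fn?a(sta)?(\.gz)?$")


def is_valid_input_name(path):
    """Return True if the file name matches the input FASTA pattern.

    Hypothetical helper for illustration only.
    """
    return INPUT_PATTERN.match(path) is not None
```

Names such as `seqs.fa.gz`, `seqs.fna` and `seqs.fasta.gz` pass, while names containing whitespace or ending in `.fastq.gz` do not.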
The number of haplotypes in the input FASTA.
number
The constructed graph is defined by the number of mappings per segment of each genome (--n_haplotypes <N> - 1). Ideally, you should set this to equal the number of haplotypes in the pangenome, because that is the maximum number of secondary mappings and alignments that we expect. Keep in mind that the total work of alignment is proportional to N*N, and these multimappings can be highly redundant.
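If the sequence names follow PanSN-spec (sample, haplotype and contig separated by a delimiter), the number of haplotypes can be estimated by counting distinct sample/haplotype pairs. The sketch below assumes `#` as the delimiter; the function names are hypothetical and the sequence names in the usage note are illustrative.

```python
def count_haplotypes(seq_names, delim="#"):
    """Count distinct (sample, haplotype) pairs among PanSN-style names."""
    haplotypes = set()
    for name in seq_names:
        fields = name.split(delim)
        if len(fields) < 3:
            raise ValueError(f"not a PanSN-style name: {name!r}")
        # The first two fields identify the sample and its haplotype.
        haplotypes.add((fields[0], fields[1]))
    return len(haplotypes)


def mappings_per_segment(n_haplotypes):
    """Mappings per segment as described above: n_haplotypes - 1."""
    return n_haplotypes - 1
```

For the names `HG002#1#ctg1`, `HG002#1#ctg2`, `HG002#2#ctg1` and `chm13#1#chr1`, this counts 3 haplotypes, which gives 2 mappings per segment.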
The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.
string
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.
MultiQC report title. Printed as page header, used for filename if not otherwise specified.
string
Options for the all versus all alignment phase.
Percent identity in the wfmash mashmap step.
number
90
Use mash dist or mash triangle to explore the typical level of divergence between the sequences in your input (see https://pggb.readthedocs.io/en/latest/rst/tutorials/divergence_estimation.html#divergence-estimation for more information). Convert this to an approximate percent identity and provide it as --wfmash_map_pct_id <PCT>. A list of examples can be found at https://github.com/pangenome/pggb#example-builds-for-diverse-species.
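The conversion from divergence to percent identity can be sketched as follows. This assumes you have already collected pairwise mash distances as fractions between 0 and 1, and it uses the largest distance as a conservative choice; both the function name and that choice are assumptions made for illustration.

```python
def mash_distances_to_pct_id(distances):
    """Convert the largest pairwise mash distance to an approximate percent identity."""
    distances = list(distances)
    if not distances:
        raise ValueError("need at least one pairwise distance")
    worst = max(distances)
    if min(distances) < 0.0 or worst > 1.0:
        raise ValueError("mash distances are expected to lie in [0, 1]")
    # Percent identity is approximated as 100 * (1 - distance).
    return 100.0 * (1.0 - worst)
```

With illustrative distances of 0.01, 0.05 and 0.1, the result is approximately 90, which could then be passed as --wfmash_map_pct_id 90.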
Segment length for mapping.
string
5000
Crucially, --wfmash_segment_length provides a kind of minimum alignment length filter. The mashmap3 step in wfmash will only consider segments of this size. For small pangenome graphs, or where there are few repeats, --wfmash_segment_length can be set low (for example 500 when building an MHC pangenome graph). However, for larger contexts, with repeats, it can be very important to set this high (for instance 50k in the case of human genomes). A long segment length ensures that we represent long collinear regions of the input sequences in the structure of the graph. In general, this should at least be larger than transposons and other common repeats in your pangenome. A list of examples can be found at https://github.com/pangenome/pggb#example-builds-for-diverse-species.
This parameter must be a combination of the following values: 1, 9, d, kKmMgGtT, 0
Minimum block length filter for mapping.
string
By default, wfmash only keeps mappings that are at least 5 times the segment length. This can be adjusted with --wfmash_block_length <BLOCK_LENGTH>.
This parameter must be a combination of the following values: 1, 9, d, kKmMgGtT, 0
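The relationship between the segment length and the default block length can be sketched as below. The parser is an assumption: it accepts digits with an optional k/m/g/t suffix and treats the suffixes as decimal multipliers (k = 1,000). The function names are hypothetical.

```python
import re

# Assumed grammar: digits followed by an optional unit suffix.
_SIZE = re.compile(r"^(\d+)([kKmMgGtT]?)$")
# Assumed decimal multipliers for the suffixes.
_MULTIPLIER = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(value):
    """Turn a value such as '5000' or '50k' into a number of base pairs."""
    match = _SIZE.match(value.strip())
    if match is None:
        raise ValueError(f"cannot parse size: {value!r}")
    number, suffix = match.groups()
    return int(number) * _MULTIPLIER[suffix.lower()]


def default_block_length(segment_length):
    """Default block length as described above: 5 times the segment length."""
    return 5 * parse_size(segment_length)
```

Under these assumptions, the default segment length of 5000 corresponds to a minimum block length of 25000, and a segment length of 50k to 250000.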
Kmer size for mashmap.
integer
19
Ignore the top % most-frequent kmers.
number
0.001
Keep this fraction of mappings (auto for giant component heuristic).
string
1.0
(auto|[01]\.\d+)
Merge successive mappings.
boolean
Disable splitting of input sequences during mapping.
boolean
Skip mappings between sequences with the same name prefix before the given delimiter character. This can be helpful if several sequences originate from the same chromosome. It is recommended that the sequence names follow the PanSN-spec (https://github.com/pangenome/PanSN-spec). In future versions of the pipeline, sequence names will be required to follow this specification.
string
Set the directory where temporary files should be stored. Since everything runs in containers, we don't usually set this argument.
string
The number of files to generate from the approximate wfmash mappings to scale across a whole cluster. It is recommended to set this to the number of available nodes. If only one machine is available, leave it at 1.
integer
1
A major advantage of this Nextflow version of the pipeline is that it can distribute the usually computationally heavy all-versus-all alignment step across a whole cluster. It can split the initial approximate alignments into problems of equal size, and the base-level alignments are then distributed across several processes. If you have a cluster with 10 nodes and you are the only one using it, we recommend setting --wfmash_chunks 10. If you have a cluster with 20 nodes but have to share it with others, --wfmash_chunks 10 could still be a good fit, because then you don't have to wait too long for your jobs to finish.
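To give an intuition for what splitting into problems of equal size means, the sketch below distributes mappings across chunks so that the total mapped length per chunk is roughly balanced. It uses a simple greedy heuristic chosen for illustration; it is not the pipeline's actual splitting logic, and the function name is hypothetical.

```python
import heapq


def split_into_chunks(mapping_lengths, n_chunks):
    """Greedily assign mapping indices to n_chunks with balanced total length."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Heap of (current total length, chunk index); the lightest chunk pops first.
    heap = [(0, chunk) for chunk in range(n_chunks)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    # Place the longest mappings first, always into the lightest chunk.
    order = sorted(enumerate(mapping_lengths), key=lambda pair: -pair[1])
    for index, length in order:
        total, chunk = heapq.heappop(heap)
        chunks[chunk].append(index)
        heapq.heappush(heap, (total + length, chunk))
    return chunks
```

For example, six mappings of lengths 10, 9, 8, 7, 6 and 5 split into 3 chunks end up with a total length of 15 in each chunk.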
If this parameter is set, only the wfmash alignment step of the pipeline is executed. This option is offered for users who want to run wfmash on a cluster.
boolean
Filter out mappings unlikely to be this Average Nucleotide Identity (ANI) less than the best mapping.
integer
30
Number of mappings for each segment. [default: n_haplotypes - 1].
integer
Ignores exact matches below this length.
integer
23
Graph induction with seqwish often works better when we filter very short matches out of the input alignments. In practice, these often occur in regions of low alignment quality, which are typical of areas with large INDELs and structural variations in the wfmash alignments. This underalignment is then resolved in the smoothxg step. Removing short matches can simplify the graph and remove spurious relationships caused by short repeated homologies.
A setting of --seqwish_min_match_length 47 is optimal for around 5% divergence, and we suggest lowering it for higher divergence and increasing it for lower divergence. Values up to --seqwish_min_match_length 311 work well for human haplotypes. In effect, setting --seqwish_min_match_length to N means that we can tolerate a local pairwise difference rate of no more than 1/N. Thus, INDELs which may be represented by complex series of edit operations will be opened into bubbles in the induced graph, and alignment regions with very low identity will be ignored. Using affine-gapped alignment (such as with minimap2) may reduce the impact of this step by representing large indels more precisely in the input alignments. However, it remains important due to local inconsistency in alignments in low-complexity sequence.
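The 1/N relationship described above can be made concrete with a small calculation. This is only the arithmetic from the text; the function name is hypothetical.

```python
def max_local_difference_rate(min_match_length):
    """Largest tolerated local pairwise difference rate for a minimum match length N."""
    if min_match_length < 1:
        raise ValueError("min_match_length must be a positive integer")
    return 1.0 / min_match_length


for n in (23, 47, 311):
    print(f"--seqwish_min_match_length {n}: {100 * max_local_difference_rate(n):.2f}%")
```

For the values mentioned above, this gives about 4.35% for 23, 2.13% for 47 and 0.32% for 311.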
Number of base pairs to use for transitive closure batch.
string
10000000
If you run out of memory during the seqwish step, you can lower this value. It will take longer, but it will use less memory.
This parameter must be a combination of the following values: 1, 9, d, kKmMgGtT, 0
Keep this randomly selected fraction of input matches.
number
Set the directory where temporary files should be stored. Since everything runs in containers, we don't usually set this argument.
string
Input PAF file. The wfmash alignment step is skipped.
string
Options for graph smoothing phase.
Skip the graph smoothing step of the pipeline.
boolean
Maximum path jump to include in the block.
integer
Maximum edge jump before a block is broken.
integer
Maximum sequence length to put into POA, given as a comma-separated list. SMOOTHXG will be executed once for each integer.
string
700,900,1100
The last step in smoothxg refines the graph by running a partial order alignment (POA) across segments, so-called blocks. The "chunked" POA process attempts to build an MSA for each collinear region in the sorted graph.
The length of these sub-problems greatly affects the total time and memory requirements of the pipeline, and is defined by --smoothxg_poa_length <LEN1,LEN2,...>. Several passes of refinement can be defined by lengths <LEN1,LEN2,...>, and so on. Ideally, this target can be set above the length of transposon repeats in the pangenome, and base-level graph quality tends to improve as it is set higher. Higher values make sense for lower-diversity pangenomes, but can require several GB of RAM per thread.
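The comma-separated list translates into one refinement pass per value. A minimal sketch of that interpretation, with a hypothetical function name:

```python
def parse_poa_lengths(spec):
    """Split a value such as '700,900,1100' into one target length per smoothing pass."""
    lengths = [int(field) for field in spec.split(",")]
    if any(length <= 0 for length in lengths):
        raise ValueError("POA lengths must be positive integers")
    return lengths


passes = parse_poa_lengths("700,900,1100")
for number, length in enumerate(passes, start=1):
    print(f"pass {number}: target POA length {length}")
```

With the default value, this yields three passes with target lengths 700, 900 and 1100.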
Minimum edit-based identity to cluster sequences.
string
Minimum 'smallest / largest' sequence length ratio to cluster in a block.
integer
Path depth at which we don't pad the POA problem.
integer
100
Pad each end of each sequence in POA with 'smoothxg_poa_padding * longest_poa_seq' base pairs.
number
0.001
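The padding rule can be illustrated with a short calculation. This sketch makes two assumptions that the text does not confirm: the product is truncated to a whole number of base pairs, and padding is skipped once the path depth reaches the threshold. The function and argument names are hypothetical.

```python
def poa_padding_bp(longest_poa_seq, path_depth, padding_fraction=0.001, pad_max_depth=100):
    """Base pairs of padding added to each end of each sequence in a POA block."""
    # Assumption: no padding once the path depth reaches the threshold.
    if path_depth >= pad_max_depth:
        return 0
    # Assumption: the product is truncated to whole base pairs.
    return int(padding_fraction * longest_poa_seq)
```

Under these assumptions, a longest sequence of 1100 bp at path depth 10 gets 1 bp of padding per end, and none at path depth 100.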
Score parameters for POA in the form of 'match,mismatch,gap1,ext1,gap2,ext2'. It may also be given as presets: 'asm5', 'asm10', 'asm15', 'asm20'. [default: 1,19,39,3,81,1 = asm5].
string
1,19,39,3,81,1
For details of the different asm modes, please take a look at https://pggb.readthedocs.io/en/latest/rst/optional_parameters.html#homogenizing-and-ordering-the-graph.
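The six-field score string can be unpacked as sketched below. Only the asm5 preset is included, because it is the only one whose values are stated above; the class and function names are hypothetical.

```python
from typing import NamedTuple


class PoaScores(NamedTuple):
    match: int
    mismatch: int
    gap1: int
    ext1: int
    gap2: int
    ext2: int


# Only the preset whose values are given in the text above.
PRESETS = {"asm5": "1,19,39,3,81,1"}


def parse_poa_params(value):
    """Parse 'match,mismatch,gap1,ext1,gap2,ext2' or a known preset name."""
    spec = PRESETS.get(value, value)
    fields = spec.split(",")
    if len(fields) != 6:
        raise ValueError(f"expected 6 comma-separated scores or a known preset: {value!r}")
    return PoaScores(*(int(field) for field in fields))
```

Here `parse_poa_params("asm5")` and `parse_poa_params("1,19,39,3,81,1")` return the same scores.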
Write MAF output representing merged POA blocks.
boolean
Use this prefix for consensus path names.
string
Consensus_
Set the directory where temporary files should be stored. Since everything runs in containers, we don't usually set this argument.
string
Keep intermediate graphs during SMOOTHXG.
boolean
Run abPOA. [default: SPOA].
boolean
Run the POA in global mode. [default: local mode].
boolean
Number of CPUs for the potentially very memory expensive POA phase of SMOOTHXG. Default is 'task.cpus'.
integer
Options for calling variants against reference(s).
Specify a set of VCFs to produce with --vcf_spec "REF[:LEN][,REF[:LEN]]*".
string
The paths matching ^REF are used as a reference, while the sample haplotypes are derived from path names, e.g. when DELIM=# and with -V chm13, a path named HG002#1#ctg would be assigned to sample HG002 phase 1. If LEN is specified and greater than 0, the VCFs are decomposed, filtering sites whose max allele length is greater than LEN.
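The structure of the specification string and the way sample and phase are derived from a path name can be sketched as follows. The function names are hypothetical, and a missing LEN is represented as None, which is a choice made for this illustration.

```python
def parse_vcf_spec(spec):
    """Parse 'REF[:LEN][,REF[:LEN]]*' into (reference_prefix, max_allele_length) pairs."""
    entries = []
    for item in spec.split(","):
        ref, sep, length = item.partition(":")
        if not ref:
            raise ValueError(f"missing reference prefix in {item!r}")
        # None means that no LEN was given for this reference.
        entries.append((ref, int(length) if sep else None))
    return entries


def sample_and_phase(path_name, delim="#"):
    """Derive (sample, phase) from a path name such as 'HG002#1#ctg'."""
    fields = path_name.split(delim)
    if len(fields) < 2:
        raise ValueError(f"cannot derive sample and phase from {path_name!r}")
    return fields[0], fields[1]
```

For instance, `parse_vcf_spec("chm13:1000,grch38")` gives `[("chm13", 1000), ("grch38", None)]` (the reference names and length are illustrative), and `sample_and_phase("HG002#1#ctg")` gives `("HG002", "1")`.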
Options to run the partition algorithm for community detection.
Enable community detection.
boolean
Pangenome graphs can represent all mutual alignments of collections of sequences. However, we can't really expect to pairwise map all sequences together and obtain well separated connected components. We are likely to get a giant connected component, and probably a few smaller ones, due to incorrect mappings or false homologies. This might unnecessarily increase the computational burden, as well as complicate the downstream analyses. Therefore, it is recommended to split up the input sequences into communities in order to find the latent structure of their mutual relationship. For example, the communities can represent the different chromosomes of the input genomes.
Warning: If you know in advance that your sequences contain particular rearrangements (like rare chromosome translocations), you might consider skipping this step or tuning it according to your biological questions.
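The giant component problem can be illustrated with a toy example. The sketch below computes plain connected components with a union-find structure; this is deliberately simpler than community detection and is not the algorithm used by the pipeline. It shows how a single spurious mapping is enough to merge two otherwise separate chromosomes. All names and mappings are made up for illustration.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the root, compressing the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


sequences = ["A#1#chr1", "B#1#chr1", "A#1#chr2", "B#1#chr2"]
true_mappings = [("A#1#chr1", "B#1#chr1"), ("A#1#chr2", "B#1#chr2")]
spurious_mapping = ("A#1#chr1", "B#1#chr2")

print(len(connected_components(sequences, true_mappings)))
print(len(connected_components(sequences, true_mappings + [spurious_mapping])))
```

With only the true mappings there are two components, one per chromosome. Adding the single spurious mapping collapses them into one, which is why a method that looks at the overall structure of the mappings is preferable to relying on connected components alone.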
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.
Institutional config name.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string
Less common options for the pipeline, typically set in a config file.
Display version and exit.
boolean
Method used to save pipeline results to output directory.
string
The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
Send plain-text email instead of HTML.
boolean
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
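The size pattern above is case-sensitive and requires a trailing B, which can be checked quickly as in this sketch. The helper name is hypothetical and not part of the pipeline.

```python
import re

# Pattern copied from the MultiQC attachment size limit above.
SIZE_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")


def is_valid_size(value):
    """Return True if the value matches the file size pattern."""
    return SIZE_PATTERN.match(value) is not None
```

Values such as `25.MB`, `25 MB` and `1.5GB` match, while `25` (no unit) and `25mb` (lowercase) do not.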
Do not use coloured log outputs.
boolean
Incoming hook URL for messaging service
string
Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported.
Custom config file to supply to MultiQC.
string
Custom logo file to supply to MultiQC. File name must also be set in the MultiQC config file
string
Custom MultiQC yaml file containing HTML including a methods description.
string
Boolean whether to validate parameters against the schema at runtime
boolean
true
Base URL or local path to location of pipeline test dataset files
string
https://raw.githubusercontent.com/nf-core/test-datasets/
Do we want to display hidden parameters?
boolean
Do we want to display hidden parameters?
string
igenomes_base
Suffix to add to the trace report filename. Default is the date and time in the format yyyy-MM-dd_HH-mm-ss.
string