pangenome: Parameters

Define where the pipeline should find input data and save output data.

Path to BGZIPPED input FASTA to build the pangenome graph from.

required

type: string

pattern: ^\S+\.fn?a(sta)?(\.gz)?$

A FASTA file containing the sequences to build the pangenome graph from. Each sequence can be a full chromosome, a contig, or a very long read. The FASTA file must be BGZIPPED or WFMASH won't be able to process it. If you have your sequences in FASTA format, you can run:

bgzip -@ <THREADS> -c <SEQUENCES.fa> > <SEQUENCES.fa.gz>
samtools faidx <SEQUENCES.fa.gz>

To ensure the broadest compatibility, please format your sequence identifiers according to the PanSN-spec (https://github.com/pangenome/PanSN-spec).
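As a rough illustration of that naming scheme, the sketch below splits a PanSN-style identifier into its fields. It assumes the three-field `sample#haplotype#contig` layout with `#` as the delimiter; the names `PanSNName` and `parse_pansn` are hypothetical and not part of the pipeline.

```python
from typing import NamedTuple


class PanSNName(NamedTuple):
    sample: str
    haplotype: str
    contig: str


def parse_pansn(name: str, delim: str = "#") -> PanSNName:
    """Split a PanSN-style sequence name into (sample, haplotype, contig).

    Assumption of this sketch: split at most twice, so any further
    delimiter characters stay inside the contig field.
    """
    parts = name.split(delim, 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a PanSN-style name: {name!r}")
    return PanSNName(*parts)


print(parse_pansn("HG002#1#ctg"))
```

A check like this can be run over the FASTA headers before launching the pipeline to spot identifiers that do not follow the convention.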

The number of haplotypes in the input FASTA.

required

type: number

The constructed graph is defined by the number of mappings per segment of each genome (--n_haplotypes <N> - 1). Ideally, set this to the number of haplotypes in the pangenome, because that is the maximum number of secondary mappings and alignments that we expect. Keep in mind that the total work of alignment is proportional to N*N, and these multimappings can be highly redundant.
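The two relationships described here can be put into numbers with a small sketch. The function names are hypothetical, and the work figure is in arbitrary units: it only shows the N*N scaling, not actual runtimes.

```python
def mappings_per_segment(n_haplotypes: int) -> int:
    """Mappings kept per segment, following the n_haplotypes - 1 rule."""
    return n_haplotypes - 1


def relative_alignment_work(n_haplotypes: int) -> int:
    """Alignment work in arbitrary units, proportional to N * N."""
    return n_haplotypes * n_haplotypes


for n in (4, 8, 16):
    print(n, mappings_per_segment(n), relative_alignment_work(n))
```

Doubling the number of haplotypes therefore roughly quadruples the alignment work in this model.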

The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.

required

type: string

Email address for completion summary.

type: string

pattern: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.

MultiQC report title. Printed as page header, used for filename if not otherwise specified.

type: string

Options for the all versus all alignment phase.

Percent identity in the wfmash mashmap step.

type: number

default: 90

Use mash dist or mash triangle to explore the typical level of divergence between the sequences in your input (see https://pggb.readthedocs.io/en/latest/rst/tutorials/divergence_estimation.html#divergence-estimation for more information). Convert this to an approximate percent identity and provide it as --wfmash_map_pct_id <PCT>. A list of examples can be found at https://github.com/pangenome/pggb#example-builds-for-diverse-species.
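The conversion step can be sketched as follows. This assumes the simple approximation that percent identity is 100 * (1 - mash distance); the function name `approx_pct_id` is hypothetical, and the linked tutorial remains the reference for choosing the final value.

```python
def approx_pct_id(max_mash_distance: float) -> float:
    """Turn a mash distance (between 0 and 1) into an approximate percent identity.

    Assumption: identity is approximated as 1 - distance.
    """
    if not 0.0 <= max_mash_distance <= 1.0:
        raise ValueError("mash distance must lie between 0 and 1")
    return round(100.0 * (1.0 - max_mash_distance), 2)


# The largest pairwise distance observed in the input is the relevant one.
print(approx_pct_id(0.1))
```

Under this approximation, a maximum distance of 0.1 corresponds to the default of 90.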

Segment length for mapping.

type: string

default: 5000

Crucially, --wfmash_segment_length provides a kind of minimum alignment length filter. The mashmap3 step in wfmash will only consider segments of this size. For small pangenome graphs, or where there are few repeats, --wfmash_segment_length can be set low (for example 500 when building an MHC pangenome graph). However, for larger contexts with repeats, it can be very important to set this high (for instance 50k in the case of human genomes). A long segment length ensures that we represent long collinear regions of the input sequences in the structure of the graph. In general, this should at least be larger than transposons and other common repeats in your pangenome. A list of examples can be found at https://github.com/pangenome/pggb#example-builds-for-diverse-species.

The value must consist of digits, optionally followed by one unit suffix from kKmMgGtT (for example 5000 or 50k).

Minimum block length filter for mapping.

type: string

By default, wfmash only keeps mappings with at least 5 times the size of a segment. This can be adjusted with --wfmash_block_length <BLOCK_LENGTH>.

The value must consist of digits, optionally followed by one unit suffix from kKmMgGtT (for example 5000 or 50k).
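To make the suffix notation and the 5x default concrete, here is a minimal sketch. It assumes that the suffixes are decimal multipliers (k = 1,000, m = 1,000,000, and so on), which is not stated in this document; `parse_size` and `default_block_length` are hypothetical helper names.

```python
import re

# Assumed multipliers: decimal units (k = 1e3, m = 1e6, g = 1e9, t = 1e12).
_MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(value: str) -> int:
    """Convert strings such as '5000' or '50k' into a number of base pairs."""
    match = re.fullmatch(r"(\d+)([kKmMgGtT]?)", value.strip())
    if match is None:
        raise ValueError(f"unsupported size string: {value!r}")
    digits, suffix = match.groups()
    return int(digits) * _MULTIPLIERS[suffix.lower()]


def default_block_length(segment_length: str) -> int:
    """Default block length: 5 times the segment length."""
    return 5 * parse_size(segment_length)


print(parse_size("50k"), default_block_length("5000"))
```

With the default segment length of 5000, the default minimum block length in this model is 25000 base pairs.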

Kmer size for mashmap.

type: integer

default: 19

Ignore the top % most-frequent kmers.

type: number

default: 0.001

Keep this fraction of mappings (auto for giant component heuristic).

type: string

default: 1.0

pattern: (auto|[01]\.\d+)

Merge successive mappings.

type: boolean

Disable splitting of input sequences during mapping.

hidden

type: boolean

Skip mappings between sequences with the same name prefix before the given delimiter character. This can be helpful if several sequences originate from the same chromosome. It is recommended that the sequence names follow the PanSN-spec (https://github.com/pangenome/PanSN-spec). In future versions of the pipeline, following this specification will be required.

type: string

Set the directory where temporary files should be stored. Since everything runs in containers, we don't usually set this argument.

hidden

type: string

The number of files to generate from the approximate wfmash mappings to scale across a whole cluster. It is recommended to set this to the number of available nodes. If only one machine is available, leave it at 1.

type: integer

default: 1

A major advantage of this Nextflow pipeline is that it can distribute the usually computationally heavy all versus all alignment step across a whole cluster. It splits the initial approximate alignments into problems of equal size, and the base-level alignments are then distributed across several processes. If you have a cluster with 10 nodes and are its only user, we recommend setting --wfmash_chunks 10. If you have a cluster with 20 nodes but share it with others, --wfmash_chunks 10 could still be a good fit, because you then don't have to wait too long for your jobs to finish.

If this parameter is set, only the wfmash alignment step of the pipeline is executed. This option is offered for users who want to run wfmash on a cluster.

type: boolean

Filter out mappings unlikely to be this Average Nucleotide Identity (ANI) less than the best mapping.

type: integer

default: 30

Number of mappings for each segment. [default: n_haplotypes - 1].

type: integer

Ignores exact matches below this length.

type: integer

default: 23

Graph induction with seqwish often works better when we filter very short matches out of the input alignments. In practice, these often occur in regions of low alignment quality, which are typical of areas with large INDELs and structural variations in the wfmash alignments. This underalignment is then resolved in the smoothxg step. Removing short matches can simplify the graph and remove spurious relationships caused by short repeated homologies.
A setting of --seqwish_min_match_length 47 is optimal for around 5% divergence, and we suggest lowering it for higher divergence and increasing it for lower divergence. Values up to --seqwish_min_match_length 311 work well for human haplotypes. In effect, setting --seqwish_min_match_length to N means that we can tolerate a local pairwise difference rate of no more than 1/N. Thus, INDELs which may be represented by complex series of edit operations will be opened into bubbles in the induced graph, and alignment regions with very low identity will be ignored. Using affine-gapped alignment (such as with minimap2) may reduce the impact of this step by representing large indels more precisely in the input alignments. However, it remains important due to local inconsistency in alignments in low-complexity sequence.
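The 1/N rule of thumb can be turned into numbers with a short sketch. The function name is hypothetical; it only evaluates the relationship stated above for a few of the values mentioned.

```python
def max_tolerated_difference_rate(min_match_length: int) -> float:
    """Local pairwise difference rate tolerated for a given N, that is 1/N."""
    if min_match_length <= 0:
        raise ValueError("min_match_length must be positive")
    return 1.0 / min_match_length


for n in (23, 47, 311):
    rate = 100 * max_tolerated_difference_rate(n)
    print(f"N={n}: about {rate:.2f}% local difference tolerated")
```

For the default of 23 this gives roughly 4.3%, for 47 roughly 2.1%, and for 311 roughly 0.3%.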

Number of base pairs to use for transitive closure batch.

type: string

default: 10000000

If you run out of memory during the seqwish step, you can lower this value. It will take longer, but it will use less memory.

The value must consist of digits, optionally followed by one unit suffix from kKmMgGtT (for example 5000 or 50k).

Keep this randomly selected fraction of input matches.

type: number

Set the directory where temporary files should be stored. Since everything runs in containers, we don't usually set this argument.

hidden

type: string

Input PAF file. The wfmash alignment step is skipped.

type: string

Options for graph smoothing phase.

Skip the graph smoothing step of the pipeline.

type: boolean

Maximum path jump to include in the block.

hidden

type: integer

Maximum edge jump before a block is broken.

hidden

type: integer

Maximum sequence length to put into POA, given as a comma-separated list. SMOOTHXG will be executed once for each integer.

type: string

default: 700,900,1100

The last step in smoothxg refines the graph by running a partial order alignment (POA) across segments, so called blocks. The "chunked" POA process attempts to build an MSA for each collinear region in the sorted graph.
The length of these sub-problems greatly affects the total time and memory requirements of the pipeline, and is defined by --smoothxg_poa_length <LEN1,LEN2,...>. Several passes of refinement can be defined by the lengths <LEN1,LEN2,...>. Ideally, this target can be set above the length of transposon repeats in the pangenome, and base-level graph quality tends to improve as it is set higher. Higher values make sense for lower-diversity pangenomes, but can require several GB of RAM per thread.
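How the comma-separated value maps to refinement passes can be sketched like this. The helper `parse_poa_lengths` is hypothetical; it only illustrates that each integer in the list corresponds to one smoothxg run.

```python
from typing import List


def parse_poa_lengths(spec: str) -> List[int]:
    """Parse a comma-separated list such as '700,900,1100' into integers."""
    lengths = [int(item) for item in spec.split(",") if item.strip()]
    if not lengths or any(length <= 0 for length in lengths):
        raise ValueError(f"invalid POA length list: {spec!r}")
    return lengths


lengths = parse_poa_lengths("700,900,1100")
print(f"{len(lengths)} smoothxg passes with target lengths {lengths}")
```

The default value therefore results in three passes.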

Minimum edit-based identity to cluster sequences.

hidden

type: string

Minimum 'smallest / largest' sequence length ratio to cluster in a block.

hidden

type: integer

Path depth at which we don't pad the POA problem.

type: integer

default: 100

Pad each end of each sequence in POA with 'smoothxg_poa_padding * longest_poa_seq' base pairs.

type: number

default: 0.001

Score parameters for POA in the form of 'match,mismatch,gap1,ext1,gap2,ext2'. It may also be given as presets: 'asm5', 'asm10', 'asm15', 'asm20'. [default: 1,19,39,3,81,1 = asm5].

type: string

default: 1,19,39,3,81,1

For details of the different asm modes, please take a look at https://pggb.readthedocs.io/en/latest/rst/optional_parameters.html#homogenizing-and-ordering-the-graph.
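The six-field score string can be unpacked as in the sketch below. Only the asm5 values are stated in this document, so the preset table contains just that entry; the names `PoaScores`, `PRESETS`, and `parse_poa_params` are hypothetical.

```python
from typing import NamedTuple

# Only the asm5 values are given in the text; other presets are
# deliberately left out rather than guessed.
PRESETS = {"asm5": "1,19,39,3,81,1"}


class PoaScores(NamedTuple):
    match: int
    mismatch: int
    gap1: int
    ext1: int
    gap2: int
    ext2: int


def parse_poa_params(value: str) -> PoaScores:
    """Parse 'match,mismatch,gap1,ext1,gap2,ext2' or a known preset name."""
    raw = PRESETS.get(value, value)
    fields = [int(item) for item in raw.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 comma-separated integers, got {value!r}")
    return PoaScores(*fields)


print(parse_poa_params("asm5"))
```

Passing the preset name or the explicit default string yields the same six scores.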

Write MAF output representing merged POA blocks.

type: boolean

Use this prefix for consensus path names.

hidden

type: string

default: Consensus_

Set the directory where temporary files should be stored. Since everything runs in containers, we don't usually set this argument.

hidden

type: string

Keep intermediate graphs during SMOOTHXG.

hidden

type: boolean

Run abPOA. [default: SPOA].

type: boolean

Run the POA in global mode. [default: local mode].

type: boolean

Number of CPUs for the potentially very memory expensive POA phase of SMOOTHXG. Default is 'task.cpus'.

type: integer

Options for calling variants against reference(s).

Specify a set of VCFs to produce with --vcf_spec "REF[:LEN][,REF[:LEN]]*".

type: string

The paths matching ^REF are used as a reference, while the sample haplotypes are derived from path names, e.g. when DELIM=# and with -V chm13, a path named HG002#1#ctg would be assigned to sample HG002 phase 1. If LEN is specified and greater than 0, the VCFs are decomposed, filtering sites whose max allele length is greater than LEN.
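The specification string can be read as a list of reference prefixes with optional length limits. The following sketch parses it under that reading; `parse_vcf_spec` and the example reference name `grch38` are hypothetical and only used for illustration.

```python
from typing import List, Optional, Tuple


def parse_vcf_spec(spec: str) -> List[Tuple[str, Optional[int]]]:
    """Parse 'REF[:LEN][,REF[:LEN]]*' into (reference, length or None) pairs."""
    result = []
    for entry in spec.split(","):
        ref, _, length = entry.partition(":")
        if not ref:
            raise ValueError(f"empty reference name in {spec!r}")
        # A missing LEN is kept as None, meaning no decomposition filter.
        result.append((ref, int(length) if length else None))
    return result


print(parse_vcf_spec("chm13:1000,grch38"))
```

Each pair then corresponds to one VCF: the first entry would be decomposed with a maximum allele length of 1000, while the second would be left as is.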

Options to run the partition algorithm for community detection.

Enable community detection.

type: boolean

Pangenome graphs can represent all mutual alignments of collections of sequences. However, we can't really expect to pairwise map all sequences together and obtain well separated connected components. We are likely to get one giant connected component, and probably a few smaller ones, due to incorrect mappings or false homologies. This might unnecessarily increase the computational burden and complicate downstream analyses. Therefore, it is recommended to split the input sequences into communities in order to find the latent structure of their mutual relationships. For example, the communities can represent the different chromosomes of the input genomes.
Warning: If you know in advance that your sequences contain particular rearrangements (like rare chromosome translocations), you might consider skipping this step or tuning it according to your biological questions.

Parameters used to describe centralised config profiles. These should not be edited.

Git commit id for Institutional configs.

hidden

type: string

default: master

Base directory for Institutional configs.

hidden

type: string

default: https://raw.githubusercontent.com/nf-core/configs/master

If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.

Institutional config name.

hidden

type: string

Institutional config description.

hidden

type: string

Institutional config contact information.

hidden

type: string

Institutional config URL link.

hidden

type: string

Set the top limit for requested resources for any single job.

Maximum number of CPUs that can be requested for any single job.

hidden

type: integer

default: 16

Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1

Maximum amount of memory that can be requested for any single job.

hidden

type: string

default: 128.GB

pattern: ^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$

Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'

Maximum amount of time that can be requested for any single job.

hidden

type: string

default: 240.h

pattern: ^(\d+\.?\s*(s|m|h|d|day)\s*)+$

Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
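The memory and time patterns listed for these limits can be tried out before a run. The sketch below copies both patterns from the parameter descriptions and evaluates them with Python's `re` module, which may differ in edge cases from the regex dialect used by the schema validator; the helper names are hypothetical.

```python
import re

# Patterns copied verbatim from the max_memory and max_time descriptions.
MEMORY_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")
TIME_PATTERN = re.compile(r"^(\d+\.?\s*(s|m|h|d|day)\s*)+$")


def is_valid_memory(value: str) -> bool:
    """True if the value matches the memory pattern, e.g. '128.GB'."""
    return MEMORY_PATTERN.match(value) is not None


def is_valid_time(value: str) -> bool:
    """True if the value matches the time pattern, e.g. '240.h'."""
    return TIME_PATTERN.match(value) is not None


for candidate in ("128.GB", "8.GB", "128"):
    print(candidate, is_valid_memory(candidate))
for candidate in ("240.h", "2.h", "two hours"):
    print(candidate, is_valid_time(candidate))
```

Values without a unit, such as a bare 128, are rejected by the memory pattern, which is a common source of validation errors.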

Less common options for the pipeline, typically set in a config file.

Display help text.

hidden

type: boolean

Display version and exit.

hidden

type: boolean

Method used to save pipeline results to output directory.

hidden

type: string

The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.

Email address for completion summary, only when pipeline fails.

hidden

type: string

pattern: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.

Send plain-text email instead of HTML.

hidden

type: boolean

File size limit when attaching MultiQC reports to summary emails.

hidden

type: string

default: 25.MB

pattern: ^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$

Do not use coloured log outputs.

hidden

type: boolean

Incoming hook URL for messaging service

hidden

type: string

Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported.

Custom config file to supply to MultiQC.

hidden

type: string

Custom logo file to supply to MultiQC. File name must also be set in the MultiQC config file

hidden

type: string

Custom MultiQC yaml file containing HTML including a methods description.

type: string

Boolean whether to validate parameters against the schema at runtime

hidden

type: boolean

default: true

Show all params when using --help

hidden

type: boolean

By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help. Specifying this option will tell the pipeline to show all parameters.

Validation of parameters fails when an unrecognised parameter is found.

hidden

type: boolean

By default, an unrecognised parameter only triggers a warning.

Validation of parameters in lenient mode.

hidden

type: boolean

default: true

Allows string values that are parseable as numbers or booleans. For further information see JSONSchema docs.

Do we want to display hidden parameters?

hidden

type: boolean

Do we want to display hidden parameters?

hidden

type: string

default: igenomes_base

Do we want to display hidden parameters?
