Module Specifications
The key words “MUST”, “MUST NOT”, “SHOULD”, etc. are to be interpreted as described in RFC 2119.
1 General
1.1 Required and optional input files
All mandatory and optional input files MUST be included in input
channel definitions.
1.2 Non-file mandatory command arguments
Mandatory non-file arguments, or arguments needed to make the command run without error, SHOULD be provided as value channels (for example lib_type in salmon/quant); see ‘Input/output options’ below.
1.3 Optional command arguments
All non-mandatory, non-file command-line tool arguments MUST be provided as a string via the $task.ext.args variable.
- The value of task.ext.args is supplied from the modules.config file by assigning a closure that returns a string value to ext.args. The closure is necessary to update parameters supplied in a config with -c.
A disadvantage of passing arguments via ext.args is that it splits up how information is passed to a module, which can make it difficult to understand where module inputs are defined.
The justification for using ext.args is to provide more flexibility to users.
As ext.args is derived from the configuration (e.g. modules.config), advanced users can overwrite the default ext.args and supply their own arguments to modify the behaviour of a module.
This can increase the capabilities of a pipeline beyond what the original developers intended.
Initially these arguments were passed via the main workflow script using custom functions (e.g. addParams) and other additional nf-core custom methods, but this added syntax overhead and other limitations that pipeline developers found difficult to use and understand.
The ‘native’ ext functionality provided by Nextflow is therefore easier to understand, maintain and use.
Note that sample-specific parameters can still be provided to an instance of a process by storing these in meta, and providing these to the ext.args definition in modules.config.
A closure is used to make Nextflow evaluate the code in the string.
1.4 Use of multi-command piping
Software that can be piped together SHOULD be added to separate module files unless there is a run-time or storage advantage in implementing it this way.
For example, a combination of bwa and samtools can be used to output a BAM file instead of a SAM file.
The addition of multi-tool modules to nf-core/modules places an increased burden on the nf-core maintainers.
Where possible, if a multi-tool module is desired, it should be implemented as a local module in the nf-core pipeline.
If another nf-core pipeline also desires to use this module, a PR can be made adding it to nf-core/modules.
For guidelines regarding multi-tool modules, please search this page for the phrase multi-tool.
Existing local multi-tool modules can be found using the GitHub search box, searching across the nf-core org for terms such as args2 samtools collate fastq.
Modules intended to batch process files by parallelizing repeated calls to a tool, for example with xargs or parallel, also fall under the category of multi-tool modules.
Multi-tool modules should chain tools in an explicit order given by the module name, e.g. SAMTOOLS/COLLATEFASTQ.
1.5 Each command must have an $args variable
Each tool in a multi-tool module MUST have its own $args variable.
The numbering of each $args variable MUST correspond to the order of the tools, and MUST be documented in meta.yml.
For example, when bwa mem is piped into samtools view, bwa mem is the first tool so is given $args, samtools view is the second tool so is given $args2, etc.
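As a rough illustration of the numbering rule, the sketch below pipes two generic Unix tools together. The shell variables args and args2 are stand-ins for the strings that would come from ext.args and ext.args2, and sort and uniq merely take the place of real tools such as bwa mem and samtools view; nothing here is taken from an actual module.

```shell
# Hypothetical stand-ins for the strings taken from ext.args / ext.args2.
args='-r'     # options for the first tool in the pipe (sort)
args2='-c'    # options for the second tool in the pipe (uniq)

# The variables are left unquoted on purpose so that each option string is
# handed to its tool as command-line options.
printf 'chr2\nchr1\nchr2\n' | sort $args | uniq $args2 > counts.txt
cat counts.txt
```

The first tool in the pipe receives args and the second receives args2, so the numbering can be read directly from the order of the commands.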
1.6 Types of meta fields
Modules MUST NOT use ‘custom’ hardcoded meta fields.
The only accepted ‘standard’ meta fields are meta.id and meta.single_end.
Proposals for other ‘standard’ fields for other disciplines must be discussed with the maintainers team.
Modules should be written to allow as much flexibility to pipeline developers as possible.
Hardcoding meta fields in a module reduces the freedom of developers to use their own names for metadata, names which may make more sense in their particular context.
As all non-mandatory arguments MUST go via $args, pipeline developers can insert such meta information into $args with whatever name they wish.
So, in the module code DO NOT:
… but rather:
and then in the module code:
1.7 Compression of input and output files
Where applicable, the usage and generation of compressed files SHOULD be enforced as input and output, respectively:
- *.fastq.gz and NOT *.fastq
- *.bam and NOT *.sam
If a tool does not support compressed input or output natively, we RECOMMEND passing the uncompressed data via unix pipes, such that it never gets written to disk, e.g.
The -f option makes gzip auto-detect if the input is compressed or not.
If a tool cannot read from STDIN, or has multiple input files, it is possible to use named pipes:
Uncompressing the file before running the tool is only required if the tool reads the input multiple times.
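The two approaches above can be sketched with standard Unix tools. This is a minimal illustration only: the file names are made up, and wc and grep stand in for a real bioinformatics tool that either reads STDIN or insists on a file path.

```shell
# Create a small gzip-compressed input to work with (hypothetical file name).
printf 'read1\nread2\nread3\n' | gzip -c > reads.txt.gz

# Stream the decompressed data straight into the tool (here: wc) via a pipe.
# -c writes to stdout, -d decompresses, -f passes uncompressed input through.
n_piped=$(gzip -cdf reads.txt.gz | wc -l | tr -d ' ')

# For a tool that needs a file path, a named pipe can stand in for the file.
mkfifo reads.fifo
gzip -cdf reads.txt.gz > reads.fifo &
n_fifo=$(grep -c '' reads.fifo)
wait
rm reads.fifo

echo "piped: $n_piped, fifo: $n_fifo"
```

In both cases the uncompressed data only ever exists in the pipe and is never written to disk.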
1.8 Emission of versions
Where applicable, each module command MUST emit a file versions.yml containing the version number for each tool executed by the module, e.g.
resulting in, for instance,
All reported versions MUST be without a leading v or similar (i.e. must start with a numeric character), or for unversioned software, a Git SHA commit id (40 character hexadecimal string).
sed is a powerful stream editor that can be used to manipulate the input text into the desired output.
Start by piping the output of the version command to sed and try to select the line with the version number:
- sed '1!d' extracts only line 1 of the output printed by tools --version.
- The line to process can also be selected using a pattern instead of a number: sed '/pattern/!d', e.g. sed '/version:/!d'.
- If the line extraction hasn’t worked, then it’s likely the version information is written to stderr, rather than stdout. In this case capture stderr using |& which is shorthand for 2>&1 |.
- sed 's/pattern/replacement/' can be used to remove parts of a string. In a pattern, . matches any character and + matches 1 or more times.
- You can separate sed commands using ;. Often the pattern sed 'filter line ; replace string' is enough to get the version number.
- It is not necessary to use echo, head, tail, or grep.
- Use || true for tools that exit with a non-zero error code: command --version || true or command --version | sed ... || true.
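The tips above can be combined as in the following sketch. The function mytool_version is purely hypothetical: it imitates a tool that prints a multi-line banner to stderr and exits with a non-zero code, so the banner text and version number are assumptions made for illustration.

```shell
# Hypothetical tool: prints its banner to stderr and exits non-zero.
mytool_version() {
    printf 'mytool: a demo program\nversion: v1.2.3 (build 42)\n' >&2
    return 1
}

# 2>&1 folds stderr into the pipe. The first sed command keeps only the line
# containing "version:", the second removes everything up to and including
# the leading v, and the third removes everything after the version number.
version=$(mytool_version 2>&1 | sed '/version:/!d; s/.*version: v//; s/ .*//' || true)
echo "$version"
```

A single sed invocation does both the line filtering and the string clean-up, so no additional head, tail or grep call is needed.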
We chose a HEREDOC over piping into the versions file line-by-line as we believe the latter makes it easy to accidentally overwrite the file. Moreover, the exit status of the sub-shells evaluated within the HEREDOC is ignored, ensuring that a tool’s version command does not erroneously terminate the module.
If the software is unable to output a version number on the command-line then a variable called VERSION can be manually specified to provide this information, e.g. homer/annotatepeaks module.
Please include the accompanying comments above the software packaging directives and beside the version string.
If the HEREDOC cannot be used because the script is not bash, the versions.yml MUST be written directly, e.g. ascat module.
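A minimal sketch of the HEREDOC approach described in this section is shown below. It runs as plain shell rather than inside a Nextflow process, so process_name is a hypothetical stand-in for the process name Nextflow would interpolate, and mytool is an invented tool that reports its version but exits with a non-zero code.

```shell
# Hypothetical stand-ins for the Nextflow process name and for a real tool.
process_name='MYTOOL_RUN'
mytool() { echo '1.2.3'; return 1; }   # prints a version, exits non-zero

# The shell expands the HEREDOC body before cat runs. The non-zero exit
# status of the command substitution is discarded; only cat's status counts.
cat <<END_VERSIONS > versions.yml
"${process_name}":
    mytool: $(mytool --version)
END_VERSIONS
status=$?

cat versions.yml
```

Because the whole file is written by one cat command, there is no risk of one tool's line overwriting another's, which is the concern with writing the file line-by-line.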
1.9 Presence of when statement
The process definition MUST NOT change the when statement.
when conditions can instead be supplied using the process.ext.when directive in a configuration file.
1.10 Capturing STDOUT and STDERR
In some cases, STDOUT and STDERR may need to be saved to file, for example for reporting purposes.
Use the shell command tee to simultaneously capture and preserve the streams.
This allows the streams to be captured by the job scheduler’s stream logging capabilities and printed to screen when Nextflow encounters an error.
This also ensures that they are captured by Nextflow.
If information is only written to files, it could potentially be lost when the job scheduler gives up the job allocation.
Similarly, if the tool captures STDOUT or STDERR to a file itself, it is best to send those to the corresponding streams as well. Since a timeout may mean execution is aborted, it may make most sense to have background tasks do that.
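The use of tee can be sketched as follows. The function mytool and the log file name are hypothetical; the function simply imitates a tool that writes to both streams.

```shell
# Hypothetical tool that writes a result to stdout and a warning to stderr.
mytool() {
    echo 'result: 42'
    echo 'warning: low coverage' >&2
}

# Fold stderr into stdout, then let tee save a copy to a log file while
# still passing everything on to the task's own stdout.
mytool 2>&1 | tee mytool.log
```

Note that in a plain shell the exit status of this pipeline is that of tee, so the tool's own exit code is only propagated when an option such as pipefail is active.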
1.11 Capturing exit codes
Occasionally, some tools do not exit with the expected exit code 0 upon successful use of the tool.
In these cases one can use the || operator to run another useful command when the exit code is not 0 (for example, testing if a file is not size 0).
See the Bash manual on file operators for examples of properties of files which could be tested.
Alternative suggestions include using grep -c to search for a valid string match, or another tool which will appropriately error when the expected output is not successfully created.
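As a small sketch of the || approach, assume a hypothetical tool (the function mytool below) that writes a valid output file but still exits with code 1. The file name result.bed is likewise made up for the example.

```shell
# Hypothetical tool that produces valid output but still exits with code 1.
mytool() {
    echo 'chr1 100 200' > result.bed
    return 1
}

# If the tool reports failure, fall back to checking that the output file
# exists and is not empty (-s). The line only fails if both checks fail.
mytool || test -s result.bed
status=$?
echo "exit status: $status"
```

A check like this accepts any non-empty output, so it is only appropriate when a non-empty file is a reliable sign of success for the tool in question.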
1.12 Stubs
1.12.1 Stub block must exist
A stub block MUST exist for all modules. This is a block of code that replaces the script command when the option -stub is set. This enables quick testing of the workflow logic, as a “dry-run”.
1.12.2 Stub block prefix and versions
The stub block MUST include the same variables (e.g. prefix) and HEREDOC code as the main script block.
1.12.3 Stub files for all output channels
The stub block MUST include the creation of at least one file for every output channel (both mandatory and optional), generated with touch, e.g.
Ideally, the stub block should reproduce, as closely as possible, the number and filename structure of the files expected as output.
1.12.4 Stub gzip files must use echo and pipe
Stub files that should be output as gzip compressed, MUST use the syntax in the following example:
Simply touching a file with the file name ending in .gz will break nf-test’s Gzip file parser, as the file is not actually gzipped and thus cannot be read.
Therefore we must make sure we generate a valid gzipped file for nf-test to accept it during tests.
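The stub rules above can be sketched in plain shell as follows. The prefix value and file extensions are hypothetical; in a real module the commands would sit inside the stub block and use the module's own prefix variable.

```shell
prefix='sample1'   # hypothetical prefix value for illustration

# Uncompressed outputs can simply be touched.
touch "${prefix}.txt"

# Compressed outputs need real gzip content: pipe an empty line through gzip.
echo "" | gzip > "${prefix}.vcf.gz"

# gzip -t verifies that the stub is a well-formed gzip file.
gzip -t "${prefix}.vcf.gz" && echo "valid gzip stub"
```

The echo and pipe form produces a tiny but genuine gzip stream, which is what allows a gzip-aware parser to open the stub file.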
2 Naming conventions
2.1 Name format of module files
The directory structure for the module name must be all lowercase, and without punctuation, e.g. modules/nf-core/bwa/mem/. The name of the software (i.e. bwa) and tool (i.e. mem) MUST be all one word.
Note that nf-core/tools will validate your suggested name.
2.2 Name format of module processes
The process name in the module file MUST be all uppercase e.g. process BWA_MEM {. The name of the software (i.e. BWA) and tool (i.e. MEM) MUST each be one word, separated by an underscore.
2.3 Name format of module parameters
All parameter names MUST follow the snake_case convention.
2.4 Name format of module functions
All function names MUST follow the camelCase convention.
2.5 Name format of module channels
Channel names MUST follow the snake_case convention and be all lower case.
2.6 Command file output naming
Output file (and/or directory) names SHOULD consist of only ${prefix} and the file-format suffix (e.g. ${prefix}.fq.gz or ${prefix}.bam).
- This is primarily for re-usability so that other developers have complete flexibility to name their output files however they wish when using the same module.
- As a result of using this syntax, if the module has the same named inputs and outputs then you can add a line in the script section which raises an error asking the developer to change the args.prefix variable to rename the output files so they don’t clash.
3 Input/output options
3.1 Required path channel inputs
Input channel path declarations MUST be defined for all possible input files (i.e. both required and optional files).
- Directly associated auxiliary files to an input file MAY be defined within the same input channel alongside the main input channel (e.g. BAM and BAI).
- Other generic auxiliary files used across different input files (e.g. common reference sequences) MAY be defined using a dedicated input channel (e.g. reference files).
3.2 Required val channel inputs
Input channel val declarations SHOULD be defined for all mandatory non-file inputs that are essential for the functioning of the tool (e.g. parameters, flags etc).
- Mandatory non-file inputs are options that the tool MUST have to be able to be run.
- These non-file inputs are typically booleans or strings, and must be documented as such in the corresponding entry in the meta.yml.
- Options, flags, parameters that are not required by the tool to function should NOT be included - rather these can be passed via ext.args.
It was decided by a vote amongst interested parties within the 2023 Maintainers group on 2023-02-28 to allow non-file mandatory input channels.
The reasoning behind this was that it is important to have documented (using the existing display on the website) the bare minimum information required for a module to run.
It also allows module code to consume parameter values without parsing them out of the ext.args string and reduces the risk of modules breaking entirely with future expected config changes at a Nextflow level.
Downsides to this approach are readability (now multiple places must be checked on how to modify a module execution - modules.config ext.args, the module invocation in pipeline code etc.), and reduced user freedom.
However, it was felt that stability in the ‘installation’ and ‘execution’ of modules was more important (e.g. for tools that require positional arguments).
When one and only one of multiple arguments is required:
- If they are all string arguments: use a single argument whose value is the chosen string,
  e.g. Parameter model of glimpse2 chunk
- If some are files: put them all in one channel and test that only one is present,
  e.g. Grouping output parameters of glimpse2 concordance
  if (((file1 ? 1:0) + (val1 ? 1:0) + (val2 ? 1:0)) != 1) error "One and only one argument required"
3.3 Output channel emissions
Named file extensions MUST be emitted for ALL output channels e.g. path "*.txt", emit: txt.
3.4 Optional inputs
Optional inputs are not currently supported by Nextflow.
However, passing an empty list ([]) instead of a file as a module parameter can be used to work around this issue.
For example, a module (MY_MODULE) that takes a cram channel and an optional fasta channel as input can be used in the following ways:
3.5 Optional outputs
Optional outputs SHOULD be marked as optional:
3.6 One output channel per output file type
Each output file type SHOULD be emitted in its own channel (and no more than one), along with the meta map if provided (the exception is the versions.yml).
In some cases the file format can be different between files of the same type or for the same function (e.g. indices: .bai and .crai). These different file formats SHOULD be part of the same output channel since they serve the same purpose and are mutually exclusive.
This approach simplifies the process of retrieving and processing specific types of output, as each type can be easily identified and accessed within its designated channel.
So when the output definition of a module called SAMTOOLS_INDEX looks like this:
The output files can be accessed like this:
This works regardless of whether they are bai or crai files, as downstream SAMTOOLS modules should accept either without an issue.
4 Documentation
4.1 Module documentation is required
Each module MUST have a meta.yml in the same directory as the main.nf of the module itself.
4.2 Number of keywords
Keywords SHOULD be sufficient to make the module findable through research domain, data types, and tool function keywords.
- Keywords MUST NOT consist solely of the (sub)tool name
For multi-tool modules, please add the keyword multi-tool, as well as all the (sub)tools involved.
4.3 Keyword formatting
Keywords MUST be all lower case
4.4 Documenting of all tools
The tools section MUST list every tool used in the module. For example
4.5 Documentation of args of each piped or multiple command
The tools section MUST have an args_id: field for every tool in the module that describes which $args ($args2, $args3) variable is used for that specific tool. A single tool module will only have args_id: "$args".
4.6 Required channel documentation
Input and Output sections of the meta.yml SHOULD only have entries of input and output channels
4.7 Documentation of tuples
Input and output tuples MUST be split into separate entries
- i.e., meta should be a separate entry to the file it is associated with
4.8 Input and output channel types
Input/output types MUST only be of the following categories: map, file, directory, string, boolean, integer, float, list
4.9 Correspondence of input/outputs entries to channels
Input/output entries MUST match a corresponding channel in the module itself
- There should be a one-to-one relationship between the module and the meta.yml
- Input/output entries MUST NOT combine multiple output channels
4.10 Useful input/output descriptions
Input/output descriptions SHOULD be descriptive of the contents of the file
- i.e., not just ‘A TSV file’
4.11 Input/output glob pattern
Input/output patterns (if present) MUST follow a Java glob pattern
4.12 Indication of input channel requirement
Input entries should be marked as Mandatory or Optional
5 Module parameters
5.1 Module input and outputs
A module file SHOULD only define input and output files as command-line parameters to be executed within the process.
5.2 Use of parameters within modules
All params within the module MUST only be initialised and used in the local context of the module.
In other words, named params defined in the parent workflow MUST NOT be assumed to be passed to the module to allow developers to call their parameters whatever they want.
In general, it may be more suitable to use additional input value channels to cater for such scenarios.
5.3 Specification of multiple-threads or cores
If the tool supports multi-threading then you MUST provide the appropriate parameter using the Nextflow task variable e.g. --threads $task.cpus.
5.4 Evaluation of parameter within a module
Any parameters that need to be evaluated in the context of a particular sample e.g. single-end/paired-end data MUST also be defined within the process.
6 Resource requirements
6.1 Use of labels in modules
An appropriate resource label MUST be provided for the module as listed in the nf-core pipeline template e.g. process_single, process_low, process_medium or process_high.
6.2 Source of multiple threads or cores value
If the tool supports multi-threading then you MUST provide the appropriate parameter using the Nextflow task variable e.g. --threads $task.cpus.
If the tool does not support multi-threading, consider process_single unless large amounts of RAM are required.
6.3 Specifying multiple threads for piped commands
If a module contains multiple tools that support multi-threading (e.g. piping output into a samtools command), you can assign CPUs per tool.
- Note that task.cpus is supplied unchanged when a process uses multiple cores.
- If one tool is multi-threaded and another uses a single thread, you can specify the value directly in the command itself, e.g. with ${task.cpus}.
7 Software requirements
BioContainers is a registry of Docker and Singularity containers automatically created from all of the software packages on Bioconda. Where possible we will use BioContainers to fetch pre-built software containers and Bioconda to install software using Conda.
7.1 Use of container directives
Software requirements SHOULD be declared within the module file using the Nextflow container directive.
For single-tool BioContainers, the nf-core modules create command will automatically fetch and fill-in the appropriate Conda / Docker / Singularity definitions by parsing the information provided in the first part of the module name:
7.2 Use of conda directive
If the software is available on Conda it MUST also be defined in an environment.yml file alongside the main.nf of the module, and passed to the Nextflow conda directive within main.nf.
Using bioconda::bwa=0.7.17 as an example, software MUST be pinned to the channel (i.e. bioconda) and version (i.e. 0.7.17).
Conda packages MUST NOT be pinned to a build because they can vary on different platforms.
7.3 Re-use of multi-tool containers
If required, multi-tool containers may also be available on BioContainers e.g. bwa and samtools.
You can install and use the galaxy-tool-util package to search for both single- and multi-tool containers available in Conda, Docker and Singularity format.
E.g. to search for Docker (hosted on Quay.io) and Singularity multi-tool containers with both bowtie and samtools installed you can use the following command:
Build information for all tools within a multi-tool container can be obtained in the /usr/local/conda-meta/history file within the container.
7.4 Creation of new multi-tool containers
It is also possible for a new multi-tool container to be built and added to BioContainers by submitting a pull request on their multi-package-containers repository.
- Fork the multi-package-containers repository
- Make a change to the hash.tsv file in the combinations directory; see here for an example where pysam=0.16.0.1,biopython=1.78 was added.
- Commit the code and then make a pull request to the original repo, for example
- Once the PR has been accepted a container will get built and you can find it using a search tool in the galaxy-tool-util conda package
- You can copy and paste the mulled-* path into the relevant Docker and Singularity lines in the Nextflow process definition of your module
- To confirm that this is correct, spin up a temporary Docker container
  And in the command prompt type
  The packages should reflect those added to the multi-package-containers repo hash.tsv file
- If the multi-tool container already exists and you want to obtain the mulled-* path, you can use this helper tool.
7.5 Software not on Bioconda
If the software is not available on Bioconda a Dockerfile MUST be provided within the module directory. We will use GitHub Actions to auto-build the containers on the GitHub Packages registry.
8 Testing
8.1 Snapshots
Only one snapshot is allowed per module test, which SHOULD contain all assertions present in this test. Having multiple snapshots per test will make the snapshot file less readable.
All output channels SHOULD be present in the snapshot for each test, or at a minimum, it MUST contain some verification that the file exists.
Thus by default, the then block of a test should contain this:
When the snapshot is unstable, another method MUST be used to test the output files. See nf-test assertions for examples on how to do this.
8.2 Stub tests
A stub test MUST exist for the module.
8.3 Tags
Tags for any dependent modules MUST be specified to ensure changes to upstream modules will re-trigger tests for the current module.
8.4 assertAll()
The assertAll() function MUST be used to specify an assertion, and there MUST be a minimum of one success assertion and versions in the snapshot.
8.5 Assert each type of input and output
There SHOULD be a test and assertions for each type of input and output.
Different assertion types should be used if a straightforward process.out snapshot is not feasible.
Always check the snapshot to ensure that all outputs are correct! For example, make sure there are no md5sums representing empty files (with the exception of stub tests!).
8.6 Test names
Test names SHOULD describe the test dataset and configuration used. Some examples are below:
8.7 Input data
Input data SHOULD be referenced with the modules_testdata_base_path parameter:
9 Misc
9.1 General module code formatting
All code MUST be aligned to follow the ‘Harshil Alignment™️’ format.