In this tutorial we will see how to create a new module for the nf-core modules repository. As an example, we will create a module to execute the FastqToBam function of the FGBIO suite of tools.
Introduction
If you create a new module with the goal of contributing the code to nf-core, we recommend that you familiarise yourself with the community guidelines and use nf-core tools, as explained below.
Module guidelines
The nf-core community has agreed on a minimal set of guidelines, intended to make modules as suitable as possible for general use, i.e. to be shared across a wide variety of community workflows.
nf-core tools
Using nf-core tools is the best way to adhere to the guidelines without worrying too much or writing things from scratch. On the website you can find more details about installation and about all the functionality available for modules.
Test data
Even before beginning the development of a module, you should identify a small dataset you can use to test its functionality. Ideally, the dataset already exists and can be shared with other test workflows for other modules.
♻️ This is in active development: keep an eye out for available test data here, and for how to access it using a config file (see this change).
If you must create your own test data, make sure you follow the test data guidelines.
Fork the nf-core/modules repository and branch
The first step in contributing a module to the community repository is to fork nf-core modules into your own account or organisation. To do this, click on the top-right of the nf-core modules repository and choose “fork”, as shown in the figure below.
You then choose the account or organisation you want to fork the repository into. Once forked, you can commit all changes you need into the new repository.
In order to create a new module, it is best to branch the code into a recognisable branch. You can do this in two ways.
- You can create a new branch locally, on the terminal, using git checkout -b followed by the name of the new branch. The branch will be synchronised with your remote once you push the first new commit.
- You can use the GitHub interface. To do this, select the dropdown menu on the top-left of your repository code, write the name of the new branch and choose to create it, as shown below. You will then sync this locally (ideally, you clone the forked repository to your working environment to edit code more comfortably).
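A minimal sketch of the first option, assuming git is installed. The branch name fgbio-fastqtobam is only an illustration, and a throwaway repository stands in for your clone of the fork so that the commands can be tried without touching real code.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical branch name; pick one that describes your module.
branch="fgbio-fastqtobam"

# A throwaway repository stands in for your local clone of the fork.
workdir="$(mktemp -d)"
git init --quiet "$workdir/modules"
cd "$workdir/modules"

# Create the new branch and switch to it in one step.
git checkout --quiet -b "$branch"

# Confirm which branch HEAD now points to.
current_branch="$(git symbolic-ref --short HEAD)"
echo "Now on branch: $current_branch"

# In a real clone, the first push publishes the branch to your fork:
#   git push --set-upstream origin "$branch"
```

In your real clone you would skip the mktemp and git init lines and run only the checkout, followed by the push once you have a first commit.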
Create the module template
Using nf-core/tools it is very easy to create a new module. In our example, we change directory into the repository (modules) and run the nf-core modules create command, passing the module name as two words separated by a forward slash:
- The first word indicates the tool (i.e. the software or suite)
- We separate the keys by a forward slash (/)
- The second word indicates the function of the tool we are creating a module for
nf-core tools will set up an empty module for you based on a template. Follow the instructions on the command-line and a new module will be created in the fgbio folder.
One parameter you have to specify during the creation is the process resource label, based on the process label in the base.config files in workflows. You have to select one of the following labels:
- process_single
- process_low
- process_medium
- process_high
- process_long
- process_high_memory
For now, you can just select the default and continue.
In the next step you have to specify if you need a meta map with sample-specific information. Our test module will need a meta map, so select yes for the moment. If you want to know what exactly meta maps are, check out the documentation here.
Magic will happen now: nf-core tools will create the following entries for the code of the module itself and also for the testing of the module
Each of the files is pre-filled according to a defined nf-core template.
You will find a number of commented sections in the file, to help you modify the code while adhering to the guidelines, as you can see in the following figure.
The above represents the main code of your module, which will need to be changed. nf-core tools will attempt to retrieve the correct containers (for Docker and for Singularity) as well as the Conda recipe, and these will be pre-filled for you.
Now you just have to write the code.
Write the code
The FGBIO command line for the function FastqToBam looks like the following:
Here you should first identify:
- the inputs you need, which are mandatory
- the inputs / arguments, which are optional
- the outputs
- any value or variable you might need, associated with the sample (for example, the sample ID or other metadata)
Make sure to check the guidelines for what you should and should not include.
Inputs and Outputs
As described in the guidelines, any information that will differ between multiple runs of the same module in a workflow (e.g. when a module is run across many samples as in fgbio) should be passed as an input, as part of a Groovy map called meta. This map is part of a tuple which also includes the read file(s).
In our case, FGBIO also has a mandatory argument which is not sample-specific, i.e. the read structure: this refers to the position and structure of the UMI barcode in the read. Such information will be the same for all samples and is characteristic of the kit used to prepare the sequencing library. Since it is not sample-specific, we will not include it in the meta map. Since it is a mandatory argument, we have decided to add it to the input list: in this way, it will be visible to others who wish to reuse this module, and it will be described explicitly in the metadata YAML file.
Therefore, once we modify the template accordingly, our inputs and outputs will look like this:
Passing optional args
Within nf-core modules, any optional non-file parameters should be passed within a variable called args. At a pipeline level, these arguments are pulled into the modules via an ext.args variable that is defined in a modules.config file. The task in task.ext.args and task.ext.prefix refers to the name of your module.
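The idea can be illustrated with a plain-bash analogy. This is a sketch and not Nextflow code: build_command and example_tool are hypothetical names, the second argument stands in for task.ext.args, and the third for task.ext.prefix. The sketch assumes that optional arguments default to an empty string and that the output prefix falls back to the sample ID when nothing is configured.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: assembles a command line from a sample ID,
# optional extra arguments and an optional output prefix.
build_command() {
    local sample_id="$1"
    local args="${2:-}"              # stands in for task.ext.args (default: empty)
    local prefix="${3:-$sample_id}"  # stands in for task.ext.prefix (default: sample ID)
    local -a cmd=(example_tool run)
    # Left unquoted on purpose so that "--a 1 --b 2" splits into separate words.
    cmd+=($args)
    cmd+=(--output "${prefix}.bam")
    echo "${cmd[*]}"
}

# Nothing configured: no extra arguments, prefix taken from the sample ID.
build_command "sample1"

# Extra arguments and a custom prefix supplied by the user.
build_command "sample1" "--sort true" "sample1_custom"
```

The module code itself stays the same in both calls; only the values supplied from outside change, which is what makes the module reusable across pipelines.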
We can now substitute all our parameters with our predefined inputs, outputs and args. With the modules.config file, any other user can run the module with additional arguments of their choice. After the substitutions, the code of the script looks as follows:
Export the version of the tool
Before wrapping up our code, we need to add a line to output the software version. This must go in the HEREDOC section at the end of the script block.
Note that the template you generate will include an example samtools command! Make sure to replace this!
Usually, a tool prints its version with a command similar to this
However, in some cases the software prints the version to stderr and exits with a non-zero code, which Nextflow interprets as the process having ended with an error.
In order to avoid that, we can in general print the version as part of an echo statement, like this
or pipe the output as follows
Notice the escape \$ of the first $ sign to distinguish between bash variables and Nextflow variables.
- sed '3!d' extracts only line 3 of the output printed by samtools --version
- Determine whether the output is printed to stderr or stdout by trying to filter the line with sed
- If it works, then you’re reading from stdout; otherwise you need to capture stderr using |&, which is shorthand for 2>&1 |
- sed 's/pattern/replacement/' can be used to remove parts of a string. In the pattern, . matches any character and + matches 1 or more times
- You can separate sed commands using a semicolon (;). Often the pattern sed filter line ; replace string is enough to get the version number
- It is not necessary to use echo
- For a non-zero error code: command --version || true or command --version | sed ... || true
- If the version is at a specific line you can try sed -nr '/pattern/p', which will return only the line with the pattern
- To extract the version number in the middle you can also use a regex pattern with grep, as follows: grep -o -E '([0-9]+\.){1,2}[0-9]'
- If multiple lines are returned you can select the first one with tool --version | head -n 1
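Several of these tips can be tried out in plain bash. The sketch below uses a hypothetical fake_tool function in place of a real program: it is assumed to print a three-line banner to stderr and to exit with a non-zero code. The process name EXAMPLE_PROCESS and the temporary output file are placeholders as well.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a real tool: prints its banner to stderr
# and exits with a non-zero code, as some tools do for --version.
fake_tool() {
    {
        echo "Some banner text"
        echo "More banner text"
        echo "Version: 1.2.3"
    } >&2
    return 1
}

# 2>&1 merges stderr into stdout so that sed can read it.
# '3!d' keeps only line 3, 's/.* //' strips everything up to the last space,
# and '|| true' stops the non-zero exit code from aborting the script.
version_sed="$(fake_tool --version 2>&1 | sed '3!d; s/.* //' || true)"

# Alternative: extract the version-like token with grep and keep the first hit.
# The dot is escaped so that it only matches a literal dot.
version_grep="$(fake_tool --version 2>&1 | grep -o -E '([0-9]+\.){1,2}[0-9]' | head -n 1 || true)"

echo "sed:  $version_sed"
echo "grep: $version_grep"

# Write the result in a versions.yml-like layout. This is plain bash, so the
# $ is not escaped; inside a Nextflow script block, bash-side $ signs
# would be written as \$.
versions_file="$(mktemp)"
cat > "$versions_file" <<END_VERSIONS
"EXAMPLE_PROCESS":
    fake_tool: ${version_sed}
END_VERSIONS
cat "$versions_file"
```

Removing the `|| true` from either pipeline makes the script stop at that line under `set -e -o pipefail`, which mirrors the failure described above.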
Unfortunately, FGBIO manages to cause an error exit even with this solution, and we are therefore forced to use a few bash tricks to re-route the version and format it to be just the semantic number.
This may take a bit of time to get right.
Once that’s complete, our final script will therefore look like this:
It is always good practice to commit regularly while you write the code, and to describe each commit with a meaningful message. This way, you will be able to revert the changes at any time.
Fill in the meta.yml
Once the main module code is written, it is often a good point to fill in the meta.yml file sitting alongside the main.nf of the module.
Here you will document keywords and context information about the module and, most importantly, the input and output requirements. In general, it follows a similar shape to the pipeline schema, but it is not a JSON file. At the top you should add the name of the module, a short description and at least three keywords which describe the module. Afterwards, describe all the tools used, usually only one. The main part of the meta.yml should be about the input and output requirements, which follow the same fields as the pipeline schema for a file parameter. For each input and output requirement you have to add a type, a short description of the content and a pattern. The last block contains the authors who worked on the module, to allow other users to easily reach out to them. If you are the main developer of the module, your GitHub name will be automatically added to the meta.yml. The types in the meta.yml are limited to map, file, directory, string, integer and float. In this example module, the pre-built meta.yml is already filled in and the input part looks as follows:
Maps and files as shown above are the two main input / output requirements. For the other input / output types check out the guidelines and other modules.
Lint your code
Now that you’ve completed code development, you are ready to check if your code is clean and up to standards.
This can also be done easily using nf-core tools, just by changing folder into the parent modules directory and running the nf-core modules lint command followed by the name of your module.
You should expect no failed tests, as shown in the figure below:
For more information on fixing linting errors in your code, both locally and directly in your pull request on GitHub, see the end of this subsection here.
Test your code
Once your code is polished, following any suggestions from linting, you should test the code and make sure everything works as expected. This can also be done automatically, using nf-test.
Create a test workflow
As described above, nf-core tools has already created the following files, ready for you to modify.
Create a test snapshot
We are using nf-test as our testing framework. You can find more information at nf-test official docs and in this bytesize talk.
Snapshots are a useful tool to prevent unexpected changes in the output of a process or workflow, as described in the docs.
Once all your tests pass successfully, it’s time to submit the module to nf-core/modules!
Create a Pull Request
Back on GitHub, creating a Pull Request is very simple: on the top right of your repository you can click on the link “Pull request” as shown in the figure below:
If you have initiated the pull request from your forked repository, the direction of the request should be indicated by the arrow, as in the picture below, i.e. from your fork to the original nf-core repository.
You can find more information on the GitHub guide and the nf-core talk Bytesize 4: GitHub contribution basics.
Make sure you are submitting the newly created branch, where your new module has been developed, into the master branch of nf-core modules.
A pull request will be created: volunteers will review your code and will use comments and requests for changes on GitHub to interact with you and suggest changes if necessary. This is a collaborative and very interesting part of your development work.
Enjoy!