Overview

The nf-core community currently hosts two sets of test data, used to execute CI tests and validate results: one for pipelines and one for modules.

To keep things consistent and the test data repositories small, we provide the following guidelines and recommendations, which developers should aim to follow when adding new data.

The general philosophy to keep in mind for all new test data is: as small as possible, as large as necessary.

Reviewing

When reviewing, all modules should be checked to ensure they follow these test-data specifications.

Modules

Re-use existing test data

Use what already exists! The priority is to use in your module tests whatever is already available on the modules branch of the nf-core/test-datasets repository.

If a test data file for your module can be quickly generated by an upstream module, don’t add the input file to the test data; run the upstream module in the test for your module. For example, if your module requires the generation of a very small index file, include the indexing module in the test main.nf file and pass the output channel of the index to the new module.
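
A minimal sketch of such a test main.nf is shown below. Here INDEX_TOOL and MY_TOOL are hypothetical module names, and the include paths and channel shapes will depend on the actual modules involved:

    #!/usr/bin/env nextflow

    nextflow.enable.dsl = 2

    // Hypothetical modules: an indexer and the module under test
    include { INDEX_TOOL } from '../../../../modules/index_tool/main.nf'
    include { MY_TOOL    } from '../../../../modules/my_tool/main.nf'

    workflow test_my_tool {
        fasta = file(params.test_data['sarscov2']['genome']['genome_fasta'], checkIfExists: true)

        INDEX_TOOL ( fasta )              // generate the small index on the fly...
        MY_TOOL ( INDEX_TOOL.out.index )  // ...instead of storing it in test-datasets
    }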

In contrast, if generating the file would require many steps or heavy CPU usage, you should store such additional files in the test-datasets repository (see below).

Adding new test data

  1. Always ask in the #modules or #test-data channels on the nf-core Slack before adding new test data.

  2. Add any new test data via a pull request from the corresponding branch of a personal fork of the nf-core/test-datasets repository.

    • For example, to add test data for a module, upload it to the modules branch of your fork and open the PR against the modules branch of the nf-core/test-datasets repo. For a pipeline, e.g. nf-core/mag, upload to the mag branch of your fork and open the PR against the mag branch of the nf-core/test-datasets repo.
  3. New bioinformatic test data files should be generated from the existing collections of test data as far as possible. See field-specific guidance below.

    For example, if you were adding new genomic test data and needed to create a specific type of genotyping file format for a particular tool, you should use the SARS-CoV-2 or Homo sapiens BAM files as input when generating the file you will need to use in your module tests. The resulting genotyping file should then be stored alongside the BAM file, or in the corresponding file-format directory.

  4. In the worst-case scenario, where you cannot (re)use the existing data in the field-specific collections, the delete_me/ folder may be used.

    • We discourage the use of this directory as far as possible.
    • If you do use it, we ask you to replace anything added there with proper test data based on the existing datasets as soon as possible.
  5. For non-bioinformatic specific files (like simple text files, or tables), you can place these in the generic/ directory.

  6. Files must be as small as possible: aggressively subsample as much as you can.

  7. Test data must be publicly available and have licenses that allow public reuse.

  8. Files should generally be organised in directories based on their file extension.

  9. Any test data files should be named after the upstream file, with the corresponding new extension. For example, if you used genome.fasta as the upstream file, your output file should be called genome.<new_extension>.

  10. You must update the README file under the Data Description section of the modules branch of the test-datasets repository to describe what the file is and how it was generated (except for files added to delete_me/).

  11. The test data pull request requires a review to be merged.

  12. Once your pull request has been merged into nf-core/test-datasets, make another PR into nf-core/modules to add your file as a new entry to tests/config/test_data.config (see the sketch after this list).

    • This only applies to the field-specific collections; delete_me/ files are excluded from this, and raw GitHub URLs should be used directly in tests.
    • The ‘key’ for each URL should follow the style of the full file name with extensions, but with underscores rather than full stops, e.g. genome.fa.gz would become genome_fa_gz.
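
For illustration, a minimal sketch of such an entry, following the nested structure of test_data.config (the URL points at the raw modules branch; the genome.fa.gz file is just the hypothetical example from above):

    params {
        test_data {
            'sarscov2' {
                'genome' {
                    // key = file name with underscores instead of full stops
                    genome_fa_gz = "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fa.gz"
                }
            }
        }
    }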

Field specific guidance

Genomics

For genomics, we aim to focus on a restricted set of reference genomes from different organisms, mostly selected for their small genome sizes.

  • SARS-CoV-2
  • Homo sapiens (Chr 21)
  • Prokaryotes:
    • Bacteroides fragilis
    • Candidatus Portiera aleyrodidarum
    • Haemophilus influenzae

All of these organisms have reference genomes, raw FASTQ files, BAM files, etc.

For each organism, reference files are stored in a genome folder (e.g. fasta, gtf, …), and technology-specific raw-data files in illumina, nanopore, pacbio, and cooler subfolders, whenever available. Together, these contain all the typical data required for genomics modules, such as fasta, fastq, and bam files.
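
For orientation, a hypothetical slice of this layout (directory and file names are illustrative):

    data/genomics/sarscov2/
    ├── genome/        # reference files, e.g. genome.fasta, genome.gtf
    └── illumina/      # technology-specific raw data
        ├── fastq/     # e.g. test_1.fastq.gz, test_2.fastq.gz
        └── bam/       # e.g. test.paired_end.bam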

For genomics, the order of preference when extending the collections is generally the order of the list above (i.e. try to use the SARS-CoV-2 data first).

If you need to add a new species, you must discuss this with the nf-core community via the #modules or #test-data Slack channels prior to adding it.

Pangenomics

The pangenomics folder contains subfolders for all organisms for which test data is available. At the moment, there is one organism available:

  • Homo sapiens

The folder is structured in the following way: any pangenome file that is not specific to a particular tool (e.g. PAF, GFA, …) is located in the pangenome folder, and software-specific binary files in the odgi subfolder. The collection contains all the typical data required for pangenomics modules, such as PAF and GFA files, including the binary ODGI and LAY formats. Every folder in pangenomics corresponds to a single organism. For every data file, a short description of how it was generated is available, either in this description or in the respective subfolder. All files in the pangenomics folder originate from a PGGB run using the HLA V-352962 gene FASTA.

Other fields

Please check the specific directories in test-datasets for other fields.

Pipelines

Each pipeline has its own dedicated branch on the test-datasets repository.

The guidance for these test datasets is generally less strict, as each pipeline will require its own structure.

These will generally consist of:

  • Small raw input files (FASTQ, FASTA, etc.)
  • Samplesheets (a minimal example is sketched below)
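
For example, a minimal samplesheet might look like the following (the column names are hypothetical and entirely pipeline-specific):

    sample,fastq_1,fastq_2
    test_sample1,test_1.fastq.gz,test_2.fastq.gz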

However, similar concepts to those for modules generally also apply:

  • Re-use data as far as possible
    • You are welcome to use the module test data instead of storing your own copy in the pipeline-specific branch
  • Test data should be as small as possible and as large as necessary (subsampling is recommended if possible)
  • The test datasets should be documented in the README of the pipeline’s branch:
    • What files exist
    • How they were generated
    • Acknowledgments of sources of public data