
The "Maintainers Minutes" aims to give further insight into the workings of the nf-core maintainers team by providing brief summaries of the monthly team meetings.
Overview
After a short summer break, we returned with a special maintainers meeting dedicated to nf-core/test-datasets. nf-core/test-datasets is a GitHub repository that holds the majority of the files we use for the CI testing of our modules, subworkflows, and pipelines.
Interacting with this part of our infrastructure is currently one of the weaker developer experiences in nf-core. This has been identified from qualitative impressions of the community gathered by the maintainers and core teams, confirmed by results from the nf-core community survey earlier this year, and from our own experiences.
As a first step in overhauling this experience, the last major nf-core/tools release added a new sub-command by Julian Flesch (@JulianFlesch) to help explore the available data files in the nf-core/test-datasets repository. However, this only alleviates the symptoms of the problem: identifying suitable files, knowing where to put new files, and knowing what is within each file. Instead, we want to restructure the repository and develop clearer specifications, documentation, and procedures for it.
Therefore this month's meeting was "taken over" by the #wg-test-dataset-task-force leads Simon (@SPPearce) and James (@jfy133) to start the process of redesigning the structure and documentation.
Scope of discussions
Some things we agreed on throughout the meeting to limit the scope of the discussions were:
- We are primarily trying to address modules test-datasets (not pipelines)
- We agreed we want to "start from scratch" rather than try to adjust the existing repository
Location

One of the larger discussions we had was where the test-dataset files should go: should we move them to a new service?
We defined a set of criteria that we wanted to meet with the new location:
- Much faster to download (or clone)
- Support directories
- Free (or cheap) hosting
- Doesn't charge for ingress/egress
The pros of continuing using GitHub were:
- ✅ Familiarity of our users with the interface (e.g. for reviewing)
- ✅ It stays within our existing infrastructure
- ✅ The 10 MB file limit is a good thing (forcing developers to ensure their tests are fast)
The cons of GitHub were:
- ❌ (Currently) makes a very large repository for cloning
- ❌ It only supports HTTPS/SSH interaction, so you cannot pass directories from the repository to Nextflow (where only S3 filesystems are supported for directory input)
- ❌ The 10 MB file limit is a bad thing (some developers cannot physically get their data files that small, e.g. imaging)
- ❌ It is hard to view the contents of any non-raw text file
Alternative solutions were proposed:
- HuggingFace
  - ✅ Suggested by Edmund (@edmundmiller) as it has a similar interface to GitHub (thus would be familiar)
  - ✅ Much less restrictive file sizes (up to 5 GB per file, and no maximum number of files)
  - ❌ But is outside our infrastructure
  - ❌ Is actually just `git-lfs` under the hood, so doesn't provide much difference to GitHub (which also supports `git-lfs`)
  - ❌ It would require a separate team organisation (not everyone could join and have access)
- AWS S3:
  - ✅ Our test-datasets are actually already "backed up" here
  - ✅ This is already relatively well supported by our infrastructure and Nextflow (e.g. directory inputs)
  - ✅ Anabella (@atrigila) showed services such as 42basepairs that provide ways to see inside common bioinformatics file formats of files on S3
  - ❌ We were very worried about ingress/egress costs (particularly in our very parallelised CI tests)
  - ❌ We did not have an immediate solution for how community members could "submit" to a controlled bucket (for cost reasons)
  - ❌ We weren't sure about the longevity of services like 42basepairs
- Cloudflare R2
  - ✅ No ingress/egress fees, a "flat rate" for hosting based on the amount stored
  - ✅ S3 filesystem
  - ❌ Would maybe need to ask for open-source credits… but no idea if available
Our main conclusions from these discussions were:
- Turn on `git-lfs` for the existing GitHub repository already, to make it at least easier to clone
- Edmund will investigate the Cloudflare option to get more information on its pros and cons
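For reference, turning on Git LFS mostly comes down to tracking file patterns in `.gitattributes`; a minimal sketch of what that involves (the `*.bam` pattern below is a hypothetical example, not an agreed convention):

```shell
# One-time setup inside a clone of the repository would be:
#   git lfs install
#   git lfs track "*.bam"
# `git lfs track` simply appends one line per pattern to .gitattributes.
# The line it writes looks like this (written directly here for illustration):
workdir=$(mktemp -d)
echo '*.bam filter=lfs diff=lfs merge=lfs -text' > "$workdir/.gitattributes"
cat "$workdir/.gitattributes"
```

Note that files already committed would likely also need a history rewrite (e.g. `git lfs migrate import`) to actually move their contents into LFS storage, which is a larger decision for an existing shared repository.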
Documentation

Next we moved onto documentation.
Trying to find out what is in a module test-data file, how it was generated, and how it links to other data files within nf-core/test-datasets is a common pain point for maintainers and community members alike. Currently this relies on both the directory structure of the repository and a haphazard, inconsistent README file in the root of the modules branch.
We had a brainstorming session on what sort of information we would like to record about each test data file:
- Keywords
- Is it real or simulated data?
- Is it a tool specific file vs a generic file?
- Command(s) used to generate it
- Version(s) of the tool(s) used for generation
- Source location of any upstream files
- Who created it (author)
- Bioinformatics specific metadata
- Organism derived from
- Whole genome
- Chromosomes embedded
- Individuals
- Genome version
- Panel
- Support âgroupedâ files (e.g. in bioinformatics paired-end reads, ped/bim/fam, bam/bai)
We then thought about different ideas for how to store such metadata:
- Using a stricter, more descriptive file-naming scheme to record metadata about each file, plus a table aggregating all the files
- A prose-based `README` markdown file next to each data file
- A `meta.yaml` file next to each data file, akin to the nf-core/modules YAML files
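As a rough sketch of the `meta.yaml` idea, combining the attributes brainstormed above (all field names and values here are hypothetical illustrations, nothing is agreed yet):

```yaml
# Hypothetical meta.yaml sitting next to a data file (field names not agreed)
file: genome.fasta
description: Small reference genome subset for fast CI tests
keywords: [genomics, reference, fasta]
simulated: false
tool_specific: false
generation:
  command: "samtools faidx full_genome.fasta chr21 > genome.fasta"
  tools:
    - samtools=1.17
  upstream_source: "https://example.com/source-of-the-full-genome"
author: "@your-github-handle"
bioinformatics:
  organism: Homo sapiens
  genome_version: GRCh38
  chromosomes: [chr21]
grouped_with: [genome.fasta.fai]
```

One appeal of this option over a prose README is that a structured file like this could be validated in CI and consumed by tooling (e.g. the new nf-core/tools sub-command) rather than only read by humans.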
Our primary conclusion here was that we needed to consult the community as to what other attributes they feel they need for test-data files. In particular we will try to contact different disciplines e.g. via the Special Interest Groups - particularly outside of bioinformatics - to ensure a consensus.
Structure

Finally, we briefly touched on the structure for the repository.
During the session there was a general feeling that we wanted per-tool documentation rather than one mega README file.
Assuming we stick with a GitHub interface, we want to remove the "empty" master branch and have the module test-data files as the primary landing page.
Simon (@SPPearce) also proposed having modules and pipeline test-data in separate locations to make it easier to find the right files and reduce the size.
However, the structure will depend somewhat on the location we choose, so we will wait for the outcome of the location discussions before continuing here. For example, if we were to follow an object-storage concept, we could go with "chaos" (no directory structure at all), with everything organised and guided via the metadata and a user-interface layer (as previously proposed by Maxime (@maxulysse)).
Additional considerations
Other points that were brought up included:
- We should try to somehow "version" test-data files, e.g. using GitHub URLs pointing to a specific commit hash, to reduce the risk of tests breaking if someone changes the contents of a test file (although this shouldn't happen) (Jon (@pinin4fjords))
- We could maybe consider a "spill-over" location in case we stick with GitHub and the 10 MB limit is too restrictive for some tools' test-data (which would reduce costs) (Louis (@LouisLeNezet))
- Is there a way to automatically identify data files that have never been used, so we can clean them up to save costs? (Famke (@famosab))
- None of the maintainers present were aware of anything like this, but if a community member has an idea please let us know!!
- Should we allow "copying" of a tool's own test-data files, or always make our own, derived from our existing files (where possible)?
- Could we use an MCP agent to auto-annotate files with metadata as a first pass? Some nf-core members have experience with these (Igor (@itrujnara))
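The hash-pinning idea above is already possible with raw GitHub URLs today: replacing the branch name in the URL with a commit SHA makes the link immutable, so later edits to the file cannot break a test that uses it (the SHA below is a hypothetical placeholder, and the file path is purely illustrative):

```shell
# Branch-based URL: mutable, contents change if the file is edited on the branch
branch_url="https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.fasta"

# Pinned URL: swap the branch name for a commit SHA to freeze the contents
commit="0123456789abcdef0123456789abcdef01234567"  # hypothetical SHA
pinned_url="https://raw.githubusercontent.com/nf-core/test-datasets/${commit}/data/genomics/sarscov2/genome/genome.fasta"
echo "$pinned_url"
```

The trade-off is readability: pinned URLs are harder to skim in test files, so any convention here would likely need tooling support.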
And of course, we agreed all the decisions above should be converted into nf-core/proposal RFCs to facilitate wider community discussions (these will be announced on GitHub when posted!).
The end
All of the above are just starting points for discussions, and we will continue to work on these topics in the coming months. We will need a large amount of input from the wider community to ensure everyone gets the best possible experience, so we encourage anyone with thoughts and feedback on the above to join the #wg-test-data-task-force channel and post their ideas there!
As always, if you want to get involved and give your input, join the discussion on relevant PRs and Slack threads!
- ❤️ from your #maintainers team!