Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These talks are recorded and made available at https://nf-co.re, helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!
This week, Alexander Peltzer (@apeltzer) will present: Software packaging.
This will include how to package a tool in Bioconda, and how that ends up in BioContainers where we can use it in pipelines.
Video transcription
The content has been edited to make it reader-friendly.
There is a tool called conda-build that you have to install manually; if you install Conda, it is not always there, but you can use Conda to install conda-build. That sets up an environment in which you can also test building your recipe locally, which gives you a chance to catch errors before actually pushing anything to Bioconda. If you follow these steps, you should usually get at least a half-functional recipe out of it, I would say, if you are lucky. At least for me, that has helped most when starting from a PyPI package that is already well built.
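As a rough illustration, a minimal local test run could look like the sketch below; the recipe path and channel order are placeholders, not something shown in the talk.

```bash
# conda-build is not installed by default, so add it to the base environment
conda install -n base conda-build

# build the recipe in ./recipes/mytool (containing meta.yaml and build.sh),
# pulling dependencies from the usual channels
conda build recipes/mytool -c conda-forge -c bioconda
```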
Such an example recipe could look like this. Usually there is just a build.sh script, which is used in the build step of the recipe, and a meta.yaml file, which describes the content of the recipe. Usually people set the version of the tool or package at the top and then just refer to it in the version string further down. The build number needs to be changed at some point: if you, for example, need to rebuild a recipe without changing the tool version, you have to increase it. You also have to list the source URL, and this has to be a fixed URL, so it cannot be one that is overwritten all the time. […] Then come the requirements to build, to host, and to run, which are listed further down in the recipe. This is just an example; there are much more complicated recipes out there, but also much simpler ones. This one is a C/C++ tool, which means that make and the C compilers have to be present, for example.
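For orientation, a hypothetical meta.yaml following that structure might look like this; the name, version, URL and checksum are placeholders, not a real Bioconda recipe, and the accompanying build.sh would typically just run the tool's configure/make/install steps.

```yaml
{% set version = "1.0.0" %}

package:
  name: mytool
  version: "{{ version }}"

source:
  # fixed URL to one specific release tarball, plus its checksum
  url: https://github.com/example/mytool/archive/v{{ version }}.tar.gz
  sha256: "<checksum of this exact tarball>"

build:
  number: 0   # increase only when rebuilding the same version

requirements:
  build:
    - make
    - {{ compiler('c') }}
  host:
    - zlib
  run:
    - zlib

test:
  commands:
    - mytool --version
```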
Once you are done writing that recipe, you can submit a pull request to Bioconda and wait for the automated build and linting checks to hopefully tell you that your recipe is in order and that everything that needs to be done has been done properly. However, I have to mention here again that Bioconda and conda-forge are slightly different; they have a somewhat different setup. In Bioconda, everything lives in one big master repository, whereas conda-forge starts a bit differently. How that difference plays out in the end is described in the documentation I linked on one of the first slides; we cannot cover it fully here. If you are lucky and everything builds fine, then once somebody from the community reviews and approves your recipe, it will be merged. Your recipe will then automatically be available in the Bioconda or conda-forge package index within a couple of minutes, although sometimes it takes a couple of hours, depending on how fast the synchronization works.
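For the Bioconda side, the contribution workflow is roughly the following sketch; the fork URL and the tool name are placeholders, and the exact layout is described in the Bioconda documentation mentioned above.

```bash
# clone your fork of bioconda-recipes and create a branch for the new recipe
git clone https://github.com/<your-github-user>/bioconda-recipes.git
cd bioconda-recipes
git checkout -b add-mytool

# each recipe lives in its own folder under recipes/
mkdir -p recipes/mytool
cp /path/to/meta.yaml /path/to/build.sh recipes/mytool/

git add recipes/mytool
git commit -m "Add mytool 1.0.0"
git push origin add-mytool
# now open the pull request on GitHub and wait for the build and lint checks
```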
Now we have been talking about Conda recipes and Bioconda recipes, but what about Docker and Singularity containers? As you know, most nf-core pipelines strictly use Docker and Singularity containers all the time and do not necessarily even have support for Conda. Well, as it turns out, the Bioconda and conda-forge communities have a very good agreement with the BioContainers community: all Bioconda recipes are automatically built as Docker containers and also as Singularity containers. If you look at the Bioconda package index, for example the samtools entry I showed on one of the previous slides, you can click on the container button. Although it says "none", it is actually not none: you will see a list on Quay.io where the samtools Docker images have been uploaded automatically by the continuous integration service. These are automatically available, which means that if you create a new recipe, a Docker container for it will also be available within a couple of hours. The same applies to the Singularity containers: these are built by the Galaxy team and served from the Galaxy depot server, which is also linked here. You can simply download your package of choice from there as a Singularity container, so you do not even have to write your own Dockerfile or Singularity definition file. It looks like this: you can run the image directly from Quay.io under the biocontainers organisation with the samtools version tag, and you can do the same with Singularity, where you have a Singularity URL with the samtools container, although these are different versions here at the moment. But nevertheless, I think the point is clear.
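In practice, pulling such an automatically built container looks roughly like this; the version and build tag below are only an example, so check Quay.io for the current tags.

```bash
# Docker image from the biocontainers organisation on Quay.io
docker pull quay.io/biocontainers/samtools:1.15.1--h1170115_0
docker run quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools --version

# pre-built Singularity image served from the Galaxy depot server
singularity pull https://depot.galaxyproject.org/singularity/samtools:1.15.1--h1170115_0
```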
However, that is always a one-tool-per-container relationship. If you download the samtools container from BioContainers, you only get samtools in there, nothing else. If you want to combine tools, for example to pipe the output from BWA directly into samtools, you have to create a so-called multi-tool container, which is a nice way of combining multiple tools. This is useful when, in a pipeline, you want to pipe output from one tool to another within a single process step, which in some cases definitely makes sense, for example to convert SAM output directly to BAM or CRAM output and benefit from the compression. In such a case it makes sense to combine, for example, BWA and samtools into one container. This can be done using the multi-tool container service, also provided by the BioContainers community. There you only have to add a set of tools to a so-called hash file, which is just a text file where you list which tools and versions you would like to combine, open a pull request with that, and wait for it to be merged. After a couple of hours, you will have the combination available as a separate container, which you can then use for your purposes.
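A request for a BWA plus samtools combination would then be a single extra line in that hash file in the BioContainers multi-package-containers repository, roughly as sketched below; the tool versions are placeholders.

```bash
# append the desired combination to the hash file and open a pull request
echo 'bwa=0.7.17,samtools=1.15.1' >> combinations/hash.tsv
```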
Well, after all these containers and Conda packages, you are probably wondering how to use these containers efficiently in nf-core pipelines. A lot of people have put a lot of effort into making that much easier, especially with the DSL2 pipelines, where you have modules available. In this case, as has been briefly outlined in the past, especially on the Slack channels around building modules, we really rely on BioContainers and the nf-core/tools commands around them to make that as easy as possible for you. If you, for example, install tools like FastQC, samtools, and MultiQC in your pipeline using nf-core modules install, the modules will automatically come with pre-configured URLs pointing at the latest versions of the respective tools, so you do not have to look up the Docker and Singularity containers yourself. If you want to write a new module, you can simply do that with nf-core modules create. This will interactively ask you which tool you would like to write a module for, automatically look up in the BioContainers API whether a container is already available, and try to put it into your module right away.
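The corresponding nf-core/tools commands look roughly like this; the module names are just examples.

```bash
# install existing modules, which come with pre-configured container URLs
nf-core modules install fastqc
nf-core modules install samtools/sort
nf-core modules install multiqc

# scaffold a new module; the tool name is looked up against the
# Bioconda/BioContainers APIs so container URLs are pre-filled
nf-core modules create

# refresh an installed module (and its container URLs) later on
nf-core modules update fastqc
```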
Updates work in a very similar way, as the sketch above also shows: there is an update command that will automatically update the container URLs when the module code has been updated. Whenever you build a new module, nf-core/tools will query the BioContainers API to look up these URLs for you. To summarize what we have covered today, although not in much detail because time is limited: the standard approach to packaging software and tools for nf-core pipelines is to check Bioconda and conda-forge for an existing recipe for the tool. If it does not exist, we typically try to add it to either Bioconda or conda-forge to make sure it is available to the broader community. We then rely on BioContainers and Galaxy to build the Docker containers and host the Singularity containers for us to use.
It is also a good idea, if you do not want to maintain all of this on your own, to rely heavily on the nf-core modules, which already have pre-configured URLs. And if you work with modules, you should always use nf-core/tools, because it automatically fetches and updates the URLs in the modules for you when needed. Something that was also briefly mentioned to me in the Slack channel today: if you have any issues with Conda packages, please try Mamba as a drop-in replacement. The commands are essentially the same; the difference is that you get much better error output, so you know much better what went wrong, and much faster dependency resolution, which tells you much more quickly where your issues are. For example, if you import a Python package that is incompatible with another Python package in your Conda environment, you will see that much quicker with Mamba than with regular Conda.
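Because Mamba is a drop-in replacement, switching really is just swapping the command name; the environment and package names below are illustrative.

```bash
# install mamba into the base environment from conda-forge
conda install -n base -c conda-forge mamba

# identical syntax to the conda equivalents, but faster solving
mamba create -n mapping -c conda-forge -c bioconda bwa samtools
mamba install -n mapping -c bioconda multiqc
```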
Some last words, maybe. Software packaging can really get complicated sometimes. To be very honest, I have spent more hours than I would like to admit on making Bioconda and conda-forge packages. Nevertheless, it always pays off in the end, because once you are there, once you have done it once, it is usually really easy to keep these Bioconda packages updated. It is also nicer because there are many other people out there, especially from the Bioconda and conda-forge communities, who will pick up packages and update them for you. They even have automated update bots that check GitHub repository URLs from time to time and open an update for your recipe, which in many cases you can simply review and accept, and then a new version of your tool is available. If you do it all manually, for example by building your own Dockerfiles all the time, you have to do all of the heavy lifting on your own, which is cumbersome and takes a lot of time. So it is probably a good idea to invest the time to bring everything to Bioconda and conda-forge and then just rely on that.
In case of doubt, always ask. As I said, there are multiple communities around who are really happy to help, and we also have the nf-core community Slack. In the help channel, for example, you can ask for guidance and input on your recipes; that is really not a problem, and we have a lot of people with experience in this. If you are a beginner and want somebody to look over your recipe before you go to the, let us say, hardcore Bioconda and conda-forge communities with their more experienced users, you can ask there as well.
Always remember that collaboration is a key factor here. If you do everything on Bioconda and conda-forge, everybody benefits, not just the nf-core users who use your packages through a pipeline. If somebody wants to use your tool for some custom analysis, they will also find it on Bioconda and conda-forge and use it, which means you also get contributors and users for your own tools. That is always great, because you also get feedback, improvements, sometimes feature requests, and sometimes even pull requests that help fix things. It has always played out nicely for me, at least.
Those are just all the help pages that we have. If you have any questions, you can ask them now. Thank you.
(question) Thank you very much, Alex, for this insightful talk. There is already a comment in the chat pointing out perhaps one further difference between Bioconda and conda-forge: conda-forge also targets Windows, Linux, and Mac, whereas Bioconda only targets Linux and Mac. That could be an additional difference.
(answer) Yes. That’s true.
(question) I also have a question, actually. My problem with the multi-tool containers: the hash table is very nice for finding out what combinations already exist or for adding a new one, but I always struggle to then find the long multi-container hash that actually provides those tools. Is there an easy way to find it?(answer) Well, there are two ways to do it. The first one is that, if you open a pull request against the multi-tool containers repository, once someone approves and merges your PR, an automated continuous integration service will pick it up and build it for you. You can go into the logs of that CI and find the URL, because at some point the CI also pushes the image to BioContainers. That is how I usually do it, because for me it always felt like the most convenient way. However, if I am not completely wrong here, because I have never used it before, there is also a service URL where you can look for combinations of packages, which you can use like a search engine, and then just look for the combination that you want to have. If you are lucky, such a container might already exist. For example, BWA and samtools is a standard combination that a lot of people will already have wanted, so there should be multiple versions with multiple combinations of the two tools existing, and you do not necessarily have to build your own. […] Look that up. Yeah, so those are just the two ways I know.
(host) Thanks a lot. I think I also only know those two ways, so we would be interested to hear if there are more.
(question) There is another question by Phil. He asks, could you reiterate when you would change the build number?(answer) Yes, maybe I go back to the recipe so that you know what we are talking about here. In some cases a recipe is broken because one of its dependencies is broken […] for example, bowtie was broken because one of the libraries that bowtie used was broken on Bioconda. Unfortunately, bowtie did not release a new version in the meantime, because bowtie itself was not broken, only the dependency. In such a case it makes sense to not change anything else, but just increase the build number, here to two. That tells the continuous integration service to rebuild the entire recipe, automatically pulling in the latest dependency, which is hopefully fixed by then, without actually changing the version of the recipe, because that has not changed, obviously. You then get, say, samtools 1.15 with a -2 build number available as a Conda package, and the containers would also have that -2 in the build number, which will hopefully be the fix. Usually, this is just used for patching dependencies or similar things.
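In the meta.yaml sketched earlier, such a dependency-only rebuild would just be a one-line change, roughly like this:

```yaml
build:
  number: 2   # was 1; forces a rebuild against the (fixed) dependencies,
              # while the package version string stays untouched
```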
(host) Yeah, thanks a lot.
(question) I actually also have a question now that we are here. I think when there is a new version of a package, there are even automated PRs that will update the recipe to the new version, right? Can you tell us a bit more about this?(answer) The Bioconda community has an automated bot that queries all the URLs mentioned in the source section of the meta.yaml files and automatically tries to update the recipes by taking the existing recipe, adjusting […], and resetting the build number again. I think it just does these three things. That runs, I think, every day or overnight or something like that, and then automatically opens pull requests against the Bioconda repository. Then people can just go there; usually the maintainers who made the recipe available in the first place are tagged in the PR, so they can review it and say, OK, this looks good. The CI also runs through in most cases, because the dependencies usually do not change that often, so the update goes through quite quickly and people do not have to do it manually on their own. If Phil, for example, updates MultiQC, the system usually picks that up within a couple of hours, and then you get a PR, if Phil was not faster than the system and opened it himself.
(host) Great, that really facilitates working with Bioconda then.
(question) We have one final question for today, I would say. Regarding the pytest runner: how do we know which version of the pytest runner is required, if you happen to know? It seems like a very specific question, though.(answer) That is a good question, which I cannot answer at the moment, to be very honest, because I am not too experienced with the details of the Bioconda and conda-forge continuous integration services. They have their own customization in place there, and I am not really familiar with how they test Python packages inside the container and package building process. I would have to look that up if that is something of concern.
(host) That would probably be something to ask on the Bioconda Slack then.
(speaker) Yes, that could be something you could ask there.
(host) OK, so thank you very much, everyone. Thank you, especially you, Alex, for this interesting talk.(speaker) You’re welcome. Hope it helped.
(host) Definitely. I’m sure it will have lots of views.