Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These talks are recorded and made available at https://nf-co.re, helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!
This week, Alexander Peltzer (@apeltzer) will present: Software packaging.
This will include how to package a tool in Bioconda, and how that ends up in BioContainers where we can use it in pipelines.
Video transcription
The content has been edited to make it reader-friendly.
There is a tool called conda-build that you have to install manually; if you install Conda, it is not always there, but you can use Conda to install conda-build. That sets up an environment in which you can also test building your recipe locally, which gives you a chance to catch errors before actually pushing anything to Bioconda. If you follow these steps, you should usually get at least a half-functional recipe out of it, I would say, if you are lucky. At least for me, that has helped most when starting from a PyPI package that is already well built.
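As a rough illustration, a minimal local test run could look like the sketch below; the recipe path and channel order are placeholders, not something shown in the talk.

```bash
# conda-build is not installed by default, so add it to the base environment
conda install -n base conda-build

# build the recipe in ./recipes/mytool (containing meta.yaml and build.sh),
# pulling dependencies from the usual channels
conda build recipes/mytool -c conda-forge -c bioconda
```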
Such an example recipe could look like this. Usually there is just a build.sh script, which is used in the build step of the recipe, and a meta.yaml file, which describes the content of the recipe. Usually people set the version of the tool or package at the top and then just refer to it in the version string further down. The build number needs to be changed at some point: if you, for example, need to rebuild a recipe without changing the tool version, you have to increase it. You also have to list the source URL, and this has to be a fixed URL, so it cannot be one that is overwritten all the time. […] Then come the requirements to build, to host, and to run, which are listed further down in the recipe. This is just an example; there are much more complicated recipes out there, but also much simpler ones. This one is a C/C++ tool, which means that make and the C compilers have to be present, for example.
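For orientation, a hypothetical meta.yaml following that structure might look like this; the name, version, URL and checksum are placeholders, not a real Bioconda recipe, and the accompanying build.sh would typically just run the tool's configure/make/install steps.

```yaml
{% set version = "1.0.0" %}

package:
  name: mytool
  version: "{{ version }}"

source:
  # fixed URL to one specific release tarball, plus its checksum
  url: https://github.com/example/mytool/archive/v{{ version }}.tar.gz
  sha256: "<checksum of this exact tarball>"

build:
  number: 0   # increase only when rebuilding the same version

requirements:
  build:
    - make
    - {{ compiler('c') }}
  host:
    - zlib
  run:
    - zlib

test:
  commands:
    - mytool --version
```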
Once you are done writing that recipe, you can submit a pull request to Bioconda and wait for the automated build and linting checks to hopefully tell you that your recipe is in order and that everything that needs to be done has been done properly. However, I have to mention here again that Bioconda and conda-forge are slightly different; they have a somewhat different setup. In Bioconda, everything lives in one big master repository, whereas conda-forge starts a bit differently. How that difference plays out in the end is described in the documentation I linked on one of the first slides; we cannot cover it fully here. If you are lucky and everything builds fine, then once somebody from the community reviews and approves your recipe, it will be merged. Your recipe will then automatically be available in the Bioconda or conda-forge package index within a couple of minutes, although sometimes it takes a couple of hours, depending on how fast the synchronization works.
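For the Bioconda side, the contribution workflow is roughly the following sketch; the fork URL and the tool name are placeholders, and the exact layout is described in the Bioconda documentation mentioned above.

```bash
# clone your fork of bioconda-recipes and create a branch for the new recipe
git clone https://github.com/<your-github-user>/bioconda-recipes.git
cd bioconda-recipes
git checkout -b add-mytool

# each recipe lives in its own folder under recipes/
mkdir -p recipes/mytool
cp /path/to/meta.yaml /path/to/build.sh recipes/mytool/

git add recipes/mytool
git commit -m "Add mytool 1.0.0"
git push origin add-mytool
# now open the pull request on GitHub and wait for the build and lint checks
```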
Now we have been talking about Conda recipes and Bioconda recipes, but what about Docker and Singularity containers? As you know, most nf-core pipelines strictly use Docker and Singularity containers all the time and do not necessarily even have support for Conda. Well, as it turns out, the Bioconda and conda-forge communities have a very good agreement with the BioContainers community: all Bioconda recipes are automatically built as Docker containers and also as Singularity containers. If you look at the Bioconda package index, for example the samtools entry I showed on one of the previous slides, you can click on the container button. Although it says "none", it is actually not none: you will see a list on Quay.io where the samtools Docker images have been uploaded automatically by the continuous integration service. These are automatically available, which means that if you create a new recipe, a Docker container for it will also be available within a couple of hours. The same applies to the Singularity containers: these are built by the Galaxy team and served from the Galaxy depot server, which is also linked here. You can simply download your package of choice from there as a Singularity container, so you do not even have to write your own Dockerfile or Singularity definition file. It looks like this: you can run the image directly from Quay.io under the biocontainers organisation with the samtools version tag, and you can do the same with Singularity, where you have a Singularity URL with the samtools container, although these are different versions here at the moment. But nevertheless, I think the point is clear.
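In practice, pulling such an automatically built container looks roughly like this; the version and build tag below are only an example, so check Quay.io for the current tags.

```bash
# Docker image from the biocontainers organisation on Quay.io
docker pull quay.io/biocontainers/samtools:1.15.1--h1170115_0
docker run quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools --version

# pre-built Singularity image served from the Galaxy depot server
singularity pull https://depot.galaxyproject.org/singularity/samtools:1.15.1--h1170115_0
```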
However, that is always a one-tool-per-container relationship. If you download the samtools container from BioContainers, you only get samtools in there, nothing else. If you want to combine tools, for example to pipe the output from BWA directly into samtools, you have to create a so-called multi-tool container, which is a nice way of combining multiple tools. This is useful when, in a pipeline, you want to pipe output from one tool to another within a single process step, which in some cases definitely makes sense, for example to convert SAM output directly to BAM or CRAM output and benefit from the compression. In such a case it makes sense to combine, for example, BWA and samtools into one container. This can be done using the multi-tool container service, also provided by the BioContainers community. There you only have to add a set of tools to a so-called hash file, which is just a text file where you list which tools and versions you would like to combine, open a pull request with that, and wait for it to be merged. After a couple of hours, you will have the combination available as a separate container, which you can then use for your purposes.
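A request for a BWA plus samtools combination would then be a single extra line in that hash file in the BioContainers multi-package-containers repository, roughly as sketched below; the tool versions are placeholders.

```bash
# append the desired combination to the hash file and open a pull request
echo 'bwa=0.7.17,samtools=1.15.1' >> combinations/hash.tsv
```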
Well, after all these containers and Conda packages, you are probably wondering how to use these containers efficiently in nf-core pipelines. A lot of people have put a lot of effort into making that much easier, especially with the DSL2 pipelines, where you have modules available. In this case, as has been briefly outlined in the past, especially on the Slack channels around building modules, we really rely on BioContainers and the nf-core/tools commands around them to make that as easy as possible for you. If you, for example, install tools like FastQC, samtools, and MultiQC in your pipeline using nf-core modules install, the modules will automatically come with pre-configured URLs pointing at the latest versions of the respective tools, so you do not have to look up the Docker and Singularity containers yourself. If you want to write a new module, you can simply do that with nf-core modules create. This will interactively ask you which tool you would like to write a module for, automatically look up in the BioContainers API whether a container is already available, and try to put it into your module right away.
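The corresponding nf-core/tools commands look roughly like this; the module names are just examples.

```bash
# install existing modules, which come with pre-configured container URLs
nf-core modules install fastqc
nf-core modules install samtools/sort
nf-core modules install multiqc

# scaffold a new module; the tool name is looked up against the
# Bioconda/BioContainers APIs so container URLs are pre-filled
nf-core modules create

# refresh an installed module (and its container URLs) later on
nf-core modules update fastqc
```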
Updates work in a very similar way, as the sketch above also shows: there is an update command that will automatically update the container URLs when the module code has been updated. Whenever you build a new module, nf-core/tools will query the BioContainers API to look up these URLs for you. To summarize what we have covered today, although not in much detail because time is limited: the standard approach to packaging software and tools for nf-core pipelines is to check Bioconda and conda-forge for an existing recipe for the tool. If it does not exist, we typically try to add it to either Bioconda or conda-forge to make sure it is available to the broader community. We then rely on BioContainers and Galaxy to build the Docker containers and host the Singularity containers for us to use.
It is also a good idea, if you do not want to maintain all of this on your own, to rely heavily on the nf-core modules, which already have pre-configured URLs. And if you work with modules, you should always use nf-core/tools, because it automatically fetches and updates the URLs in the modules for you when needed. Something that was also briefly mentioned to me in the Slack channel today: if you have any issues with Conda packages, please try Mamba as a drop-in replacement. The commands are essentially the same; the difference is that you get much better error output, so you know much better what went wrong, and much faster dependency resolution, which tells you much more quickly where your issues are. For example, if you import a Python package that is incompatible with another Python package in your Conda environment, you will see that much quicker with Mamba than with regular Conda.
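Because Mamba is a drop-in replacement, switching really is just swapping the command name; the environment and package names below are illustrative.

```bash
# install mamba into the base environment from conda-forge
conda install -n base -c conda-forge mamba

# identical syntax to the conda equivalents, but faster solving
mamba create -n mapping -c conda-forge -c bioconda bwa samtools
mamba install -n mapping -c bioconda multiqc
```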
Some last words, maybe. Software packaging can really get complicated sometimes. To be very honest, I have spent more hours than I would like to admit on making Bioconda and conda-forge packages. Nevertheless, it always pays off in the end, because once you are there, once you have done it once, it is usually really easy to keep these Bioconda packages updated. It is also nicer because there are many other people out there, especially from the Bioconda and conda-forge communities, who will pick up packages and update them for you. They even have automated update bots that check GitHub repository URLs from time to time and open an update for your recipe, which in many cases you can simply review and accept, and then a new version of your tool is available. If you do it all manually, for example by building your own Dockerfiles all the time, you have to do all of the heavy lifting on your own, which is cumbersome and takes a lot of time. So it is probably a good idea to invest the time to bring everything to Bioconda and conda-forge and then just rely on that.
In case of doubt, always ask. As I said, there are multiple communities around who are really happy to help, and we also have the nf-core community Slack. In the help channel, for example, you can ask for guidance and input on your recipes; that is really not a problem, and we have a lot of people with experience in this. If you are a beginner and want somebody to look over your recipe before you go to the, let us say, hardcore Bioconda and conda-forge communities with their more experienced users, you can ask there as well.
Always remember that collaboration is a key factor here. If you do everything on Bioconda and conda-forge, everybody benefits, not just the nf-core users who use your packages through a pipeline. If somebody wants to use your tool for some custom analysis, they will also find it on Bioconda and conda-forge and use it, which means you also get contributors and users for your own tools. That is always great, because you also get feedback, improvements, sometimes feature requests, and sometimes even pull requests that help fix things. It has always played out nicely for me, at least.
Those are just all the help pages that we have. If you have any questions, you can ask them now. Thank you.
(question) Thank you very much, Alex, for this insightful talk. There is already a comment in the chat pointing out perhaps one further difference between Bioconda and conda-forge: conda-forge also targets Windows, Linux, and Mac, whereas Bioconda only targets Linux and Mac. That could be an additional difference.
(answer) Yes. That’s true.
(question) I also have a question, actually. My problem with the multi-tool containers: the hash table is very nice for finding out what combinations already exist or for adding a new one, but I always struggle to then find the long multi-container hash that actually provides those tools. Is there an easy way to find it?(answer) Well, there are two ways to do it. The first one is that, if you open a pull request against the multi-tool containers repository, once someone approves and merges your PR, an automated continuous integration service will pick it up and build it for you. You can go into the logs of that CI and find the URL, because at some point the CI also pushes the image to BioContainers. That is how I usually do it, because for me it always felt like the most convenient way. However, if I am not completely wrong here, because I have never used it before, there is also a service URL where you can look for combinations of packages, which you can use like a search engine, and then just look for the combination that you want to have. If you are lucky, such a container might already exist. For example, BWA and samtools is a standard combination that a lot of people will already have wanted, so there should be multiple versions with multiple combinations of the two tools existing, and you do not necessarily have to build your own. […] Look that up. Yeah, so those are just the two ways I know.
(host) Thanks a lot. I think I also only know those two ways, so we would be interested to hear if there are more.
(question) There is another question by Phil. He asks, could you reiterate when you would change the build number?(answer) Yes, maybe I go back to the recipe so that you know what we are talking about here. In some cases a recipe is broken because one of its dependencies is broken […] for example, bowtie was broken because one of the libraries that bowtie used was broken on Bioconda. Unfortunately, bowtie did not release a new version in the meantime, because bowtie itself was not broken, only the dependency. In such a case it makes sense to not change anything else, but just increase the build number, here to two. That tells the continuous integration service to rebuild the entire recipe, automatically pulling in the latest dependency, which is hopefully fixed by then, without actually changing the version of the recipe, because that has not changed, obviously. You then get, say, samtools 1.15 with a -2 build number available as a Conda package, and the containers would also have that -2 in the build number, which will hopefully be the fix. Usually, this is just used for patching dependencies or similar things.
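In the meta.yaml sketched earlier, such a dependency-only rebuild would just be a one-line change, roughly like this:

```yaml
build:
  number: 2   # was 1; forces a rebuild against the (fixed) dependencies,
              # while the package version string stays untouched
```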
(host) Yeah, thanks a lot.
(question) I actually also have a question now that we are here. I think when there is a new version of a package, there are even automated PRs that will update the recipe to the new version, right? Can you tell us a bit more about this?(answer) The Bioconda community has an automated bot that queries all the URLs mentioned in the source section of the meta.yaml files and automatically tries to update the recipes by taking the existing recipe, adjusting […], and resetting the build number again. I think it just does these three things. That runs, I think, every day or overnight or something like that, and then automatically opens pull requests against the Bioconda repository. Then people can just go there; usually the maintainers who made the recipe available in the first place are tagged in the PR, so they can review it and say, OK, this looks good. The CI also runs through in most cases, because the dependencies usually do not change that often, so the update goes through quite quickly and people do not have to do it manually on their own. If Phil, for example, updates MultiQC, the system usually picks that up within a couple of hours, and then you get a PR, if Phil was not faster than the system and opened it himself.
(host) Great, that really facilitates working with Bioconda then.
(question) We have one final question for today, I would say. Regarding the pytest runner: how do we know which version of the pytest runner is required, if you happen to know? It seems like a very specific question, though.(answer) That is a good question, which I cannot answer at the moment, to be very honest, because I am not too experienced with the details of the Bioconda and conda-forge continuous integration services. They have their own customization in place there, and I am not really familiar with how they test Python packages inside the container and package building process. I would have to look that up if that is something of concern.
(host) That would probably be something to ask on the Bioconda Slack then.
(speaker) Yes, that could be something you could ask there.
(host) OK, so thank you very much, everyone. Thank you, especially you, Alex, for this interesting talk.(speaker) You’re welcome. Hope it helped.
(host) Definitely. I’m sure it will have lots of views.