Bytesize 35: Troubleshooting a failed pipeline

Phil Ewels - National Genomics Infrastructure / SciLifeLab, Sweden

Event start: March 1, 2022 at 12:00
Event end: March 1, 2022 at 12:30

Locations:

- Online
  - https://youtu.be/z9n2F4ByIkY
  - https://doi.org/10.6084/m9.figshare.19382933.v2

Join us for our weekly series of short talks: nf-core/bytesize.

Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These are recorded and made available at https://nf-co.re, helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!

This week, Phil Ewels (@ewels) will show us how to troubleshoot a failing pipeline. He’ll cover common problems, where to start looking for issues and how to ask for help in the most effective way.

Video transcription

Note

The content has been edited to make it reader-friendly

0:01 Hi everyone, thank you for joining today’s bytesize talk. First of all I’d like to thank the Chan Zuckerberg Initiative for funding all nf-core events. As always, the talk will be recorded and shared on our YouTube channel and on Slack, so if you’re not able to catch all of it now you can catch up later. Today we’re glad to have Phil, who is a bioinformatician at SciLifeLab in Sweden and also the author of MultiQC, presenting on troubleshooting failed nf-core pipelines. It’s going to be roughly a 15 minute talk, followed by a Q&A session and discussion, so feel free to use the chat box or unmute yourself and pose any question or comment at the end. Over to you Phil.

1:01 Thank you very much, thanks for the introduction. It’s nice to be back giving another bytesize talk, it’s been a few months since I’ve done one, and it’s quite a nice topic today I think. I’m hoping that this will be a good resource, especially for people new to nf-core who are just picking up Nextflow and our nf-core pipelines and might be running into trouble. The idea today is to run through some of the common questions that we see on Slack when people try to run pipelines and hit difficulties, and to walk you through my personal typical steps when something goes wrong. I’d like to point out that this talk is aimed at end users, so people running pipelines. The original title was “debugging a failed pipeline”, but it’s not really debugging: I’m not going to go into the code of the pipelines themselves. I think that would be a good follow-on talk. And like many of these talks, this is my personal take on it. I’m looking forward to hearing what everyone says in the chat and the discussion afterwards about what you do if you’re a bit more experienced, and any suggestions you have. Hopefully that part of today’s talk will be as good as my slides. Right, so I’ll kick off, let’s see if I can get Zoom working, yes, because at some point things will go wrong.

2:33 I don’t care how experienced a bioinformatician you are or how many years you’ve been using Nextflow: stuff can and will go wrong if you run enough pipelines. This is your lifeline, so take a step back, take a deep breath, try not to send your keyboard through your computer monitor and we’ll walk you through how to get things up and running again. I’ve broken the talk up into five sections, these are the steps I take. The first one is to start small and start simple. We say this over and over again but you can’t repeat it too many times really. We have this -profile test, we use that for all of our automated testing and that should always work. If you are starting to use Nextflow or you’re running a new pipeline for the first time, it’s always a good idea to run it first with the test profile. Keep everything as small and minimal as possible and check that it works, because it should work: it passes the automated tests. If it doesn’t, then there’s something wrong at your end with the way you’re running Nextflow, with the config, with something that’s outside of the pipeline itself. It isolates where the problem is coming from, which is what this is all about. Start small, don’t use some massive dataset that’s going to take days, just check that the pipeline runs as you expect with a minimal test dataset.

4:05 Also, if you’ve hit a problem, just check the basics. You don’t have to go far in the Slack history to see that lots and lots of people’s problems have been resolved by updating Nextflow. Nextflow releases come out fairly frequently and within nf-core we tend to use many of the latest features of Nextflow. Many of those features come as a result of us requesting them, so it’s not really a surprise. The first thing I always do if something goes wrong is check that I’m running the latest stable Nextflow release. If you’re running the latest edge version then that’s also interesting; maybe try with the stable version, because that could be important for the pipeline developers to know. Then there’s all the other really simple stuff: have you got enough disk space? Pipelines will fail in weird and unpredictable ways if you run out of disk space. If you’re using Docker, do you have a Docker daemon running in the background? Did you remember to start it? Just run through these basic things and often that will get you up and running. Wherever we see common things coming up within nf-core we try to add them to the troubleshooting documentation page on the website. If you haven’t already, just have a scan through that and see if what’s happening to you is mentioned there.
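
The basic checks in this part of the talk can be scripted. The following is a minimal Python sketch, not something shown in the talk: the function name `check_basics` and the 10 GB free-space threshold are arbitrary assumptions. It only looks at free disk space, whether `nextflow` and `docker` are on the PATH, and whether `docker info` can reach a daemon.

```python
import shutil
import subprocess


def check_basics(path=".", min_free_gb=10.0):
    """Return a dict of quick environment checks (hypothetical helper)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    report = {
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,  # threshold is an assumption
        "nextflow_on_path": shutil.which("nextflow") is not None,
        "docker_on_path": shutil.which("docker") is not None,
        "docker_daemon_running": False,
    }
    if report["docker_on_path"]:
        # `docker info` exits non-zero when it cannot reach the daemon.
        try:
            result = subprocess.run(
                ["docker", "info"], capture_output=True, timeout=20
            )
            report["docker_daemon_running"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass
    return report


if __name__ == "__main__":
    for name, value in check_basics().items():
        print(f"{name}: {value}")
```

It does not check the Nextflow version itself; comparing against the latest stable release still needs a look at the release notes.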

5:21 Most people will do that stuff without really thinking about it. But the next step is to categorize what kind of error you’re seeing. Just because Nextflow fails on a certain step of a pipeline doesn’t necessarily mean that it was that step, or that software tool, which was responsible for the failure. Different types of failures happen at different times in the execution of a pipeline. We’re going to go through that now. Errors can happen before the first process kicks off, right when you first run Nextflow. They can happen during that first process: when Nextflow actually tries to run something, it fails. They can happen somewhere else during the run, or something can go wrong at the very end. These are different stages. One of the most common is before the first process. You try to run Nextflow and it just dies immediately. I have mostly taken these examples either from myself or from searching the Slack history. Apologies in advance if you see one of your queries on Slack coming up as an example, I’m not picking on anyone. They’re just typical examples.

6:33 This is a very obvious one: “Unknown config attribute: projectDir”. Nextflow found something in a config which it doesn’t recognize. The reason is that this particular attribute is only available in more recent versions of Nextflow. You can see at the top that this is running 0.27.3, which is years old now. Not very surprising. Nextflow isn’t up to date; update your version and it should work. This one is very obvious. It happens right away and you’ve only got a couple of lines of output here, but it’s not always that obvious. You could be running the RNA-seq pipeline like here. You get tons of output. All looks good. It’s nice and colored. Everything’s fancy, but you get a really obscure error spat out. Just take a step back, go back up to the top. Sure enough, this version is not very out of date, but a little bit out of date. And that’s enough in this case to make the pipeline fail. Nextflow version, always check that first.

7:28 Remember, if you’re new to Nextflow and nf-core, you need to tell Nextflow how to handle software dependencies. Out of the box, if you just run a pipeline without any arguments, Nextflow will expect all of that software to be installed on your machine, which is almost certainly not the case. You need to tell it: I want to use Conda, I want to use Docker, I want to use Singularity. There are about eight different types of engines which we can use to handle software dependencies automatically for you, but you need to tell Nextflow which one to use. Typically, we do this with a config profile. Here I’ve got test,docker: I’m saying run the test profile and use Docker to do it. Of course, you might want to use a different tool here, or you might have your own config which defines which software tool to use. It might be the name of your institutional config or something. Make sure that you don’t have any spaces around the comma. If you have a space there, then it will just run the test profile and ignore Docker, and your pipeline won’t have any software to use. Small thing, catches a lot of people, including myself, I’ve done it lots of times.
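
To see why a stray space matters, it helps to look at how a shell splits the command into words. The sketch below is an illustration only: `profile_problems` is a hypothetical helper, and it uses Python’s `shlex.split` to mimic POSIX shell word splitting. With a space after the comma, `docker` becomes a separate word and is no longer part of the value given to `-profile`.

```python
import shlex


def profile_problems(command):
    """Return likely mistakes in the -profile argument (hypothetical helper)."""
    tokens = shlex.split(command)  # split the way a POSIX shell would
    if "-profile" not in tokens:
        return ["no -profile given: unless a config says otherwise, "
                "the tools must already be installed locally"]
    idx = tokens.index("-profile")
    if idx + 1 >= len(tokens):
        return ["-profile has no value"]
    value = tokens[idx + 1]
    following = tokens[idx + 2] if idx + 2 < len(tokens) else ""
    problems = []
    # A space next to the comma leaves a dangling comma on one of the words.
    if value.endswith(",") or following.startswith(","):
        problems.append("space next to a comma: the profile list was split")
    return problems


if __name__ == "__main__":
    good = "nextflow run nf-core/rnaseq -profile test,docker"
    bad = "nextflow run nf-core/rnaseq -profile test, docker"
    print(shlex.split(bad)[-2:])   # the two words the shell hands over
    print(profile_problems(good))
    print(profile_problems(bad))
```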

8:40 What you’ll get very often when something goes wrong, especially if the pipeline fails within the first process or during the actual execution of a pipeline, is a lot of output. And this can be quite intimidating. Nextflow really tries to help you figure out what’s gone wrong. To do that, it tells you everything it possibly knows about the step that was running when it failed. There’s quite a lot of output, and this isn’t all of it here. But let’s take a pause and try to work through it. Once you get used to looking at these kinds of errors and breaking down the different sections, they’re quite quick to skim through. What we’re really going for here is always finding the relevant part of the log, the bit that tells you what’s wrong.

9:28 Here, the bit that pops out to me is: “command not found”. This was a step in a pipeline, the first step, and it looks like maybe something is wrong with the RNA-seq pipeline here. But when I see “command not found”, this is almost certainly a software packaging problem. This has been run without Docker. Nextflow doesn’t know where to find the tool that it’s trying to run, so it exits with an error saying the command is not found. Add -profile docker or something similar, and this will fix itself. Other typical errors within this first process could be something to do with actually submitting the job to your compute environment. Here someone was trying to submit a job to a SLURM HPC cluster using sbatch. The error at the top says: caused by “failed to submit process to grid scheduler for execution”. There’s an sbatch error. And in fact, you can see under the command output, it actually tells you what was wrong. Again, this is not a problem to do with the pipeline, this is a problem to do with your config.
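
The pattern in these examples is that a short phrase in the error points to a category of problem. A rough Python sketch of that idea follows. The `HINTS` table and `suggest` function are assumptions made for illustration, the phrases are the ones quoted in the talk so far, and the sample error line in the usage section is made up.

```python
# Phrase to look for (lower case) -> what it usually points to, per the talk.
HINTS = [
    ("unknown config attribute",
     "Nextflow may be too old for this pipeline: update Nextflow"),
    ("command not found",
     "software is not available: add a software profile such as -profile docker"),
    ("failed to submit process to grid scheduler",
     "job submission failed: check your cluster config, not the pipeline"),
]


def suggest(error_text):
    """Return the hints whose phrase occurs in the error text."""
    lowered = error_text.lower()
    return [hint for phrase, hint in HINTS if phrase in lowered]


if __name__ == "__main__":
    # Made-up error line, only for demonstration.
    for hint in suggest(".command.sh: line 2: fastqc: command not found"):
        print(hint)
```

A lookup table like this is only a first filter; the surrounding log still needs reading.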

10:36 I touched on this already, but let’s break down that log and try to get used to what it’s telling us. There’s a lot of text to look at here, but the structure is always the same. At the top we have information about where you were in a pipeline and what kind of error there was. The top line says, okay, this is the process. Every pipeline is built up of lots of different processes that run in order. This is the name of the process that went wrong. In brackets, you’ve got the tag, which in this case is the name of the file where it broke. Then it says “caused by”, and that’s a summary headline of what went wrong. Here Nextflow was expecting some output and it didn’t get it. It was expecting a zip file and it wasn’t generated. Then it says, okay, this is what I was running. This is the exact bash command, which to be honest, for nf-core and for us is rarely interesting. Most of the time you can trust this, but that’s the resolved command that was run. Then you’ve got the exit status, which is the status that was returned by the command when it finished. Usually non-zero means error and zero means success. But in this case, we got zero even though it was an error. Next up in the log, we have the actual output from that tool. Command line tools can generate two types of output in a terminal, standard out and standard error, but for the purposes of this talk they are one and the same thing. So this is just telling us the two different types of output that we got from FastQC at this point. There wasn’t anything on standard out, but standard error gives us a big blob of text. If you ran FastQC yourself manually in a terminal, this is basically what would be printed out. There’s a bit of a red herring at the top. That warning message is actually not related to the error in this case.
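
Because the structure of the report is always the same, it can be split into its sections mechanically. The sketch below is an illustration under stated assumptions: the section labels in `HEADERS` are written as I understand Nextflow prints them and may differ between versions, `split_report` is a hypothetical helper, and `SAMPLE_REPORT` is a made-up, shortened report modelled on the FastQC case described in the talk, not real output.

```python
HEADERS = (
    "Caused by:",
    "Command executed:",
    "Command exit status:",
    "Command output:",
    "Command error:",
    "Work dir:",
)

# Made-up, abbreviated report for demonstration only.
SAMPLE_REPORT = """\
Error executing process > 'FASTQC (sample1)'

Caused by:
  Missing output file(s) `*.zip` expected by process `FASTQC (sample1)`

Command executed:
  fastqc sample1.fastq.gz

Command exit status:
  0

Command output:
  (empty)

Command error:
  Your file is probably truncated

Work dir:
  /path/to/work/ab/cdef12

Tip: check the files in the work directory
"""


def split_report(text):
    """Split an error report into {section name: text} (hypothetical helper)."""
    sections = {"header": []}
    current = "header"
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in HEADERS:
            current = stripped.rstrip(":")
            sections[current] = []
        elif stripped.startswith("Tip:"):
            current = "Tip"
            sections[current] = [stripped]
        elif stripped:
            sections[current].append(stripped)
    return {name: "\n".join(lines) for name, lines in sections.items()}


if __name__ == "__main__":
    parts = split_report(SAMPLE_REPORT)
    for name in ("Caused by", "Command exit status", "Command error", "Work dir"):
        print(f"{name}: {parts[name]}")
```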

12:31 If you keep reading, what looks interesting is here. FASTQC is telling us what went wrong. It’s just buried. It’s saying: “Your file is probably truncated”. This error is almost certainly due to a corrupted file, a download that didn’t finish. This again is very common and you just need to work your way through the log file and the output to figure this out and try and spot that little nugget of interesting information in here. This is another example. This is running samtools and again, same thing, here it is buried in there, “samtools sort: truncated file”. These are all examples I pulled out of Slack.
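
If a “truncated file” message makes you suspect an incomplete download, one way to confirm it for gzip-compressed inputs is to read the file to the end. This is a minimal sketch using Python’s standard `gzip` module; the function name `gzip_is_complete` is an assumption, and the check only covers gzip files (for example `.fastq.gz`), not BAM or other formats.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks; a truncated file raises
            # EOFError, a corrupted one raises OSError or zlib.error.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        status = "ok" if gzip_is_complete(name) else "truncated or corrupt"
        print(f"{name}: {status}")
```

Comparing checksums against the values published with the data, where available, is a stronger test than this.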

13:11 If you need to dig into this a bit more: what we’ve looked at so far is just the main output from Nextflow in the terminal when you run the pipeline. But you can start to dig into the specific process, and that’s where the next bit of the log is useful. Here it tells you where that process was running. Every process generates multiple tasks and each task runs in its own work directory, an isolated file system. This is the path to that work directory. You can go in there and dig around in those files to see if you can spot anything that wasn’t immediately obvious in the summary log output we have here. What’s in a typical work directory, the anatomy of a work directory? You have all the input files and any output files that were generated by the task, but you’ll normally also have a core set of hidden files which Nextflow itself generates. You have a bunch of files which just capture the output from the tool. I’ve mentioned already that you’ve got standard out and standard error; you have a file for each, and .command.log captures both in one file, which might be useful if you want to know what order different stuff came out in. You have files which Nextflow uses to track and run the job itself. The .exitcode file just captures the zero or non-zero exit value. You’ve got the trace, and you’ve got .command.begin, which to be honest, I’m not sure I’ve ever looked at. Then what’s usually most interesting, after the output from the tool, is .command.run and .command.sh. The latter is the bash script, and that’s just the resolved command which is run. You can try running that yourself on the command line, but that won’t use any of the software stuff like Docker. The .command.run file is what Nextflow itself actually launches, and that will use Docker and everything, so it should give you an identical error message.
This is particularly useful to look at if you’re using sbatch or another HPC job scheduler, because at the top of that file you’ll have the requests that were actually given to the cluster. If your cluster is rejecting your jobs because of weird memory or CPU requirements, you can check in there, look at the headers, and then manually debug that.
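
A small script can report which of these files exist in a work directory and pull out the exit code and scheduler headers. The sketch below is illustrative: the helper names are assumptions, the file names are the hidden dot-files Nextflow writes into each task directory, the one-line descriptions are my own summaries, and the `#SBATCH` filter only applies to SLURM.

```python
from pathlib import Path

# Hidden files Nextflow writes in a task work directory.
TASK_FILES = {
    ".command.sh": "resolved script for the task",
    ".command.run": "wrapper that Nextflow launches (container, scheduler headers)",
    ".command.out": "standard output of the tool",
    ".command.err": "standard error of the tool",
    ".command.log": "standard output and standard error together",
    ".command.begin": "marker written when the task starts",
    ".command.trace": "resource usage trace",
    ".exitcode": "exit status of the task",
}


def summarise_work_dir(work_dir):
    """Return ({file name: exists?}, exit code or None) for a work directory."""
    work_dir = Path(work_dir)
    present = {name: (work_dir / name).is_file() for name in TASK_FILES}
    exit_code = None
    if present[".exitcode"]:
        text = (work_dir / ".exitcode").read_text().strip()
        if text.lstrip("-").isdigit():
            exit_code = int(text)
    return present, exit_code


def scheduler_headers(work_dir):
    """Return the '#SBATCH ...' lines from .command.run, if there are any."""
    run_file = Path(work_dir) / ".command.run"
    if not run_file.is_file():
        return []
    lines = run_file.read_text(errors="replace").splitlines()
    return [line for line in lines if line.startswith("#SBATCH")]


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    present, code = summarise_work_dir(target)
    for name, exists in present.items():
        print(f"{name:16} {'found' if exists else 'missing':8} {TASK_FILES[name]}")
    print("exit code:", code)
    print("\n".join(scheduler_headers(target)))
```

Pass the path printed in the error report as the first argument to inspect that task.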

15:23 You’ve looked through all of this stuff. You still don’t really know what’s going on. Maybe you found a little nugget of text which you think is the smoking gun, but you don’t really understand what it means. Now is the time to start searching. The first place I always start is the nf-core Slack. We’ve been using it for a few years. We have two and a half thousand users. There are, I don’t know how many, tens or hundreds of thousands of messages in Slack. There’s a pretty decent chance that someone has come across this before and asked for help. The key is to search for the right thing, but once you’ve got that little nugget, stick it into the Slack search bar and have a look. Many tools and errors span multiple pipelines, and you’re probably not a member of every single pipeline channel. It’s really worth searching there, because maybe you hit the samtools sort error in, I don’t know, RNA-seq, and maybe someone has hit the same thing in the Sarek pipeline. Searching all of Slack is really, really powerful. Then, of course, there’s also Google, and finally you can ask for help. This is just a few screenshots here. If I stick that truncated file error into the nf-core Slack, you can see people talking about it in RNA-seq, in Sarek, in viralrecon. Having a dig through there might be helpful. And of course, searching Google once you have the correct bit of text is obviously helpful.

16:50 You’re still stuck. Now it’s time to ask for help. There are good ways and bad ways to do this. What I’m going to take you through is what makes my life easy as someone responding to help requests, which gives you the best chance of getting a quick and useful response. Firstly, if you can, pick the correct Slack channel to post in. We have lots of Slack channels. If your question is specific to a given pipeline, please ask in the channel for that pipeline, because the people in there will know the most about it. If you think it’s to do with the config, post in #configs and so on. If in doubt, you can always post in #help and someone will either answer you there or redirect you. Provide as much information as you can straight away. This is really important. More experienced people tend to be used to this, but especially if you’re new to the community or new to bioinformatics, you might post only the bit that you think is wrong. Out of context, it’s almost impossible to help. As a minimum, we’ll usually need the full command that you used to launch the pipeline and any Nextflow configs you used, because together they tell us what environment your error came from. Sometimes this can be quite a lot of output. If in doubt, post a short question or summary, then create a thread in Slack and dump these outputs in there so it doesn’t flood the whole Slack channel. Use markdown code blocks, don’t just paste in your text from a terminal. You want to use those triple backticks to make a markdown code block. This is purely code formatting, but it makes your message much easier to read for anyone on Slack. It’s very easy to do once you’re used to it. Try to narrow down the issue as much as possible before you ask. Go through the steps we’ve talked through, come up with the best question you can and tell other people how they can reproduce the error, because that’s how bug fixing works.
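
As a small illustration of the formatting advice, a help request can be assembled so that the command and terminal output each sit in their own markdown code block. This is only a sketch: `help_message` and its layout are assumptions, not an nf-core template, and the example output text is made up.

```python
# Three backticks, built programmatically so this example does not nest fences.
FENCE = "`" * 3


def help_message(summary, command, terminal_output):
    """Return a Slack-ready message with code blocks (hypothetical layout)."""
    return "\n".join([
        summary.strip(),
        "",
        "Command:",
        FENCE,
        command.strip(),
        FENCE,
        "",
        "Terminal output:",
        FENCE,
        terminal_output.strip(),
        FENCE,
    ])


if __name__ == "__main__":
    print(help_message(
        "FastQC step fails on the test profile",
        "nextflow run nf-core/rnaseq -profile test,docker",
        "made-up terminal output goes here",
    ))
```

Following the talk, the short summary would go in the channel and the longer blocks in the thread underneath it.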

18:59 The first thing I do if I have an error reported to me which I think comes from a pipeline is to go and see if I can get the same error. Once I can, I can work on it, dig into it and make sure I’ve fixed it. But if I can’t reproduce the error on my end, it’s very, very difficult to actually fix anything. If you fall foul of these requests, you might start to see some of these things come up. These are things where we’ve written the same reply a lot of times, so now we have little helpers within Slack. Every now and then I’ll type “more info” and you’ll get a little Slack bot message which says what I’ve just been saying: please tell us a bit more about how you ran Nextflow. Please don’t feel offended if you get this, I send it to everyone. It’s just a little reminder that we’re probably going to need more information to be able to help you. We also have one about posting in the correct channel. And if you don’t format your code blocks nicely, then there’s a risk that I or someone else might ask for better formatted ones in the future. I’m not really complaining. It’s just trying to help you out by showing how to do this, with a couple of help pages.

20:10 Right. You’ve gone through that and the people on Slack can’t help you. Or maybe you’re pretty sure that you’ve encountered a bug in the pipeline code. Now is the time to move away from Slack and actually make an issue on the pipeline repository on GitHub. This is where we track problems, feature requests and bug reports so that they don’t get lost. It’s quite easy to lose things in Slack: they just disappear out of sight and, as a maintainer, you forget about them. If you make an issue on a repository, it’s there and it keeps all that discussion together. Please do hit bug report, click get started, and it gives you a template to fill in with all of the information that we typically need to be able to help. Title, description and your terminal output, all the same stuff I’ve been hammering on about. Same again: give all the information that we’re asking for, try to narrow it down as much as possible and tell us how we can reproduce the error. If you think you know the solution, don’t be shy, say, I think it’s this bit here. And if you think you know how to fix the problem, even better: make a pull request, or an issue followed by a pull request, and submit that fix yourself, because that’s the quickest way. And of course that really helps to relieve the burden on maintainers as much as possible. That’s the way most of us who write code within nf-core got into the community. We’re always very open to pull requests.

21:45 This is meant to be a short talk and I’ve already gone over. I’m going to wrap up at this point and let’s see if anyone has any questions or any suggestions or thoughts about how they do this and any cool ideas basically. Thanks for listening.

22:05 (host) Thanks. Yeah. Feel free to ask a question or leave a comment in the chat box, or unmute yourself. Harshil says you can provide Nextflow logs along with the command and configs, and also error traces.

(speaker) Yeah. What Harshil is talking about there is this: I mentioned what’s printed to your terminal, and that’s a really good start and a lot easier to read. Nextflow also generates a log file called .nextflow.log. It’s a hidden file and it’s the verbose version of that log. It is massive and fairly difficult to dig through, but it has all the information. If you can drop that file into Slack, that really helps to debug.

22:49 (host) I don’t know if there’s anyone who has another question or…

(speaker) Yeah, James is saying that the file I mentioned, .nextflow.log, is generated where you launch Nextflow, in the launch directory.
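
Since the verbose log is large, it can help to look only at its last lines before sharing the whole file. A minimal Python sketch follows; the function name `tail_nextflow_log` and the default of 50 lines are assumptions, and it expects to be pointed at the directory where Nextflow was launched.

```python
from collections import deque
from pathlib import Path


def tail_nextflow_log(launch_dir=".", lines=50):
    """Return the last `lines` lines of .nextflow.log in the launch directory."""
    log = Path(launch_dir) / ".nextflow.log"
    if not log.is_file():
        return []
    with log.open(errors="replace") as handle:
        # deque with maxlen keeps only the final lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]


if __name__ == "__main__":
    for line in tail_nextflow_log():
        print(line)
```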

23:24 (comment) Mark is saying you can prefix a version of Nextflow to your command. If you want to run your pipeline with a specific version of Nextflow, and assuming you’re running somewhere with an internet connection, you don’t have to go and download a new binary or reinstall Nextflow or anything. You can just prefix the command with the environment variable NXF_VER= plus whatever version you want and then carry on with your normal Nextflow run command. Nextflow itself will automatically fetch that version and run with it. If you want to check whether a problem is to do with the version of Nextflow you’re running, it’s very quick to do with that technique.

(question) You just put it before the Nextflow run command or you put it in your config?

(answer) No, put it before the command. It’s just a regular bash environment variable, so if you prefer you can also do export NXF_VER wherever you want, and then that will stick around for the whole of your terminal session. But if you just want to do it for a single Nextflow run, you can prepend it to the start of your Nextflow run command and it will be used there. I’m not sure if Nextflow carries on using that version afterwards or not. You might need to run nextflow self-update again afterwards if you’ve gone back to an older version.
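
The same idea can be expressed when launching Nextflow from a script: set `NXF_VER` in the environment of that one process only. The sketch below is an assumption-laden illustration; `nextflow_command` is a hypothetical helper, and the version string `21.10.6` and the pipeline name are just examples. It builds the command and environment without running anything.

```python
import os


def nextflow_command(pipeline, profiles, version=None):
    """Return (argv, env) for a run, optionally pinning the Nextflow version."""
    env = dict(os.environ)  # copy, so the current session is left untouched
    if version is not None:
        env["NXF_VER"] = version  # read by the nextflow launcher at start-up
    argv = ["nextflow", "run", pipeline, "-profile", ",".join(profiles)]
    return argv, env


if __name__ == "__main__":
    argv, env = nextflow_command(
        "nf-core/rnaseq", ["test", "docker"], version="21.10.6"
    )
    print("NXF_VER=" + env["NXF_VER"], " ".join(argv))
```

To actually launch the run, the pair can be handed to `subprocess.run(argv, env=env, check=True)`; because the variable lives only in that copied environment, later runs in the same session are unaffected.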

(host) Thanks.

24:57 (host) Looks like there’s no one else with a comment or a question. Thanks guys. And we’ll see you next week for another bytesize talk. Thanks Phil.

(speaker) Thanks very much everyone.
