Join us for our weekly series of short talks: nf-core/bytesize.

Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These are recorded and made available at https://nf-co.re , helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!

This week, Phil Ewels (@ewels) will talk about MultiQC (of which he is the author). He’ll cover customising reports to have consistent branding, additional rich content and more.

Video transcription
Note: The content has been edited to make it reader-friendly.

0:01 (host) Hello and welcome everybody to another talk of the bytesize talk series that is offered by the nf-core community. We should mention that it is receiving support from the Chan Zuckerberg Initiative. We’re thankful for that. Today Phil Ewels is back and he will tell us more about MultiQC and how to customize MultiQC reports for example, for your own pipeline. Thanks for joining us, Phil, today.

(speaker) Thank you for having me.

0:33 Sorry that I’ve ended up doing two bytesize talks in two weeks. It’s been a bit of a reschedule shuffle. Hopefully you won’t be too tired of my voice already. Today’s talk is a bit of a break from what I’ve spoken about previously with bytesize, in that it doesn’t really talk about nf-core at all. This talk is purely about MultiQC which is one of my other pet projects which I’ve been working on for a few years now. But MultiQC is used very heavily within the majority of nf-core pipelines. We figure it’s a relevant topic for most nf-core developers certainly, but also people using nf-core pipelines as well. Today I’m gonna start off with a quick introduction just for those people who might be watching who have no idea what MultiQC is. Then I’ll talk about a few tips for people developing pipelines and recommendations how to get the most out of MultiQC, and a few recommendations for people who are running nf-core pipelines. Usually, this is most relevant for people working in facilities or large scale routine processing places. But of course it can be used by anyone.

1:41 What is MultiQC? Basically, MultiQC is to help this little guy who’s sad wading through the hundreds and hundreds of text files at the end of his or her analysis, all these log files in the terminal, trying to work out whether the analysis worked or not. Also trying to work out if there are any bad samples in his or her project. What it does is it takes all of those text files and it visualizes them within a report. You get a nice shiny graphical thing that is more human readable and you can see at a glance - hopefully - how everything’s gone and if there are any samples which might need a closer look. It supports, in a single report, multiple different bioinformatics tools, 115 or something like that we’re at at the moment. The vast majority of commonly used bioinformatics tools are represented out of the box. And it also handles multiple samples. If you have five samples in your project or 500 samples in your project, MultiQC will suck up all those different log outputs and summarize all of that into one single report for you. As well as the HTML report that it generates, MultiQC also spits out a bunch of other files which gives you a nice standardized output.

3:07 Bioinformatics tools are famous for lacking standards in file formats. MultiQC does some of that legwork for you and it gives you tab separated files by default, but you can have YAML or JSON as well. All 115 of the different bioinformatics tools will produce output which is in roughly the same flavor. It’s useful for downstream processing as well. MultiQC is written in Python, so it’s pretty easy to install if you’ve got Python set up on your system, using the Python Package Index: pip install multiqc. It’s also in Conda, or you can use it with Docker or Singularity, and there’s a Galaxy wrapper for it. In most places where you’re already running software, you’ll find MultiQC. There’s a Debian installation and all sorts.

3:55 To run MultiQC, you call the MultiQC command, which is the tool name. It needs a minimum of just one argument, which is a file path. In this case, I’ve given it a dot, which just means the current working directory in the terminal. MultiQC then will recursively look through every file and folder in that path and see what it can find. Anything that’s not a log file, or that it doesn’t recognize, it will just ignore. It’s been designed from the ground up to work with analysis pipelines, where you have all of your results in a folder, and then you just run multiqc [folder], and it will find what’s relevant. If you want, it can also take explicit file names and as many different paths as you want, if that’s better for your setup. That’s it. Once you’ve run MultiQC, it will tell you it’s generated an HTML report, and then it’s up to you, the human, to do the difficult bit, which is to look at that report, understand what it’s telling you, and continue with your analysis.
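As a minimal sketch of the invocation described above, the snippet below only assembles the command line. The helper name `build_multiqc_command` and the example paths are illustrative assumptions; nothing is executed unless you pass the result to something like `subprocess.run` yourself.

```python
import shlex


def build_multiqc_command(paths=None):
    """Assemble a MultiQC command line for the given files or folders."""
    # With no explicit paths, search the current working directory (".").
    targets = list(paths) if paths else ["."]
    return ["multiqc", *targets]


# One folder, which MultiQC searches recursively.
print(shlex.join(build_multiqc_command(["results"])))
# Several explicit paths are accepted as well.
print(shlex.join(build_multiqc_command(["results/fastqc", "logs/star.log"])))
```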

4:50 I started MultiQC back in 2016, and it has wildly exceeded my expectations. I was looking this up yesterday. If any of you follow me on Twitter, you might have seen that the MultiQC paper has just passed 2000 citations, which is just utterly mind-blowing. I certainly didn’t set out with any expectations of this. It was just an internal tool that we needed at SciLifeLab for our own internal QC. It’s very humbling that MultiQC has reached and helped so many people. There you go, 114 different bioinformatics tools supported, more coming in all the time. Those citations, I find quite terrifying, if I’m honest. You can see the graph is going up and up, and it always makes me very scared to push a new release, because I know there are always people using it. What happens if I’ve broken something? Or worse, what happens if I find out that something has been broken for the past three years and all these citations are wrong? But anyway, that’s the maintainer’s nightmare.

5:57 If I’m ever slow to respond to you when you’ve opened an issue or a pull request at MultiQC, this is my defense. We’ve had just over a thousand issues created on GitHub now for MultiQC. Nearly 150 of them are still open and need closing. There have been over 500 pull requests, so people contributing code. People’s contributions account for the majority of tools supported now. It’s really a collaborative effort, though I’m the gatekeeper and I hold all the keys. It has to get past me to get into MultiQC, but most of the code is not written by me anymore. Again, there’s always a long list of pull requests open, because it takes me quite a long time to go through them. That sounds like a lot. It is. I worked out how many days it’s been since the first commit to the MultiQC repo. It works out at about one issue every couple of days. It’s a lot to go through. Please, please be patient. I do my best. Right, that’s the introduction.

6:55 You’re happy with what MultiQC is. You’ve written an nf-core pipeline. MultiQC is working, but what tips and tricks can you do to really squeeze the most out of your MultiQC reports? An easy one to start with. All of this, everything I’m gonna describe, is in the documentation by the way. Go to multiqc.info and you’ll find all of this and a lot more. I’m mostly just gonna pick out a few things for you to go and look up if it sounds interesting. But anyway, an easy one to start with is optimizing how fast MultiQC is to run. Generally MultiQC runs within a few seconds for most things, but if you are running a lot of modules and if you’ve got large numbers of samples, it can start to take a few minutes or in extreme situations up to an hour. It can be nice to try and tune that optimization as much as possible. There are a couple of things you can do very easily to do that.

7:47 Firstly, I would recommend running MultiQC yourself with this extra flag, --profile-runtime. That will actually add an extra section to your reports. MultiQC has an introspective look at itself and works out what it’s been doing. In the log, it will tell you how long it took to run and how long it spent doing different things. In this example here, you can see the vast majority of the time was spent looking through the different files that it was given and trying to find which ones are relevant. Once it had that file list, running the modules and generating the report was actually quite quick. Within a MultiQC report, we get plots like this, which tell you how fast or how slow different search patterns were within MultiQC. MultiQC has a bunch of different ways to find relevant input files. The simplest is by a file name pattern. If a tool always gives the same suffix for its output files, they’re dead easy to find. You can just search through the file list and find them that way. But many, if not most, tools don’t do this. It might just be a standard output log to a terminal, or you can call your summary file whatever you want. Then MultiQC has to look within the file contents to find those files. That can be a bit slow. Picard here, you can see, is often one of the worst culprits. It’s got lots of different outputs it can find. There are lots of different search patterns and for each one of these it has to look through each one of your files to see if there are any matching strings. Here you can see what was run and what the main culprits are in terms of slow searching, and then you know what to focus on.

9:28 Once you’ve figured out what’s actually taking time, what do you do about it? Firstly, especially within the context of writing a pipeline, it’s very easy to tell MultiQC: you’re only gonna get output from these different tools. Don’t bother looking for Picard output because I’m not running Picard. That speeds things up quite a bit. Then you can optimize those search patterns I mentioned. Firstly, lots of modules have submodules. Picard is one tool but it has about 15 different kinds of sub-tools. You can disable search patterns for the stuff that you’re not running. Also you can use file name patterns. Maybe the tool doesn’t have a constant suffix but maybe within the pipeline you do always have a predictable file name. You can tell MultiQC to use that file name to find files instead and overwrite the default search pattern, and that can speed things up a bit. There’s a section of the documentation I’ve linked to here which talks through all the same stuff. Go and take a look if that sounds interesting. Okay, that’s the boring stuff. That’s just optimization.
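A rough sketch of what such a config could look like, held in a Python string so it can be written into a pipeline's MultiQC config file. The module list and the `*.samtools_stats.txt` suffix are assumptions made for illustration, not something shown in the talk; `run_modules` and `sp` are the config keys I would expect to use here, but check the MultiQC documentation for your version.

```python
SEARCH_CONFIG = """\
# Only run the modules this pipeline can actually produce output for.
run_modules:
  - fastqc
  - samtools

# Find samtools stats logs by file name instead of scanning file contents.
# The "*.samtools_stats.txt" suffix is a made-up pipeline naming convention.
sp:
  samtools/stats:
    fn: "*.samtools_stats.txt"
"""

# Print the YAML so it can be redirected into, for example, multiqc_config.yaml.
print(SEARCH_CONFIG, end="")
```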

10:37 I had a quick look through a couple of nf-core pipelines to see what was frequently set within their MultiQC configs. I’ve split out a few common things which make sense. In the next slide, I’ve got some stuff which I haven’t seen so much of, which might be nice. Let’s start off with the common stuff. One of the most frequent things that people want to do is change the default order of the different sections within a report. That’s quite easy to do. You have a config file in YAML and you define this key, top_modules, and you say, these are the modules I’m most interested in, in this order, and MultiQC will run those modules in the order you specify. It will still run everything else after that. If you just want FastQC at the top, you just set top_modules to FastQC and that will float to the top. If you want some more nitty gritty detail, you can specify the module_order config, which has a whole bunch of different sub keys. This, again, you can use to order the modules. You can also use it to run a single module multiple times with a file name filter. This is most commonly used for, for example, FastQC. If you’re running FastQC twice, before and after trimming, you can tell MultiQC to run the same module twice but on different subsets of files. Again, you can also overwrite things like the title of the module and a bunch of other things in here.
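The two ordering options could be sketched as follows. The section names, anchors and the `*_trimmed_fastqc.zip` / `*_raw_fastqc.zip` file name patterns are hypothetical and would need to match what your pipeline actually writes.

```python
# Simple case: float FastQC to the top, everything else follows as usual.
TOP_MODULES_CONFIG = """\
top_modules:
  - fastqc
"""

# Detailed case: run FastQC twice on different subsets of files.
MODULE_ORDER_CONFIG = """\
module_order:
  - fastqc:
      name: "FastQC (trimmed)"
      anchor: "fastqc_trimmed"
      info: "FastQC results after adapter trimming."
      path_filters:
        - "*_trimmed_fastqc.zip"
  - cutadapt
  - fastqc:
      name: "FastQC (raw)"
      anchor: "fastqc_raw"
      path_filters:
        - "*_raw_fastqc.zip"
"""

print(TOP_MODULES_CONFIG)
print(MODULE_ORDER_CONFIG, end="")
```

Giving each run its own `anchor` keeps the two FastQC sections distinct in the report navigation.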

12:04 One of the most difficult things that MultiQC has to do is work out the name of each sample. There’s no idealized situation where we just magically know what your sample identifiers are. We have to do our best guess. Usually that’s by looking at either the file name of the log or trying to find the input file name and basing it on that. But of course, if you have .fastq or .bam or whatever, you have all these different extensions then they look like different identifiers. MultiQC tries to get rid of those standardized extensions so that you end up with that core identifier and then everything lines up nicely across the different modules, especially in that top table called general statistics. But it’s generalized so we have to do our best and sometimes different pipelines have different extensions which are added on. If you see that happening, especially in general stats that rows aren’t lining up or you see duplicate samples, which should be just one, you can tell MultiQC what your custom extensions are in this config and clean them up. You get really nice clean, short sample identifiers with no additional cruft.
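To illustrate the idea of cleaning sample names, here is a simplified sketch. It is not MultiQC's own code; it only mimics the "cut the name off at a known extension" behaviour, and the `.sorted` and `_trimmed` suffixes are invented examples of pipeline-specific additions. The matching config key, to my knowledge, is `extra_fn_clean_exts`.

```python
def clean_sample_name(file_name, extensions):
    """Truncate a file name at the first occurrence of each extension."""
    name = file_name
    for ext in extensions:
        # Keep only the text before the extension, dropping everything after.
        name = name.split(ext, 1)[0]
    return name


DEFAULT_EXTS = [".fastq", ".bam"]        # stand-ins for the built-in list
PIPELINE_EXTS = [".sorted", "_trimmed"]  # hypothetical pipeline-specific suffixes

# The equivalent additions expressed as MultiQC config.
EXTRA_CLEAN_EXTS_CONFIG = """\
extra_fn_clean_exts:
  - ".sorted"
  - "_trimmed"
"""

for log_name in ["sampleA_trimmed.fastq.gz", "sampleA.sorted.bam"]:
    cleaned = clean_sample_name(log_name, DEFAULT_EXTS + PIPELINE_EXTS)
    print(log_name, "->", cleaned)
```

Without the extra suffixes the two files would end up as `sampleA_trimmed` and `sampleA.sorted`, which is exactly the kind of mismatch that stops rows lining up in the general statistics table.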

13:15 Some people get really annoyed. MultiQC has to deal with massive numbers of samples - everything say from one or two samples up to thousands - and tables get really unhelpful when they’re super, super long. You can no longer summarize and take an overview view, which is the whole point of MultiQC. By default MultiQC, when a table gets to – I think 500 rows is the default, something like that – it will, instead of doing a table, generate what’s called this beeswarm plot which is like a dot plot. If you find that really annoying, you can push up that threshold at which that switch happens to effectively disable beeswarm plots. A few people have done that within nf-core pipelines.

13:56 Here’s some stuff I didn’t find, which I thought might be nice to have. Take note developers, even if you think you already know everything there is to know about MultiQC. One of the things MultiQC does by default at the top of every report is say when you ran it and show the input paths that you gave it, the directory where you told it to search for files. Now for Nextflow, because analysis always runs within temporary work directories, usually the place it runs is not really very interesting at all. It’s just gonna be work and then some long hash identifiers. It might be nice just to turn that off: you can set show_analysis_paths to false and MultiQC will not print that at the top of the reports. By default in the nf-core template, we have a report comment at the top saying this report was generated by this pipeline, but you can also go further than that. You can add comments to specific modules within your reports and you can add as much or as little detail as you like here. This is a great way of documenting the results of your custom pipeline. We have the documentation on the nf-core website, sure. But you can embed stuff within the report here so that when people are reading through, you can say: in this pipeline, we’re running this tool in this way and this is what you should look for. More documentation is always better. Let’s see some section comments in there, that’d be great.
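A small sketch of the config this describes. The section anchors (`star_alignments`, `featurecounts`) and the comment text are assumptions for illustration; the anchors in your own report may differ.

```python
REPORT_TEXT_CONFIG = """\
# Hide the work-directory paths that Nextflow runs produce.
show_analysis_paths: false

# Free-text comments shown above specific report sections, keyed by section anchor.
section_comments:
  star_alignments: "Reads were aligned with STAR; check that most reads map uniquely."
  featurecounts: "Counts come from featureCounts run at the gene level."
"""

print(REPORT_TEXT_CONFIG, end="")
```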

15:23 We don’t really ever seem to customize the report logo. I was thinking that would be something easy to do. We could stick the nf-core pipeline logo up at the top of the report if we wanted to. Then there’s customizing the plots themselves. MultiQC is designed to be very extensible and very customizable and that extends to every single plot. If you know the identifier for the plot that you’re interested in, you can tell MultiQC: actually, I want this to be the title. Actually, I want the axes to have these labels. You can customize pretty much every aspect of the plots, even when they’re coming from a built-in module. You might be able to tweak certain things here and there to make them more understandable, better suited to your outputs. On a similar line, you can also customize the tables. Maybe you have percent duplicates reported twice by two different tools and you only want it once, or something is not useful because of this or that. You can tell MultiQC to ignore or hide certain columns within your tables, which might be good.
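A hedged sketch of plot and table overrides. The plot identifier `picard_insert_size`, the labels, and the choice of column to hide are assumptions; look up the real identifiers in your own report before using anything like this.

```python
PLOT_AND_TABLE_CONFIG = """\
# Override the title and axis label of one plot, addressed by its identifier.
custom_plot_config:
  picard_insert_size:
    title: "Insert size distribution"
    xlab: "Insert size (bp)"

# Hide a general statistics column that another tool already reports.
table_columns_visible:
  FastQC:
    percent_duplicates: false
"""

print(PLOT_AND_TABLE_CONFIG, end="")
```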

16:28 Something else which is used quite a lot within nf-core and actually has been a wildly successful feature of MultiQC, is the ability to inject custom report sections without needing to write a module. Without needing to write any Python code. This is called custom content and would typically be something like output from pipeline scripts. Maybe you’ve written a custom R script or Python script within your workflow. It’s not a general tool outside of the pipeline. If it was, it’d be better to write that as a MultiQC module so that everyone can benefit from it. But it’s just like a really specific niche thing. Then you can generate and you have control of the output. Then you can insert that into the MultiQC report using custom content. It can be a config file, it can be JSON, it can be custom HTML, it could be images if you want. Now I generally dislike having images in MultiQC reports because they really bloat the HTML file size. If you do images, please make sure you don’t have one per sample because quickly that will just crash the browser that tries to open the reports. All you have to do is append to your file name _mqc.json or YAML or whatever the file format is. As long as your file content looks roughly right, MultiQC will try and figure out what to do with it. You can also configure lots of stuff. Again, you can tweak and make all the plot axes and titles exactly as you want. Different ways to do that with different file formats, check the documentation and especially check this repo which has the test data which MultiQC uses. Custom content is difficult to document because you can do anything. You can’t document everything. But what I do have in this repository is lots of different examples that I’ve made over the time. You can dig around and find different ways of doing things and modeling your custom content on that.
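For example, a pipeline script could emit a custom content file along these lines. The section identifier, sample names and counts are invented for illustration, and the exact keys accepted are best checked against the custom content documentation and the test data repository mentioned above.

```python
import json

# Everything below describes one report section; the numbers are made up.
custom_section = {
    "id": "read_filtering_summary",
    "section_name": "Read filtering summary",
    "description": "Reads kept and removed by a pipeline-specific filtering script.",
    "plot_type": "bargraph",
    "pconfig": {
        "id": "read_filtering_summary_plot",
        "title": "Read filtering",
        "ylab": "Number of reads",
    },
    "data": {
        "sample_1": {"kept": 950000, "removed": 50000},
        "sample_2": {"kept": 870000, "removed": 130000},
    },
}

# The _mqc suffix is what lets MultiQC pick the file up as custom content.
OUTPUT_NAME = "read_filtering_summary_mqc.json"
custom_json = json.dumps(custom_section, indent=2)

print(f"# contents for {OUTPUT_NAME}")
print(custom_json)
```

Keeping one summary file for all samples, rather than one file or image per sample, follows the advice above about report size.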

18:26 That was all for people developing pipelines. What about if you’re running an nf-core pipeline? What can you do to tweak your own personal MultiQC reports separate from the rest of the nf-core pipeline community? Basically all the nf-core pipelines, because it comes from the template, have a parameter called --multiqc_config. Using that, you can give a custom YAML file. It’s important to say that this is additional to the config which ships with the pipeline. The pipeline might be doing its own configuration stuff and then you can add your own config on top of that and they work together. You can do stuff like conditional formatting, for example, which is something we use at the NGI. In-house, if you’re running the same pipeline for the same data type, you might say samples fail if they have under 80% alignment. I want to flag those so that they stand out nicely in red here, and maybe warn on stuff which is between 80 and 90% alignment. Dead easy to do: for any table in the MultiQC report you can have these conditional formatting rules, and you just get the identifier for that column and set up the different rules.
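The 80% and 90% thresholds from this example might be written roughly like this. The column identifier is a placeholder, and I am assuming that a matching `fail` rule wins over a matching `warn` rule, which is worth confirming in the documentation.

```python
COND_FORMAT_CONFIG = """\
table_cond_formatting_rules:
  # Placeholder column ID; copy the real one from your own report.
  mqc-generalstats-uniquely_mapped_percent:
    pass:
      - gt: 90
    warn:
      - lt: 90
    fail:
      - lt: 80
"""

# Save the printed YAML to a file and hand it to the pipeline with --multiqc_config.
print(COND_FORMAT_CONFIG, end="")
```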

19:39 You can add project level information. If you are generating MultiQC reports from a LIMS, for example, or you have your own custom analysis, you might want to say, okay, this project was called this, and you might want to add some comments about what exactly it was that you did, or even put in custom sample names which are different to the identifiers that MultiQC finds. I’ll show an example of this in a second. You can also style the report. You can put in a custom logo, as I mentioned earlier. You want to have your institute logo in the corner of the MultiQC report? No problem. You can actually now, as of last year’s release, just have a custom CSS file. If you know a little bit of web development you can style stuff completely differently and have different background colors and just hack on the default template for MultiQC quite easily with a little bit of additional CSS. If you want to take it a step further you can actually develop your own entire template and supply that to MultiQC: a different Jinja template that really changes what goes into the report and how it’s rendered.

20:51 Quick example of some customization. This is an example report which you can actually see on the MultiQC website. If you go to the examples in the top menu, it’s the NGI one. This is taken from the reports we generate at SciLifeLab at the NGI where I work. These are some of the things that we’ve done in our config to add additional information into the report, which is useful for our users. This happens, again, on top of the nf-core pipelines. The most obvious one is we add a title. In this case, we have a project identifier and a nice title, and that’s done with the config attribute title. We have a subtitle under there with a little bit more information. In our case, I’ve removed identifying information here, but this would normally be a project title where the PI has said what the project is about. Here we have a report comment, which is similar but just longer format, with slightly different styling. This one comes from before nf-core, because this example is pretty old. The Nextflow pipeline has added this, but you could customize this to be whatever you want with report_comment. We’ve put in a logo, and with that the logo has a URL and a title. If you hover over it, it shows the title, and if you click it, it will take you to the custom URL, which in this case is the homepage. We’ve got this little panel here of custom information, which is called report_header_info, and this can be any key value pairs you want. This ties in really well with a LIMS, if you have custom report level information that you wanna show just to summarize information.
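Pulled together, the branding options walked through here could look something like the sketch below. Every value (project ID, logo path, URL, header entries) is a placeholder, not taken from the NGI report.

```python
BRANDING_CONFIG = """\
title: "P1234: RNA-seq QC"
subtitle: "Differential expression pilot"
report_comment: "Generated by the facility's standard RNA-seq workflow."

# Logo shown in the report header, with a link and hover text.
custom_logo: "/path/to/institute_logo.png"
custom_logo_url: "https://www.example.com"
custom_logo_title: "Example Institute"

# Arbitrary key/value pairs, for example pulled from a LIMS.
report_header_info:
  - Contact E-mail: "support@example.com"
  - Application Type: "RNA-seq"
  - Sequencing Platform: "Example platform"
"""

print(BRANDING_CONFIG, end="")
```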

22:33 You might also notice there’s a couple of extra buttons up at the top here. That has been done with something called --sample-names, where you give MultiQC a tab separated file with all your expected sample identifiers and then alternative sample names. The column headers are then used for the buttons at the top. If I click, in this case, “user supplied names”, which is something custom I’ve labeled it, then you see all the sample names down there switch. We by default have NGI identifiers, which is what’s useful for us. But then our end users might not really know what those are. They can click that button and see all the sample names that they supplied to us really quickly, really easily. All that does is just pre-populate the MultiQC toolbox really quickly with lots of different sample matches. Dead easy to do and can be very, very helpful. Of course, this is an example of going to town with customizing your report output, just to give you a flavor for what’s possible if you really, really go for it. This is a little Easter egg in MultiQC. See if you can work that out: --template. Okay, I won’t be too much longer. I’m running over a bit, sorry.

23:49 Looking to the future, a couple of things to look forward to with MultiQC. Those of you who have heard me talk before might recognize some of these slides here. Most of this stuff has been planned for MultiQC since about 2018 or 19, which by coincidence is around the time that another one of my projects started taking off. That one is called nf-core and it sucked up some of my time. Anyway, this is stuff which is being actively worked on and will happen. It’s stuff I’m excited about. To kick us off is refactoring the code base, so that it works more as a Python library rather than purely a command line tool. Now, if you want to, if you’re using Jupyter notebooks or custom Python scripts, you can import MultiQC and you can run it like this, in a programmatic way on a folder, and it will generate the reports. What you can’t do yet is generate a MultiQC report object, and then pull out specific stats and specific plots on demand and use all that internal functionality that’s there. At the moment, that’s a bit tricky, but I’m hoping to get there soon. It’ll be a really useful interactive or script-based analysis tool as well as a command line tool.
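The slide itself is not reproduced in the transcript, so the following is only a guess at the kind of call being shown. It assumes the package exposes `multiqc.run` and that it accepts a folder path as its first argument; the `runner` parameter is my own addition so the sketch can be tried without MultiQC installed.

```python
def run_report(folder=".", runner=None):
    """Run MultiQC on a folder from Python rather than from the shell."""
    if runner is None:
        # Imported lazily so this sketch still loads where MultiQC is absent.
        import multiqc

        runner = multiqc.run  # assumed entry point, see the note above
    return runner(folder)


# Stand-in runner for demonstration; drop the argument to call MultiQC itself.
print(run_report("results", runner=lambda folder: f"would run MultiQC on {folder}"))
```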

25:09 The other big one is MegaQC, which is my poor forgotten child that has been a bit abandoned, but despite my best efforts to ignore it, it is being picked up by others in the community and is being actively developed by a small but slowly growing core of end users across the world. Michael Minton in the States is probably one of the key contributors and also core to the Northern and Norway. Anyway, MegaQC, what does it do? When you run MultiQC, you get one report object and that’s frozen in time. You’ve got the samples you ran it on in your project and that’s it. But many people are running in a facility, doing clinical work or whatever. You’re running MultiQC the whole time, hundreds of times a day, and you’re generating this longitudinal data and you wanna track things across projects. You can’t do that in MultiQC alone, but there is a companion tool, MegaQC, which is a regular running web server tool. When you run MultiQC, it can send the results to this tool as JSON over an API. All that is then stored in a database for you to interactively query, view and plot. This is quite an old demo I did for a talk a while ago, but this shows pulling up plots which I’ve set up in MegaQC and saved as favorites. It has an interactive tool for generating dashboards. This is really cool. Say you wanna have a TV up in your lab or something showing statuses so you can keep track of whether the trend lines are behaving properly or whatever. You can really quickly drag and drop a quick dashboard together with your favorite plots and whip it up. That saves, and then you have a static HTML webpage which you can then load and play around with. You can see the different types of plots here. We’ve got single values plotted against one another, bar graphs, distributions, all sorts. You can really get the most out of all the MultiQC data which is being found in your samples, and visualize it and interrogate it. That is sort of ready to go now, but it’s still being actively worked on in a big way.

27:23 Right, with that, I’ll wrap up. I’m happy to take any questions. Check out the MultiQC website. Like I say, all of this is documented. Have a read through there, see if you can find anything new. All the code is open on GitHub and there’s a Gitter chat for MultiQC, which is a good way to get my attention for quick questions. I’m happy to respond there. Thanks very much for listening.

(host) Thanks a lot, Phil, for this introduction to MultiQC and showing also advanced tools and characteristics of MultiQC. I’m sure we all learned something today.

28:01 (question) We do have one question in the chat. They were wondering about this example that you showed on quickly changing sample names. What configs or files would we need to generate to actually change the sample names?

(answer) Right, so you can do it a couple of ways. This is off the top of my head. I think you can do it in the MultiQC config, but the way that I would recommend doing it is with this flag, --sample-names, when you run MultiQC. It’s a tab separated file, where the first column should be the identifiers which MultiQC itself is finding. In this case, you know, we run with the LIMS, so when we run MultiQC we know which samples to expect in this project. We know those identifiers. In the next column along, you have the equivalent names on the same row, and each column will get its own button along the top, which you’ll then be able to switch through. Thinking about it now, this might be slightly difficult to do within nf-core pipelines, because this is an additional file and flag to provide to the MultiQC module. You might need to look into doing that within the YAML file, within the config file, which you can give to MultiQC. I’m pretty sure you can do it, but I would have to check to be certain. If you can’t, then maybe let us know and we can look at either putting that into the nf-core module, or I can look into whether it’s possible to do with a MultiQC config file.
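A sketch of how such a file could be generated, for instance from a LIMS export. The identifiers, names and button labels are invented, and I am assuming the flag is spelled `--sample-names` on the MultiQC command line.

```python
import csv
import io

# The header row supplies the button labels shown at the top of the report.
BUTTON_LABELS = ["Facility IDs", "User supplied names"]

# First column: identifiers as MultiQC finds them. Second: the alternative names.
RENAMES = [
    ("P1234_1001", "control_rep1"),
    ("P1234_1002", "control_rep2"),
    ("P1234_1003", "treated_rep1"),
]


def sample_names_tsv(labels, rows):
    """Return tab separated text with one header row and one row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(labels)
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = sample_names_tsv(BUTTON_LABELS, RENAMES)
print(tsv_text, end="")
```

The resulting text would be saved to a file and passed as `multiqc . --sample-names <file>`.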

(host) Thanks a lot.

29:41 (question) We also have another question, from Moritz. Any recommendations for large Nextflow pipelines and MultiQC? Usually we use collect to mix everything and pass it to MultiQC. However, this can sometimes crash with many samples.

(answer) Yes. The way that Nextflow works has always been a bit ugly for MultiQC, because Nextflow is very explicit about your files and you need to stage them as inputs and everything. MultiQC works really nicely when you’re running it interactively and you just have a folder and you run MultiQC, but with Nextflow, you need to be really explicit about staging those inputs. The short answer is no, I don’t have anything better than that, I’m afraid, because you need to stage them. I’ve talked to Paolo about this various times over the years. We’ve discussed ways to make it easier, but never really come up with anything better. As for MultiQC itself, if you want to give explicit file names and there are very many of them, people have run into problems with bcbio and with Galaxy and so on, where the command line gets so ridiculously long that it crashes bash or whatever environment it’s running in. In that case, you can put all the file names into a single text file and then do --file-list textfile, and it will go through all of those. But that still doesn’t really help with Nextflow because you still need to stage those files as inputs. You have to declare them as inputs. Yeah. Sorry, that’s the best I’ve got.
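The file-list approach can be sketched like this. The log paths and the `multiqc_inputs.txt` name are made up, and I am assuming the flag is `--file-list`, taking a text file with one path per line.

```python
def file_list_text(paths):
    """Return the contents of a file list: one path per line."""
    return "".join(f"{path}\n" for path in paths)


def file_list_command(list_file):
    """Build a short MultiQC command that reads its inputs from a list file."""
    return ["multiqc", "--file-list", str(list_file)]


# Hypothetical per-sample logs; in practice there could be thousands.
logs = [f"results/sample_{i}/star/Log.final.out" for i in range(1, 4)]

print(file_list_text(logs), end="")
print(" ".join(file_list_command("multiqc_inputs.txt")))
```

The command line stays the same length however many inputs there are, which is the point of the approach. As noted above, it does not remove the need to stage those files as inputs in Nextflow.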

(question cont.) Right, so in that case, it’s not possible to just pass the whole folder?

(answer cont.) You probably can do that, yes. I’m mostly thinking about… that’s a good point. I’m mostly thinking about lots of different processes, because you need to stage each one of those process outputs in. But you’re right, if you have lots of different files, then you can certainly just stick them in a directory and pass that one directory, as long as it gets staged as an input. The MultiQC command is dead easy, just do multiqc . because you’re working within that isolated work directory. That should work fine.

(host) We should explore this for nf-core pipelines then.

31:57 (host) Okay, if you have any other questions, you can also go ahead and unmute yourselves. I’ve just given rights for that. In the chat so far, we don’t have any other questions.

(speaker) I’ll pop these slides up on… sorry…

(question) How long did it take you to make the 90s mode of MultiQC?

(answer) I did that back in the early days when I had lots of free time still. Actually less time than you would expect because the default template is rendered with Bootstrap, a CSS framework. Someone else had already made a Bootstrap theme using all the right class names and everything called geocities, if you’re old enough to know what geocities is. I hijacked that and then just added on a bit of extra flair on top. It actually wasn’t too bad and it’s nice. I do like sticking easter eggs into software tools. A bit of MC Hammer never goes amiss.

(question cont.) Okay.

33:04 (speaker) What I was gonna say is I’ll put slides up as a PDF onto the Slack channel, on the bytesize Slack channel.

(host) Yeah, perfect. Seems like there’s no other questions.

(host) Thanks again. As you mentioned, the slides will be uploaded and the talk available also. We can continue any further questions on Slack. Thanks a lot, everybody.

(speaker) Thank you very much.