Join us for our weekly series of short talks: nf-core/bytesize.

Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These are recorded and made available at https://nf-co.re, helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!

This week, Chris Cheshire (@chris-cheshire) will present: nf-core/cutandrun

Chris is going to give an overview and highlight the new features in the recent release of nf-core/cutandrun. nf-core/cutandrun is a pipeline for the CUT&RUN and CUT&Tag experimental protocols, which were developed to study protein-DNA interactions and for epigenomic profiling.

Video transcription
Note

The content has been edited to make it reader-friendly

0:01 (host) Welcome, I’m Franziska, I’m today’s host, and with me is Chris Cheshire, who’s giving us an intro to the newest developments in the nf-core/cutandrun pipeline. And it’s over to you now.

0:15 Thank you very much. I will just share my screen. Is that all good? Can you see my screen?

(host) Yeah. Yes, we can.

Okay. Hello, everyone. Thanks for the introduction. My name’s Chris Cheshire, and it’s a pleasure to be here. This is probably the first time that I’ve presented this pipeline, even though it’s consumed about a year and a half of my life on and off, so I’m extra happy to be showing it to someone today. The plan for today: I’m going to go through the concept of the pipeline and the reason it was developed. I’m going to walk through some of the key features with you and discuss some of the more interesting points of the pipeline; it’s not going to be exhaustive. Then I want to go through some of the new features in the version 2.0 release, which I’m going to shamelessly plug throughout this presentation. I also want to go through some of the testing and automation features, which I developed, and, of course, some of the future plans for the pipeline. Without further ado, a little bit about me.

1:30 I’m a postdoctoral research fellow in the Briscoe lab, James Briscoe’s lab at the Crick Institute, which is in London. My focus is single-cell multiomics. I’m trying to develop a rapid-prototyping single-cell system that will target ATAC, multi-target cut-and-run or cut-and-tag, and the transcriptome, all simultaneously from the same cell. But when I first started this project, I knew nothing about cut-and-run other than the technique and the results it produces, despite the fact that it’s an integral part of the project. I thought the easiest thing to do would be to find a project that got me into cut-and-run or cut-and-tag data analysis, and this was it. I’d been using nf-core for a while for some other projects, and I realized that there was no nf-core pipeline for this experimental protocol. It went from there, really, and it just snowballed, but it’s been a really interesting project to work on and it’s still ongoing.

2:41 Just a brief overview of cut-and-tag and cut-and-run. The idea is that they are successors to the ChIP-seq assay, and the main difference between ChIP-seq and these assays is that you get much lower background binding and non-specific cutting. The basic protocol is that we wash antibodies over either a target transcription factor or a histone mark, and then we add an enzyme that has protein A attached to it. The protein A binds to the antibody, and the enzyme then hangs around, localized to this targeted area where the TF or the histone mark is. Then you add an ion (in the case of MNase here, it’s calcium), and that activates the enzyme and causes it to cut the open chromatin or the nucleosome around where the target was. Then you can get rid of everything else and sequence the products, and you get a very accurate position, down to the nucleosome level, of where the TF or the histone was.

4:06 Now, the difference between cut-and-run and cut-and-tag is that cut-and-run uses MNase as the cutting enzyme, while cut-and-tag uses a transposase, activated by magnesium instead of calcium. I won’t go into it, but they have different advantages. There’s some evidence to show that cut-and-tag is better for transcription factors than cut-and-run, but that’s all outside the scope of this presentation. The key thing to note is that the upstream bioinformatic processing is exactly the same for both protocols. That’s another reason why I wanted to do this pipeline: it kills two birds with one stone. You get two pipelines for the price of one, which I thought was pretty cool. These approaches are really growing in popularity, especially at the Crick, but I think globally as well. And, as I said before, there was no nf-core pipeline for this.

5:04 An overview of the pipeline. This kind of diagram is becoming popular now, the tube-map diagram, and we can see here the general flow of the pipeline. I’m going to go through it bit by bit, so I won’t spend too much time on this slide. In general, we have trimming and QC at the beginning, then alignment in the middle here, and then we need to gather up the reads, remove duplicates, filter, and do things like that. At the end, we finally call the peaks, and we do a bunch of reporting all along the way.
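For anyone following along at home, here is a minimal sketch of how launching the pipeline looks, assuming Nextflow plus Docker (or Singularity/Conda) are installed; the flags follow standard nf-core conventions, and the bundled test profile runs a small example dataset.

```bash
# Quick smoke test on the bundled test data (standard nf-core test profile)
nextflow run nf-core/cutandrun -profile test,docker --outdir results_test

# A more realistic launch against your own data; --input points at the sample
# sheet discussed in the next section
nextflow run nf-core/cutandrun \
    -profile docker \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir results
```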

5:46 The first bit I want to talk about is the sample sheet. I wouldn’t normally discuss this, but it’s actually one of the new features in the pipeline, so I wanted to touch on it. This is the new version of the sample sheet. The sample sheet allows you to define where your samples are and what the structure of the experiment is going to be. This is very similar to all the other nf-core pipelines around; it’s half standardized, I feel. You can merge technical replicates or merge data from multiple lanes, which is a feature of most nf-core pipelines, and you do that by having the same sample ID and the same replicate number, like in the top two rows here; these two will then automatically be merged together as one sample. That’s really useful when your sequencing comes from multiple lanes and things like that, which happens a lot.

6:41 The other main feature is that we can assign control groups, and control groups are really important in cut-and-run and cut-and-tag. You almost always have an IgG background control for these experiments, so the ability to assign that in various different ways is really important. The pipeline can auto-detect when a control is given, and it detects that from the fact that a group is referenced as a control in the final column of one of the other samples. The other thing to note is that controls are automatically matched to their replicates. We have the wildtype here, which has two replicates, one and two, and even though we don’t explicitly assign control replicates, replicates one and two will be matched to control replicates one and two, which is quite useful. Also, if we only had one control replicate, then that control would be applied to both, which is also useful because sometimes you don’t have multiple replicates of IgG. The other main feature is that there’s some robust error checking in there, which again is a requirement and a feature of most nf-core pipelines. I went over this because it’s changed from the previous version, and I’ll highlight that again later, but that’s the sample sheet checking.
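As a concrete sketch of what was just described, here is a hypothetical sample sheet: the top two rows share a group and replicate number, so they are merged as one sample (for example, two lanes), and the final column points the wildtype samples at the igg group as their control. The column names are how I recall the 2.x usage docs and may differ in your release, and the file names are made up, so check the pipeline documentation before copying this.

```bash
# Hypothetical sample sheet; file names and column layout are illustrative only
cat > samplesheet.csv <<'EOF'
group,replicate,fastq_1,fastq_2,control
wildtype,1,wt_rep1_L001_R1.fastq.gz,wt_rep1_L001_R2.fastq.gz,igg
wildtype,1,wt_rep1_L002_R1.fastq.gz,wt_rep1_L002_R2.fastq.gz,igg
wildtype,2,wt_rep2_L001_R1.fastq.gz,wt_rep2_L001_R2.fastq.gz,igg
igg,1,igg_rep1_L001_R1.fastq.gz,igg_rep1_L001_R2.fastq.gz,
igg,2,igg_rep2_L001_R1.fastq.gz,igg_rep2_L001_R2.fastq.gz,
EOF
```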

8:01 The next stage of the pipeline is the trimming and the initial quality control, as well as the merging of the samples. This again is standard for a lot of genomics pipelines. Illumina sequencing libraries carry adapters, and most people sequence on Illumina, so these need to be trimmed off. This is standard, and you need to do QC before and after. The reason I wanted to touch on it is that I wanted to talk about how the pipeline is designed and its design principles. There are many paths for downstream analysis, as we all know in genomics. Once you get to a critical point, the paths of analysis diverge depending on the scientific questions that you want to answer, but the upstream analysis, up to the point at which it diverges, will always be the same. So with this pipeline, instead of providing a load of features for downstream analysis that are difficult to test, because they’re situation-specific, I wanted to really focus on the upstream data quality.

9:11 For this pipeline, that critical point is when the peaks are called. I want to produce really robust peaks that you can trust, supported by a lot of quality control and transparency around how those peaks were calculated. That was the main aim of the pipeline. Because of this, we have a proper development cycle for the pipeline, where we can test and integrate new features, perform maintenance, analyse how the new features behave out in the world, and then design and implement new features, in a cycle. If we were having to test downstream analysis routes all the time and things like that, this cycle would break down.

9:53 Going on from that, the principles the pipeline was designed around are, first, repeatability: it needs to not fail, and it needs to do the same thing over and over again. To be able to trust it, it also needs to be reproducible, which is a core principle of nf-core as well. This is what nf-core and Nextflow enable: we can run this pipeline on clusters or laptops, it doesn’t matter, and it should run the same as long as you have some key minimum installation requirements. The other two principles, which I was talking about just now, are that it needs to be transparent, so we know where the results came from, and that we need to get insight into those results.

10:31 The way that I’ve done that is by providing lots and lots of reporting. You can see here on the diagram the little stops that have a pie chart in them; these are all the points in the pipeline where reporting is produced. The reporting comes in the form of charts, tables, and various other things, including MultiQC reports, if you’re familiar with those. This really allows someone to get a good view of exactly what’s going on at every stage of the pipeline. And if something is not clear, then that’s a problem, and we try to fix it as quickly as possible.

11:06 Onwards to the main flow of the pipeline again. The next stage is alignment, which I won’t go into too much; it uses Bowtie 2 and a standard alignment procedure. There are some interesting parameters, described in the documentation, around how Bowtie 2 is run on the reads that you get from this type of experiment, but that’s outside the scope of this presentation. After that, we go on to filtering. We filter out reads below a minimum quality score, and we also remove duplicates from some of the samples, but not all of them. One of the key things to note is that normally you would just remove duplicates regardless, because you want to get rid of all the PCR duplicates. But the trouble with cut-and-run is that, because it’s targeted, you get very close stacking of reads over the same sites. So even though this is valid data, depending on the parameters of the duplicate removal, you may find that it gets filtered out when it shouldn’t be. We therefore don’t remove duplicates on the target samples, unless there is clear evidence of PCR duplication that’s too heavy to ignore; then you can, of course, turn it on in the pipeline.
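If you do see PCR duplication that is too heavy to ignore, duplicate removal can be switched on for the target samples as well. I’m recalling the switch as --dedup_target_reads, so treat the name as an assumption and confirm it in the parameter docs for your release.

```bash
# Illustrative only: also deduplicate target samples (controls are deduplicated regardless)
nextflow run nf-core/cutandrun \
    -profile docker \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir results \
    --dedup_target_reads true
```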

12:26 This is all standard stuff. Something I really want to talk about is the read normalization, because this is something that we changed in version 2.0. One of the main stages of the pipeline is that the aligned reads are stacked up, and we get what’s called a bedGraph out of it, which basically shows you, for each region, how many reads stacked on top of each other. You can imagine this creates a histogram, and this histogram is what’s used when we call peaks with the various peak callers, MACS2 or SEACR. This signal needs to be normalized in some way, and there are quite a few different sources of normalization error in these experiments. The first is experimental batch effects: if you use different enzymes, different antibodies, different batches of antibodies and things like that, they can produce different results. That’s outside the scope of the pipeline; it’s quite difficult to fix once you get to the bioinformatic stage in this class of experiments, so I’ll move on from that.

13:35 The other really big thing that we need to account for is epitope abundance. Some epitopes that you target, such as some histone marks, are really quite ubiquitous across the genome, while some rarer transcription factors are much more targeted, and so you’re traditionally going to get fewer reads associated with those lower-abundance epitopes. Yet they’re just as important if you’re trying to compare them. One of the main tasks we have is to normalize between them, so that we don’t get tiny little peaks, or no peaks at all, called for a low-abundance transcription factor when we actually do want to detect those sites. The original way to do this was spike-in normalization, which goes back to the ChIP-seq days. The spike-in is some E. coli DNA left over from the process of producing the enzyme, either the MNase or the transposase. If you keep the amount of enzyme constant, then the ratio between the amount of the epitope and the amount of spike-in DNA present determines how many cuts you get on the E. coli DNA versus how many cuts you get on your target genome, and you can use this to normalize against how much of the epitope was present.
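To make the spike-in arithmetic concrete, this is roughly how spike-in calibration is commonly computed for this kind of data, not necessarily the exact commands the pipeline runs; the constant is arbitrary, since only the relative scaling between samples matters.

```bash
# Count mapped, non-secondary reads aligned to the E. coli spike-in genome
SPIKEIN_READS=$(samtools view -c -F 260 sample.spikein.sorted.bam)

# Scale factor is inversely proportional to spike-in depth (10000 is an arbitrary constant)
SCALE=$(echo "10000 / $SPIKEIN_READS" | bc -l)

# Build a scaled bedGraph of target-genome coverage (BAM must be coordinate-sorted)
bedtools genomecov -ibam sample.target.sorted.bam -bg -scale "$SCALE" \
    > sample.spikein_norm.bedgraph
```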

15:13 But there are some big problems with this. Number one is that the newer cut-and-run and cut-and-tag kits are processed so that you don’t really have much spike-in left in them at all; it’s all been cleaned out. That’s a big problem, and we’re starting to see it a lot in the pipeline: people come to me with projects where they can’t normalize properly against the spike-in because there isn’t any. Some people have realized this and have started spiking in their own DNA, but that comes with its own problems around getting the correct amount spiked in. The other requirement when you’re normalizing epitope abundance with a spike-in is that you need the same amount of material, the same number of cells in the experiment, for the normalization to work. That’s just not the case in a lot of experiments, especially with tissue samples and things like that; you just can’t guarantee it. So again, we’re seeing that this normalization is really hard to achieve.

16:12 In the new version, version 2.0, we’ve started to include options for normalizing against read counts and read depths across the genome using deepTools, and we’ve found this to be quite successful so far. It’s not as complex as normalizing against spike-in DNA; you’re literally just normalizing against the read depths between different samples, which, if you’ve got different abundances of epitopes, is obviously going to cause other problems. But it’s better than no normalization, and it’s proving successful so far, although there’s quite a lot of manual tweaking involved. I wanted to highlight that these are the main questions we’re thinking about in this pipeline. This is not finished; we’re going to carry on trying to work out the best way of getting the most robust, trustworthy peaks out of the pipeline.
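Illustratively, switching the pipeline from spike-in to read-count normalisation looks something like the run below. I’m recalling the parameter as --normalisation_mode with values along the lines of Spikein, CPM, RPKM, BPM or None, so verify the exact names against the release you are running.

```bash
# Illustrative only: normalise by counts per million instead of the E. coli spike-in
nextflow run nf-core/cutandrun \
    -profile docker \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir results_cpm \
    --normalisation_mode CPM
```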

17:07 Now, I’m aware that I’m probably running out of time. Yes, I am, so I’ve got to move a little bit quicker. The final major stage of the pipeline is the peak calling. I wanted to highlight this because the original peak caller, SEACR, is produced by the Henikoff lab, who developed cut-and-run and cut-and-tag, and some people were having issues with it or just wanted to use MACS2, which is the standard peak caller for high-background experiments like ChIP-seq and ATAC-seq. So we included MACS2 as an extra peak caller, and you can actually run both peak callers in parallel in the pipeline if you want, to compare the results. That’s another major change. The aim of this last stage is to give us some really trustworthy peaks. We’ve tried to normalise as best we can; when we call the peaks, we call them against the IgG background if it’s provided, which is another form of normalisation; and we can also do consensus peaks, asking how many of these peaks are present across our replicates. We can be stringent if we want, so that a peak has to be present in all replicates to be trusted. As you can see, what we really concentrate on is trusted peaks and transparency through the reporting.
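Running both peak callers side by side might look like this; the parameter name and the comma-separated syntax are how I remember the docs, so treat them as assumptions to check before use.

```bash
# Illustrative only: call peaks with SEACR and MACS2 in parallel and compare the results
nextflow run nf-core/cutandrun \
    -profile docker \
    --input samplesheet.csv \
    --genome GRCh38 \
    --outdir results_peaks \
    --peakcaller 'seacr,macs2'
```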

18:28 Here’s the key feature summary for version 2.0. This version is not out yet; it’s going to be out in the next few days, hopefully. I’m trying to find time to go through all the final changes before it can be approved for release, but hopefully it will be released next week. There’s the sample sheet system redesign, we’ve got the additional read normalisation options that I’ve just been through, we’ve got additional peak caller options, and we have loads of bug fixes, performance optimisations and things like that. Another shameless plug for version 2.0: go ahead and use it, and please do let me know if there are any problems. I’ll just touch on this very briefly because I’ve got to finish: a note on testing, for the pipeline developers out there. I basically took the YAML-based pytest testing that we do in nf-core/modules and applied it to the pipeline. We now have 213 tests that run for every code change we make to the pipeline, and I think it’s really made the pipeline a lot more robust, especially because it’s just me, or just a couple of us, working on it. I really think that was important. If you’re developing pipelines and have any questions, please come and contact me, because I do think this gives quite a good advantage.
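For the curious, my reading of this setup is that the pipeline-level tests are pytest-workflow runs launched from a clone of the repository; the tag below is illustrative, as the real tags live in the YAML files under the repository’s test directory, so adapt as needed.

```bash
# Sketch of running a tagged subset of the pytest-workflow suite locally
pip install pytest pytest-workflow
git clone https://github.com/nf-core/cutandrun.git && cd cutandrun
pytest --tag test_default --symlink --keep-workflow-wd
```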

19:47 Finally, news and the future. The version 2.0 release is imminent, as I’ve already said. We really need developers: it’s just me and one other developer, Tamara, working on it, and we really want to push these features forward, but we need you in the community to suggest features and help with the coding if you possibly can. We are looking at more options for peak calling, and also at some more rigidly defined downstream options, such as nucleosome positioning and transcription factor footprinting. We’re looking into it; we don’t want to go too far downstream, but these look like quite good options. Then finally, my main project is of course single-cell, and I really want to adapt this pipeline to work with single-cell data at some point. There’s a lot of discussion to be had around that, but I really would like nf-core to have a robust single-cell cut-and-tag pipeline, because I think that’s the future. Thanks for listening. I just want to thank everyone in the Luscombe and Briscoe labs. I want to thank Charlotte West, who I think is on the call, because she was the original co-developer of this pipeline; she’s now left. And also Tamara Hodgetts, who’s the new co-developer on this project. Thanks, everyone, and thanks for listening.

21:09 (host) Thank you very much. I have now enabled people to unmute themselves for a Q&A. You can, of course, also write in the chat, and I will read out the questions. Are there any questions?

(question) I have one. You’ve introduced these two new normalization options, normalizing with read count or read depth. Do you have specific scenarios in mind as to when it’s better to use read counts and when it’s better to use read depths?

(answer) Yeah. Not at the moment. Basically, deepTools has some normalization options available that are really rooted in the RNA-seq world. There’s a bunch of normalization against the kilobase length of transcripts and things like that, the classical RNA-seq normalization techniques, and we’ve taken some of those options and applied them to set regions of the genome. At the moment we basically use a bin size of one on the genome: we calculate the read depth at that bin size of one and then normalize that region against the other samples. You can widen that bin if you want, to cover a larger number of features, but it’s really just to get the samples a little bit more in line with each other, and we’re still waiting to see how helpful those options are downstream. The other part of this feature, really, was being able to turn the spike-in normalization off, because it used to be on by default and you couldn’t change that; there’s now an additional option to turn it off altogether. The idea is that we’ve provided these extra options and people can start playing with them and come back to us about how useful they are. As a case in point, for a group that I’m working with at the Crick, we turned off spike-in normalization and just did read-depth normalization with a bin size of one. That made the samples look a lot better, but the IgG background was super high, because the relative read depth on the IgG samples is low: you get fewer reads with the IgG because the signal is spread out more. So we included an extra parameter in the pipeline to be able to scale the IgG background back down before using it to call peaks. We now have a situation where we can basically scale the IgG to change how many peaks are called on a sample, so we run with an IgG threshold of, say, 0.2, 0.4, 0.6, 0.8 and 1, look at how many peaks are called for each sample, and tune it to the experimental question we’re looking at. For example, with transcription factors you may want more peaks called so you can pick up more binding, whereas with histones you might want to raise the threshold for peak calling so that fewer peaks are called.
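For reference, the underlying deepTools call for per-base CPM scaling looks something like the sketch below; the pipeline wraps something along these lines rather than this exact command, and the BAM needs to be coordinate-sorted and indexed.

```bash
# Per-base (bin size 1) CPM-normalised coverage track for one sample with deepTools
bamCoverage \
    --bam sample.target.sorted.bam \
    --outFileName sample.cpm.bigWig \
    --binSize 1 \
    --normalizeUsing CPM
```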

(comment) It’s really interesting.

(answer cont.) Yeah. It’s an active area of development. What I would suggest, if you’re going to turn spike-in normalization off, is to use the CPM normalization mode, which is what’s recommended in the documentation, and then run it with five different IgG background thresholds, 0.2, 0.4, 0.6, 0.8 and 1, and see how it looks in IGV or however you view your peaks.
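A sketch of that scan as a shell loop; the IgG scaling parameter name used here (--igg_scale_factor) is my assumption for the extra parameter Chris mentions, so check what your release actually calls it.

```bash
# Hypothetical parameter name; sweep the IgG scaling and compare the number of peaks called
for scale in 0.2 0.4 0.6 0.8 1.0; do
    nextflow run nf-core/cutandrun \
        -profile docker \
        --input samplesheet.csv \
        --genome GRCh38 \
        --normalisation_mode CPM \
        --igg_scale_factor "$scale" \
        --outdir "results_igg_${scale}" \
        -resume
done
```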

25:07 (host) Thank you very much. Artemiy, Artemiy Golden?

(question) Yes. Yes. Thank you. I have a question about the QC. You’ve stressed that you provide so many QC reports for the user to assess, but I’m coming from the perspective of a person who has never done peak processing, and it’s very hard, after you get the reports, to assess whether they’re good or not. Could you provide some representative series of different QCs from different datasets, since you are communicating with the users, to show what good quality looks like and what bad quality might look like? It’s really non-intuitive, and I’ve tried to find documentation somewhere on how it should look, or in the papers, but people don’t write about it.

(answer) That’s a really excellent question, and yeah, it’s a great idea. I shall absolutely do that. I’ll create a new section of the documentation to show some examples for sure. That’s a really good idea.

(question cont.) Thanks a lot.

26:20 (host) Thank you. Then we have another question from Harshil Patel.

(question) Hi, Chris. Great talk, thank you. My question goes back to the normalization. When you normalize in the way that you mentioned, do you factor in global changes? Say, for example, you have a control group with a certain level of signal, and then a treatment group with systematic changes, so you have an uplift in that signal everywhere. The normalization at the base-pair level, or per region, would essentially cancel that out, in which case you wouldn’t really see a signal even though there could possibly be a change or something to be found there, right?

(answer) Yeah, so there are options to do global normalization as well. Because it’s a bit experimental, you do have to understand exactly what you just said; you have to have that understanding and know which normalization options you need because of it. The pipeline doesn’t do it for you. But there are global normalizations: if you want to, you can literally just normalize against the total read count.

(comment) But even in that scenario, I think it would cancel out. Ironically, the only way you could really detect proper global changes is via spike-ins, even though they’re unreliable for this type of experiment, because they give you some sort of reference point for how much things are changing across your sample groups.

(answer cont.) Yeah, I couldn’t agree more. I don’t pretend that read-count normalization is ideal; there’s a reason why people don’t do read-count normalization: it’s not particularly accurate. And I would agree with you that the spike-in is a far better normalization option when it is an option. But we were just getting so many projects where there were, like, 10 spike-in reads or something, just no alignment whatsoever, so there was no option to do it. It was also giving us some really screwy results even when reads were found, because it relies on you having the same cell counts. I think that was mainly the problem we were getting: just differing amounts of material.

(comment) Yeah, I’m always surprised that it works for this type of experiment; for RNA-seq it’s very different. But here you have lots of background, you have variability in your antibodies, variability in cell count, variability in pull-down. I’m surprised that it even works, to be honest, but yeah.

(answer cont.) I think for the next version we really need to start a community-based project on this: find people who are interested in cut-and-run and try to find a good way of normalizing this data, taking these factors into account, because none of it is a magic bullet. And it really affects the results. You really have to run it with all the different parameters; the number of peaks that you get is completely different depending on how you parameterize the normalization.

29:44 (host) Okay, if there are no other questions from the audience, then I want to close by thanking Chris again for a great talk. I would also like to thank the Chan Zuckerberg Initiative for providing us with funding. For anyone who has further questions later on, you can always reach us on the Slack channels for either cutandrun or bytesize. Thank you very much, everyone.