Join us for our weekly series of short talks: nf-core/bytesize.

Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These are recorded and made available at https://nf-co.re, helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!

This week, Payam Emami (@PayamEmami) will tell us all about the nf-core/metaboigniter pipeline.

nf-core/metaboigniter is a bioinformatics pipeline for pre-processing of mass spectrometry-based metabolomics data.

Video transcription
Note

The content has been edited to make it reader-friendly

0:01 (host) Okay, so let’s go. Hi everyone, Maxime here. Thanks for joining us today for the nf-core bytesize talk focused on pipelines. This week, it’s Payam from the National Bioinformatics Infrastructure Sweden, who is going to present metaboigniter to us. As usual, we are on Zoom and YouTube. If you have any questions, please ask them in the chat. Gisela and I will take care of them at the end. Over to you, Payam.

0:45 Thank you, Maxime, for the introduction. Let’s get started with the presentation. We’re going to talk about metabolomics, which we often define as the measurement of small molecules. These molecules are often between 50 and 1500 daltons, and within this range you will find sugars, lipids, amino acids, hormones, and so forth. One of the important things about the metabolome, or metabolomics, is its closer link to the phenotype. As a result, it has been used in various areas and industries, including healthcare, of course, agriculture, the food industry, and so forth. In this context, we can define untargeted metabolomics as a methodology to detect and measure as many metabolites as possible in a given sample. There are various instruments and methodologies for doing that. The one that we are targeting is called liquid chromatography mass spectrometry or, in short, LC-MS.

2:03 This is a toy protocol describing how it’s done. You start with some sort of sample and do a metabolite extraction, either by liquid-liquid extraction or by precipitation of the proteins. The metabolites are run through a chromatography column, which is coupled to a mass spectrometer. They get ionized, they enter the instrument, their mass-to-charge ratio is measured together with their abundance, and then the data is passed on to the data analysis or pre-processing.

2:38 The different kinds of signals that we get out of a mass spectrometer are called MS1 and MS2. MS1 data is normally used for quantification, and that includes the retention time, which is the time it takes for a metabolite to go through the column before it enters the mass spec. I’m going to show it on the x-axis over here. We have the mass-to-charge ratio, which is measured by the mass detector inside the instrument, and I’ll show it on the y-axis. The measured signal is the relative abundance of the metabolites. I prefer to show it as the intensity of the color, and we often call it intensity or abundance. In different contexts, you might see these plots. These are essentially the same thing; it’s called an intensity map or even a heat map. If you zoom in, you can see the same pattern that I showed you in the previous plot. This one is essentially the distribution of ions, which people plot either over time or over the mass range. But we are going to stick with this simple representation, and one important point I want to make here is that when we measure a molecule, it shows up in the mass spec as several different signals, not a single signal.

4:03 One signal that we get is called the elution profile of the molecule. As I said, for a single mass, for a single molecule, this reflects the time it takes for the molecule to pass through the column. If we have a good, high-resolution instrument, we might also be able to find the isotopes of a single molecule. These are essentially the naturally occurring isotopes, and we often see them at about a one dalton mass shift from the monoisotopic peak of the metabolite down here. Another pattern that we see is called adducts. Depending on the matrix and what kind of laboratory procedure has been done, different elements can bind to our molecules, and they cause a mass shift. We see the same kind of pattern again, and it is important to remember that these are still the same molecule. It’s just that the way we have been preparing the samples affects the measured mass of the molecule.
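The isotope and adduct shifts described above follow simple arithmetic. The sketch below is only an illustration of that arithmetic, not code from metaboigniter: the function names are made up, glucose is an arbitrary example molecule, and the constants are standard monoisotopic values rounded to a few decimals.

```python
# Standard monoisotopic constants in daltons (rounded).
PROTON = 1.007276        # mass of H+ (hydrogen minus one electron)
SODIUM_ION = 22.989221   # mass of Na+ (sodium minus one electron)
C13_C12_DIFF = 1.003355  # mass difference between 13C and 12C


def adduct_mz(neutral_mass: float, adduct_shift: float, charge: int = 1) -> float:
    """m/z observed when an adduct of mass `adduct_shift` binds the neutral molecule."""
    return (neutral_mass + adduct_shift) / charge


def isotope_mz(mono_mz: float, n: int, charge: int = 1) -> float:
    """m/z of the n-th 13C isotope peak, counted from the monoisotopic peak."""
    return mono_mz + n * C13_C12_DIFF / charge


# Example molecule chosen for illustration only: glucose, C6H12O6.
GLUCOSE = 180.06339

protonated = adduct_mz(GLUCOSE, PROTON)      # [M+H]+
sodiated = adduct_mz(GLUCOSE, SODIUM_ION)    # [M+Na]+
first_isotope = isotope_mz(protonated, 1)    # roughly one dalton above [M+H]+

for label, mz in [("[M+H]+", protonated), ("[M+Na]+", sodiated),
                  ("[M+H]+ first isotope", first_isotope)]:
    print(f"{label}: {mz:.4f}")
```

All three values belong to the same molecule, which is why the pre-processing has to group them together instead of reporting three separate compounds.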

5:14 What I often say is that mass spectrometry pre-processing is a world of clustering. We do a lot of clustering in order to group similar ions, the ions that we think are coming from the same molecule. In this case, each box clusters the elution profile of one metabolite, including its isotope pattern. You can also do cross-box clustering, and that’s for adduct detection. Each of these lines, each of these guys along the time axis, is called a mass trace. When we draw a box around them, we cluster the mass traces of the different isotopes of a molecule, and the result is normally called a feature, hence feature detection. This process is normally done per sample.

6:01 Each individual sample is being processed here, but what we often have is multiple samples. We need to do another round of clustering in order to say that this metabolite is the same metabolite in another sample too, right? So we want to link them across different samples. This comes with the problem of chromatographic shift. The time shift can be different between different metabolites across different samples, and we can also have mass deviation. What happens is that some alignment is done between different samples to map them to the same scale, and then we can do the clustering to find the corresponding metabolites. If we fail to do this kind of clustering, like in this example, then we end up having missing values. This is essentially a quick introduction to how this kind of quantification is performed, and at the end of this stage we can extract the m/z of the metabolites and the retention time of the molecules. We still don’t know if they’re all metabolites or not, so we have to identify them. One naive approach is to take the m/z of the metabolite, take it to any of the databases, and search it with some deviation boundary, and this turns out to have very low accuracy. In this case, I just took one m/z and the search gave something like 166 hits. One point I want to make is that this kind of identification is to a large extent a problem of ranking. The higher the true metabolite is ranked, the better, right? What we want to do is incorporate as much orthogonal information as possible in order to improve this ranking, so that our true metabolite comes to the top of the list.
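To see why a mass-only search ranks poorly, here is a minimal sketch of the naive approach: a query mass is matched against a database within a ppm tolerance. The tiny database, the function names, and the 10 ppm default are assumptions made for illustration; real databases hold far more entries.

```python
def ppm_window(mass: float, ppm: float) -> tuple[float, float]:
    """Lower and upper mass bounds for a relative tolerance given in ppm."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def search_by_mass(query_mass: float, database, ppm: float = 10.0) -> list[str]:
    """Return the names of all database entries whose mass falls in the window."""
    low, high = ppm_window(query_mass, ppm)
    return [name for name, mass in database if low <= mass <= high]


# Toy database (illustrative): three sugars share the formula C6H12O6,
# so they have exactly the same monoisotopic mass.
TOY_DATABASE = [
    ("glucose", 180.06339),
    ("fructose", 180.06339),
    ("galactose", 180.06339),
    ("caffeine", 194.08038),
]

hits = search_by_mass(180.0634, TOY_DATABASE)
print(hits)
```

Every isomer is returned with an equally good mass match, so mass alone cannot rank the true metabolite first. That is the gap the orthogonal information is meant to close.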

7:55 One type of orthogonal data that we use is called MS2 data, or fragmentation, or tandem spectra. In this case, the mass spectrometer, depending on the setup, will select a few ions and then break them apart. The molecule will break into different pieces, and then these pieces get measured again by the mass spec. What we’re hoping is that we can do the same thing in silico for the entries in the databases that are available out there, and then try to figure out whether what we have done in silico matches what we have done in the real experiment. There are different methods for doing that: database searches, de novo reconstruction, and the hybrid approach. I guess the most common one is the hybrid approach, where we wrap the data into some form, in this case a tree, then score the trees together in some model and try to infer the metabolite. I won’t go too much into the detail.

9:02 Our workflow, metaboigniter, tries to automate the steps that I was talking about. What we provide is obviously the quantification. We provide parameter tuning, two different quantification packages that the user can choose and combine, and three plus one identification engines that I will go through. We provide QC and noise removal, and the whole thing can be done on positive and negative ionization mode, either both combined or each on its own. For the quantification part, we start with the raw data, that is, the mzML files converted by the user. We do an optional centroiding, or peak picking. We have an optional parameter tuning step that automatically tunes the rest of the parameters over here, and the parameters from here will be propagated throughout the workflow. This tuning can be done on a single sample, on a collection of samples, for example if you have QC samples, or on the whole cohort. That will be followed by feature detection or mass trace detection, either by XCMS or OpenMS, and then retention time correction and grouping.

10:17 The result of this will pass through a noise removal step. That includes the blank filtering, which filters out signals that are present in the blank with the same magnitude as in the sample. We have the QC filtering, which filters out signals that are not stable over repeated samples. Then finally there is the dilution filtering, which filters out signals that do not follow a dilution series, if you have run one in your experiment. The result will be passed to our famous CAMERA for adduct detection and isotope detection, in which different features are linked together across time and the isotopes and adducts are detected. At this stage, if the user has selected not to do the MS2 part, the workflow will finish. We do some transformation and normalization, depending on the user’s choice, and then we provide tabular output.
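The three noise filters can be sketched for a single feature as follows. This is a simplified illustration under stated assumptions, not the metaboigniter implementation: the function names, the fold-change, coefficient-of-variation and correlation criteria, and all threshold values are hypothetical choices.

```python
from statistics import mean, stdev


def passes_blank_filter(sample_intensities, blank_intensities, min_fold=2.0):
    """Keep a feature only if it is clearly more intense in samples than in blanks."""
    blank_mean = mean(blank_intensities)
    if blank_mean == 0:
        return True  # never seen in the blank
    return mean(sample_intensities) / blank_mean >= min_fold


def passes_qc_filter(qc_intensities, max_cv=0.3):
    """Keep a feature only if it is stable across repeated QC injections."""
    qc_mean = mean(qc_intensities)
    if qc_mean == 0:
        return False
    return stdev(qc_intensities) / qc_mean <= max_cv


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def passes_dilution_filter(concentrations, intensities, min_r=0.8):
    """Keep a feature only if its intensity tracks the dilution series."""
    return pearson(concentrations, intensities) >= min_r


# A feature that is mostly background: nearly as intense in blanks as in samples.
print(passes_blank_filter([1000, 1200], [900, 1100]))
# A feature that is stable across QC injections.
print(passes_qc_filter([100, 105, 95]))
# A feature whose intensity rises with concentration.
print(passes_dilution_filter([1, 2, 4, 8], [100, 210, 390, 820]))
```

A feature would be kept only if it passes every filter that applies to the experimental design.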

11:18 However, if the user selected to do the identification, we keep the result of feature detection from the quantification part, fetch it here, and read the MS2 data. We do an optional centroiding of the MS2 data, and we do a mapping. In this mapping we try to figure out which MS2 spectra correspond to which mass traces. Essentially we are mapping the MS2 data on top of the MS1 data in order to be able to say, hey, these fragment ions originated from this specific feature. This will allow us to do some MS1-driven clustering, and then we do what we call a hyper-MS2 construction.
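A minimal sketch of the mapping idea: an MS2 spectrum is assigned to a feature when its precursor m/z matches the feature m/z within a tolerance and its retention time falls inside the feature’s elution window. The data classes, field names, and the 10 ppm tolerance are assumptions for illustration and do not reflect the pipeline’s actual data model.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    feature_id: str
    mz: float
    rt_start: float  # seconds
    rt_end: float    # seconds


@dataclass
class MS2Spectrum:
    scan_id: str
    precursor_mz: float
    rt: float        # seconds


def map_ms2_to_features(features, spectra, ppm=10.0):
    """Return {feature_id: [scan_id, ...]} for spectra matching in m/z and time."""
    mapping = {feature.feature_id: [] for feature in features}
    for spectrum in spectra:
        for feature in features:
            tolerance = feature.mz * ppm / 1e6
            mz_match = abs(spectrum.precursor_mz - feature.mz) <= tolerance
            rt_match = feature.rt_start <= spectrum.rt <= feature.rt_end
            if mz_match and rt_match:
                mapping[feature.feature_id].append(spectrum.scan_id)
    return mapping


features = [
    Feature("F1", 181.0707, 60.0, 75.0),
    Feature("F2", 195.0877, 120.0, 140.0),
]
spectra = [
    MS2Spectrum("scan_10", 181.0708, 65.0),   # matches F1
    MS2Spectrum("scan_11", 181.0706, 70.0),   # matches F1
    MS2Spectrum("scan_20", 195.0876, 200.0),  # right mass, wrong time
]

print(map_ms2_to_features(features, spectra))
```

Spectra that land on the same feature are the natural input for the aggregation step that follows.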

12:16 That hyper-MS2 construction will aggregate all the MS2 spectra that we think are coming from a single ion. Then various pre-processing steps can be performed on this hyper-MS2, including clustering, smoothing, or centroiding. This can also be skipped, but we are hoping that by doing it we reduce the number of searches required, and that we also capture different aspects of the fragmentation. At this stage, we fetch the adduct information again from the quantification part of the workflow, and then we feed this information together with the MS2 to our mass calculator, which will estimate the neutral mass of the metabolite. That is what we send to the different search engines. MetFrag, CSI:FingerID, and CFM-ID are supported, and they can be combined or each run on its own. That will be followed by the posterior probability estimation, which brings the scores onto the same scale, and the tabular output.
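One plausible way to picture the aggregation is to pool the peaks of all MS2 spectra assigned to one ion and merge peaks that lie close together in m/z. The sketch below does exactly that and nothing more; the function name, the absolute tolerance, and the merging rule (intensity-weighted m/z, summed intensity) are assumptions, and the talk does not specify how metaboigniter builds its hyper-MS2.

```python
def merge_spectra(spectra, mz_tol=0.01):
    """Pool (mz, intensity) peaks from several spectra and merge close neighbours.

    Peaks are sorted by m/z; a peak within `mz_tol` of the previously merged
    peak is folded into it using an intensity-weighted m/z and summed intensity.
    Intensities are assumed to be positive.
    """
    pooled = sorted(peak for spectrum in spectra for peak in spectrum)
    merged = []
    for mz, intensity in pooled:
        if merged and mz - merged[-1][0] <= mz_tol:
            prev_mz, prev_intensity = merged[-1]
            total = prev_intensity + intensity
            weighted_mz = (prev_mz * prev_intensity + mz * intensity) / total
            merged[-1] = (weighted_mz, total)
        else:
            merged.append((mz, intensity))
    return merged


# Two spectra believed to come from the same ion (made-up peaks).
spectrum_a = [(100.000, 10.0), (150.000, 5.0)]
spectrum_b = [(100.004, 30.0), (200.000, 1.0)]

for mz, intensity in merge_spectra([spectrum_a, spectrum_b]):
    print(f"{mz:.4f}\t{intensity:.1f}")
```

Searching one merged spectrum per ion instead of every individual spectrum is what reduces the number of searches.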

13:24 Another type of identification that we support is what we call library identification, or in-house library identification. Imagine that different labs have different purified metabolites, and the idea is that we want to say whether such a metabolite is present in our actual biological sample or not. What we want is to extract experimental information for them. We know their theoretical mass, and we want to extract the retention time for the chromatography setup that we have. We also want to find MS2 spectra for these compounds. In the laboratory procedure, these metabolites are split into different buffers, in such a way that in each buffer we don’t have metabolites with overlapping masses. The idea is that we can later go and find these metabolites and obtain their retention time, mass, and MS2. This comes with benefits such as a tailored, very fine-tuned fragmentation pattern, and we can use the retention time for matching to the actual biological sample.

14:31 Our metaboigniter can do both, characterization and searching of an internal library. What we do here (I’m talking only about the library samples, not the quantification samples) is that we get the library files and do an optional centroiding and feature detection, exactly like we did for the real biological samples. The same goes for the MS2 data: we have an optional centroiding and a mapping step that is not optional. At this stage, we get a list of theoretical masses from the user and the list of samples for which the user knows what metabolites are in there. We try to estimate the retention time boundary of each of the metabolites in each of the samples, and then map it to the information that the user has. That is followed by the previously mentioned hyper-MS2 construction, and at this stage we fetch the results of the quantification and the identification. We have an internal search engine that tries to match the retention time plus the MS2 data and the MS1 data, and tries to find whether these metabolites are present in the sample or not. That will be followed by the posterior probability estimation and the actual output. That’s essentially what metaboigniter does under the hood.

15:54 Obviously, there are about 500 parameters that need to be set, so I don’t want to show the whole command. But we wanted to show you how to use the nf-core interface for running the pipeline, which gives you this nice JSON file that you can input and run. The main inputs are the mzML files, obviously, and you need a path referring to the mzML files in positive mode, negative mode, or either of the ionization modes. The workflow also accepts a phenotype file, which is essentially a CSV file or a table describing what kind of MS1 data you’re inputting into the workflow. You have to put in the names of the files and the class of the samples: you might have biological samples, blank samples, dilution series, QCs, and so forth. You can define various other things: whether you want to remove samples, and if you want to rename the samples later in the output, you can mention it there. If you have technical replicates, we can average them at a later stage. You can also put in any covariates that you want us to link to our output later.
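To make the idea of the phenotype table concrete, here is a small sketch that reads such a table and groups files by sample class and by replicate label. The column names (`file_name`, `class`, `rename`, `technical_replicate`) and the class labels are hypothetical, invented for this example; the real column layout expected by metaboigniter is defined in its documentation and is not given in the talk.

```python
import csv
import io
from collections import defaultdict

# Hypothetical phenotype table; column names and labels are illustrative only.
PHENOTYPE_CSV = """file_name,class,rename,technical_replicate
s01.mzML,Sample,patient_A,1
s02.mzML,Sample,patient_A,2
b01.mzML,Blank,blank_1,1
q01.mzML,QC,qc_1,1
d01.mzML,Dilution,dilution_1,1
"""


def files_by_class(csv_text):
    """Group file names by the sample class declared in the table."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["class"]].append(row["file_name"])
    return dict(groups)


def replicate_groups(csv_text):
    """Group files sharing a 'rename' label, e.g. technical replicates to average."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["rename"]].append(row["file_name"])
    return dict(groups)


print(files_by_class(PHENOTYPE_CSV))
print(replicate_groups(PHENOTYPE_CSV))
```

The class column is what lets later steps know which files are blanks, QCs, or dilution series, and the shared label is what would allow technical replicates to be averaged.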

17:13 We provide three different outputs, and all of them are tabular. We have a peak matrix, which is essentially the abundance matrix. The variables are on the rows and the samples are on the columns. This is the raw data, but we can also do a transformation. Missing values are shown as NA, that is, not available. The metadata file is essentially what the user has been inputting into the workflow, just reformatted and reordered to match the peak matrix. The variable information is probably the most important output of the workflow, and it includes all the information that we have been extracting for each of the features in the data. That includes, for example, the monoisotopic mass, the different IDs for the metabolites, their names if we have been able to identify them, whether they have been associated with any MS2, the MS2 itself, and so forth. What I have captured here is only a fraction of it. It provides many, many different types of information.

18:31 At this stage, we are not doing the downstream analysis, and we think that should be done by the biostatisticians. However, the outputs that we are providing are fully compatible with almost all the tools that have been developed by the Workflow4Metabolomics community. That includes very specific statistical tools, normal PCA, PLS, network analysis, clustering, and so forth. You should be able to use this output out of the box with minimal modification.

19:12 Now for the things on the table that we have been developing and that, I assume, will be part of the workflow. Right now we are only supporting mzML, but the conversion from the raw format is part of the plan. We are supporting four search engines, and the idea is that we want to give you an aggregated result of the search engines. Right now we are giving you four results, but the idea is that we do a consensus ranking of these IDs. We know that recently people came up with fantastic post-processing methods for metabolomics identification, mainly for CSI:FingerID, and that’s going to be part of this. We are supporting right now, I think, two different alignment, retention time correction, and grouping methods, but we are going to add more. Metabolite class detection is already part of the workflow. The parameter is not exposed to the user yet, but it will be, and it will be part of the output. We are also in the process of migrating to DSL2. It should not affect the user experience that much, but from the developer side it’s a big deal.

20:19 I think with this I want to thank all the people. I won’t name people, but a lot of great persons have been part of this workflow over the past, I think, seven years or something. I thank them all. This work was a spin-off from the PhenoMeNal infrastructure, and it is supported by ELIXIR and the National Bioinformatics Infrastructure Sweden. A big thank-you also to the nf-core community. These guys are absolutely amazing, they just helped without any expectations. Thank you very much for listening.

20:56 (host) Thanks, Payam, for such a great presentation, that was amazing. I’d like to thank the Chan Zuckerberg Initiative for giving us the opportunity to do this series, and now let’s see if anyone has any questions. I don’t see any questions in the chat at the moment. Let me just check if we have any on YouTube as well. I don’t see any questions over there either. It must have been a super clear presentation. I wasn’t very familiar with this field before, but I felt like I understood most of it. For me, it was pretty clear, at least.

(speaker) That’s great. I know that the amount of information coming at you is probably a lot. But if there are any questions, we are always ready.

(host) Yes. Otherwise, people know that they can still ask you any question on either the metaboigniter Slack channel or on the bytesize channel. I think we’re good.