Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

  1. Parameter check: Several parameters are validated to ensure that the pipeline can run successfully.
  2. RUN_CV subworkflow: Finds the optimal hyperparameters for each model in a cross-validation setting.
    • Load response: The response data is loaded.
    • CV split: The response data is split into cross-validation folds.
    • Make model channel: From the input baseline and model names, channels are created. This step is necessary because for the Single-Drug Models, one model has to be created per drug.
    • HPAM split: One YAML file is created per model and hyperparameter combination to be tested.
    • Train and predict CV: All models are trained and evaluated in a cross-validation setting.
    • Evaluate and find max: For each CV split, the best hyperparameters are determined using a grid search per model.
  3. MODEL_TESTING subworkflow: The best hyperparameters are used to train the models on the full training set and predict the test set. Optionally, randomization and robustness tests are performed.
    • Predict full: The model is trained on the full training set (train & validation) with the best hyperparameters to predict the test set.
    • Randomization split: Makes a channel per randomization to be tested.
    • Randomization test: If randomization tests are enabled, the model is trained on the full training set with the best hyperparameters to predict the randomized test set.
    • Robustness test: If robustness tests are enabled, the model is trained N times on the full training set with the best hyperparameters to predict the test set.
    • Consolidate results: The results of the model testing are consolidated into a single table for each model.
    • Evaluate final: The performance of the models is calculated on the test set results.
    • Collect results: The results of the evaluation metrics per model are collected into four overview tables.
  4. VISUALIZATION subworkflow: Plots are created summarizing the results.
    • Critical difference plot: A critical difference plot is created to compare the performance of the models.
    • Violin plot: A violin plot is created to compare the performance of the models over the CV folds.
    • Heatmap: A heatmap is created to compare the average performance of the models over the CV folds.
    • Correlation comparison: Renders a plot in which the per-drug/per-cell line correlations between y_true and y_predicted are compared between different models.
    • Regression plots: Plots in which the y_true and y_predicted values are compared between different models.
    • Save tables: Saves the performance metrics of the models in a table.
    • Write html: Writes the plots to an HTML file per setting (LPO/LCO/LDO).
    • Write index: Writes an index.html file that links to all the HTML files.
  5. Pipeline information - Report metrics generated during the workflow execution

Parameter check

The process PARAMS_CHECK performs the following checks:

  • --models / --baselines: Check if the model and baseline names are valid (for valid names, see the usage page).
  • --test_mode: Check whether the test mode is LPO, LCO, LDO or a combination of these.
  • --dataset_name: Check if the dataset name is valid, i.e., GDSC1, GDSC2, or CCLE.
  • --cross_study_datasets: If supplied, check if the datasets are valid, i.e., GDSC1, GDSC2, or CCLE or a combination of these.
  • --n_cv_splits: Check if the number of cross-validation splits is a positive integer > 1.
  • --randomization_mode: If supplied, checks if the randomization is SVCC, SVCD, SVRC, SVRD, or a combination of these.
  • --randomization_type: If supplied, checks if the randomization type is valid, i.e., permutation or invariant.
  • --n_trials_robustness: Checks if the number of trials for robustness tests is >= 0.
  • --optim_metric: Checks if the optimization metric is either MSE, RMSE, MAE, R^2, Pearson, Spearman, Kendall, or Partial_Correlation.
  • --response_transformation: If supplied, checks whether the response transformation is either standard, minmax, or robust.

It emits the path to the data, mainly so that the other processes wait for PARAMS_CHECK to finish before starting.
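
Conceptually, each check validates a parameter against a fixed set of allowed values. The sketch below illustrates this kind of validation in Python; the function and variable names are hypothetical and only the allowed-value sets are taken from the list above, so this is not the pipeline's actual code.

```python
# Illustrative parameter validation; allowed values follow the list above,
# but this function is a hypothetical sketch, not the PARAMS_CHECK process itself.
VALID_TEST_MODES = {"LPO", "LCO", "LDO"}
VALID_DATASETS = {"GDSC1", "GDSC2", "CCLE"}


def check_params(test_mode: str, dataset_name: str, n_cv_splits: int) -> None:
    """Raise a ValueError if a parameter falls outside its allowed range."""
    modes = set(test_mode.split(","))
    if not modes <= VALID_TEST_MODES:
        raise ValueError(f"--test_mode must be a combination of {VALID_TEST_MODES}, got {test_mode}")
    if dataset_name not in VALID_DATASETS:
        raise ValueError(f"--dataset_name must be one of {VALID_DATASETS}, got {dataset_name}")
    if n_cv_splits <= 1:
        raise ValueError(f"--n_cv_splits must be an integer > 1, got {n_cv_splits}")


check_params(test_mode="LPO,LCO", dataset_name="GDSC2", n_cv_splits=5)
```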

Subworkflow RUN_CV

Load response

The response data is loaded into the pipeline. This step is necessary to provide the pipeline with the response data that will be used to train and evaluate the models.

Optional output files if --save_datasets is set
  • response_dataset.pkl: The response data is saved as a pickle file.
  • cross_study_*.pkl: The response data for the cross-study datasets is saved as a pickle file.
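
The pickle files listed above can be inspected directly with Python. A minimal sketch, assuming only the standard pickle protocol (the type of the deserialized object depends on the pipeline's underlying library and is not specified here):

```python
import pickle

# Load the saved response dataset; we only inspect it generically because the
# concrete object type inside the file is not documented here.
with open("response_dataset.pkl", "rb") as handle:
    response = pickle.load(handle)

print(type(response))
print(response)
```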

CV split

The response data is split into as many cross-validation folds as specified via the --n_cv_splits parameter. The data is split into training, validation, and test sets for each fold. For models using early stopping, the early stopping dataset is split off from the validation set. This ensures that all models are trained and evaluated on the same splits.

Optional output file if --save_datasets is set
  • split*.pkl: The response data belonging to each fold is saved as a pickle file.
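
Conceptually, this corresponds to a k-fold split with a validation portion held out from each training fold (and, for early-stopping models, a further early-stopping portion). The scikit-learn sketch below only illustrates the idea; the pipeline's own splitting logic may differ, for example by grouping on cell lines or drugs depending on the test mode.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

n_cv_splits = 5               # corresponds to --n_cv_splits
samples = np.arange(100)      # stand-in for the response records

kfold = KFold(n_splits=n_cv_splits, shuffle=True, random_state=42)
for fold, (train_val_idx, test_idx) in enumerate(kfold.split(samples)):
    # Hold out a validation set from the training portion of each fold;
    # models with early stopping would split this validation set further.
    train_idx, val_idx = train_test_split(train_val_idx, test_size=0.2, random_state=42)
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} validation, {len(test_idx)} test")
```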

Make model channel

From the input baseline and model names, channels are created. This step is necessary because for the Single-Drug Models, one model has to be created per drug. The model name then becomes the name of the model and the drug, separated by a dot, e.g., MOLIR.Drug1. All of these models can be trained in parallel, which is why each becomes an individual element in the channel.
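
As an illustration of this expansion, the sketch below derives one entry per drug for each single-drug model; the drug names are placeholders, not values used by the pipeline.

```python
# Hypothetical expansion of single-drug models into one channel entry per drug.
single_drug_models = ["MOLIR", "SuperFELTR"]
drugs = ["Drug1", "Drug2", "Drug3"]  # placeholder drug names

channel_entries = [f"{model}.{drug}" for model in single_drug_models for drug in drugs]
print(channel_entries)  # ['MOLIR.Drug1', 'MOLIR.Drug2', ..., 'SuperFELTR.Drug3']
```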

Hyperparameter split

One YAML file is created per model and hyperparameter combination to be tested. This ensures that all hyperparameter combinations can be tested in parallel.
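
A minimal sketch of writing one YAML file per hyperparameter combination; the hyperparameter grid and the file naming scheme are made up for illustration.

```python
import itertools

import yaml  # requires PyYAML

# Hypothetical hyperparameter grid; the real grids are model-specific.
grid = {"learning_rate": [1e-3, 1e-4], "n_units": [64, 128]}

keys, values = zip(*grid.items())
for i, combination in enumerate(itertools.product(*values)):
    hpams = dict(zip(keys, combination))
    with open(f"hpam_{i}.yaml", "w") as handle:
        yaml.safe_dump(hpams, handle)  # one file per combination -> parallel testing
```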

Train and predict CV

A model is trained in the specified test mode on a given cross-validation split with a given hyperparameter combination.

As soon as GPU support is available, the training and prediction will be done on the GPU for the models SimpleNeuralNetwork, MultiOmicsNeuralNetwork, MOLIR, SuperFELTR, and DIPK.

Evaluate and find max

Over all hyperparameter combinations, the best hyperparameters for a specific cross-validation split are determined. The best hyperparameters are determined based on the optimization metric specified via --optim_metric.
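
Conceptually, this is the selection step of a grid search: per CV split, the hyperparameter combination with the best validation score on --optim_metric wins. The sketch below uses made-up result records; the only grounded detail is that error metrics such as MSE are minimized while correlation metrics are maximized.

```python
# Hypothetical validation results for one CV split, one entry per hyperparameter combination.
results = [
    {"hpams": {"learning_rate": 1e-3}, "MSE": 1.20, "Pearson": 0.55},
    {"hpams": {"learning_rate": 1e-4}, "MSE": 0.95, "Pearson": 0.61},
]

optim_metric = "MSE"  # corresponds to --optim_metric
minimize = optim_metric in {"MSE", "RMSE", "MAE"}  # error metrics down, correlations up

select = min if minimize else max
best = select(results, key=lambda record: record[optim_metric])
print("best hyperparameters:", best["hpams"])
```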

Subworkflow MODEL_TESTING

Predict full

The model is trained on the full training set (train & validation) per split with the best hyperparameters to predict the test set of the CV split. If specified via --cross_study_datasets, the cross-study datasets are also predicted.

Output files
  • **predictions*.csv: CSV file with the predicted response values for the test set.
  • **cross_study/cross_study*.csv: CSV file with the predicted response values for the cross-study datasets.
  • **best_hpams*.json: JSON file with the best hyperparameters for the model.
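
These outputs can be read back with standard tooling. A minimal sketch with placeholder file names (the real files follow the glob patterns above, and their column names are not specified here):

```python
import json

import pandas as pd

# Placeholder paths; substitute the actual files matching the patterns above.
predictions = pd.read_csv("predictions_split_0.csv")
print(predictions.head())

with open("best_hpams_split_0.json") as handle:
    best_hpams = json.load(handle)
print(best_hpams)
```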

Randomization split

Takes the --randomization_mode as input and creates a channel for each randomization to be tested. This ensures that all randomizations can be tested in parallel.

Randomization test

Trains the model on the randomized training + validation set with the best hyperparameters to predict the unperturbed test set of the specified CV split. How the data is randomized is determined by the --randomization_type.

As soon as GPU support is available, the training and prediction will be done on the GPU for the models SimpleNeuralNetwork, MultiOmicsNeuralNetwork, MOLIR, SuperFELTR, and DIPK.

Output files
  • **randomization*.csv: CSV file with the predicted response values for the randomization test.
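
As a rough illustration of the permutation idea behind --randomization_type permutation (not the pipeline's actual randomization code), the sketch below shuffles the rows of one feature view of the training data while leaving the responses and the test set untouched.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical training features (one view, e.g. gene expression) and responses.
features = rng.normal(size=(10, 5))
responses = rng.normal(size=10)

# Permutation randomization: shuffle the feature rows so that the association
# between features and responses is destroyed, then retrain on the shuffled data
# and predict the unperturbed test set.
permuted_features = features[rng.permutation(len(features))]
```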

Robustness test

Trains the model --n_trials_robustness times on the full training set with the best hyperparameters to predict the test set of the specific CV split.

As soon as GPU support is available, the training and prediction will be done on the GPU for the models SimpleNeuralNetwork, MultiOmicsNeuralNetwork, MOLIR, SuperFELTR, and DIPK.

Output files
  • **robustness*.csv: CSV file with the predicted response values for the robustness test.

Consolidate results

For Single-Drug Models, the results of the model testing are consolidated such that their results look like the results of the Multi-Drug Models.

Output files
  • **predictions*.csv: CSV file with the consolidated predicted response values for the test set.
  • **cross_study/cross_study*.csv: CSV file with the consolidated predicted response values for the cross-study datasets.
  • **randomization*.csv: CSV file with the consolidated predicted response values for the randomization test.
  • **robustness*.csv: CSV file with the consolidated predicted response values for the robustness test.
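
A minimal sketch of what such a consolidation could look like with pandas; the per-drug directory layout and glob pattern are assumptions for illustration only.

```python
import glob

import pandas as pd

# Hypothetical per-drug prediction files produced by a single-drug model.
per_drug_files = sorted(glob.glob("MOLIR.*/predictions*.csv"))

# Stack the per-drug tables into one table, mirroring the Multi-Drug Model layout.
consolidated = pd.concat([pd.read_csv(path) for path in per_drug_files], ignore_index=True)
consolidated.to_csv("predictions_consolidated.csv", index=False)
```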

Evaluate final

Calculates various performance metrics on the given test set results, including RMSE, MSE, MAE, R^2, Pearson Correlation, Spearman Correlation, Kendall Correlation, and Partial Correlation.
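
A minimal sketch of how these metrics can be computed from true and predicted values with scipy and scikit-learn; this illustrates the metric definitions, not the pipeline's own evaluation code, and partial correlation is omitted because it requires covariates.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mse = mean_squared_error(y_true, y_pred)
pearson, _ = pearsonr(y_true, y_pred)
spearman, _ = spearmanr(y_true, y_pred)
kendall, _ = kendalltau(y_true, y_pred)

print("MSE:", mse, "RMSE:", np.sqrt(mse), "MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
print("Pearson:", pearson, "Spearman:", spearman, "Kendall:", kendall)
```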

Collect results

Collapses the results from above into four overview tables: evaluation_results.csv, evaluation_results_per_drug.csv, evaluation_results_per_cell_line.csv, and true_vs_pred.csv.

Output files
  • evaluation_results.csv: Overall performance metrics. One value per model per CV fold and setting (LPO/LCO/LDO, full predictions, randomizations, robustness, cross-study predictions).
  • evaluation_results_per_drug.csv: Performance metrics calculated per drug.
  • evaluation_results_per_cell_line.csv: Performance metrics calculated per cell line.
  • true_vs_pred.csv: true vs predicted values for each model.

Subworkflow VISUALIZATION

Critical difference

The critical difference plot shows whether one model performs significantly better than another, based on the models' average ranks over all CV folds.

Output files
  • critical_difference*.svg: SVG file with the critical difference plot.

Violin plot

The violin plot shows the distribution of the performance metrics over the CV folds. This plot is rendered once overall for all real predictions and once per algorithm to compare the real predictions against, e.g., the randomization results.

Output files
  • violin*.html: HTML file with the violin plot.

Heatmap

The heatmap shows the average performance of the models over the CV folds. This plot is rendered overall for all real predictions and once per algorithm to compare the real predictions against, e.g., the randomization results.

Output files
  • heatmap*.html: HTML file with the heatmap.

Correlation comparison

Renders a plot in which the per-drug/per-cell line correlations between y_true and y_predicted are compared between different models.

Output files
  • corr_comp_scatter*.html: HTML file with the correlation comparison plot.
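
A minimal sketch of how per-drug correlations could be derived from a true-vs-predicted table and laid out with one column per model; the column names are assumptions and may differ from the actual headers of true_vs_pred.csv.

```python
import pandas as pd

# Hypothetical long-format table; real column names in true_vs_pred.csv may differ.
df = pd.DataFrame({
    "model": ["A", "A", "A", "B", "B", "B"],
    "drug": ["D1", "D1", "D1", "D1", "D1", "D1"],
    "y_true": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y_pred": [1.2, 1.8, 3.1, 0.9, 2.5, 2.7],
})

per_drug_corr = (
    df.groupby(["model", "drug"])[["y_true", "y_pred"]]
    .apply(lambda group: group["y_true"].corr(group["y_pred"]))
    .unstack("model")  # one column per model, ready for a scatter comparison
)
print(per_drug_corr)
```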

Regression plots

Plots in which the y_true and y_predicted values are compared between different models.

Output files
  • regression_lines*.html: HTML file with the regression plots.

Save tables

Saves the performance metrics of the models in an html table.

Output files
  • table*.html: HTML file with the performance metrics table.

Write html

Creates a summary HTML file per setting (LPO/LCO/LDO) that contains all the plots and tables.

Output files
  • {LPO,LCO,LDO}.html: Summary HTML file for the corresponding setting, containing all the plots and tables.

Write index

Writes an index.html file that links to all the HTML files.

Output files
  • index.html: Index page that links to all the HTML files.
  • *.png: PNG files for the logo, etc.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.