nf-core/scflow
Please consider using/contributing to https://github.com/nf-core/scdownstream
Define where the pipeline should find input data and save output data.
The .tsv file specifying sample matrix filepaths.
string
./refs/Manifest.txt
The .tsv file specifying sample metadata.
string
./refs/SampleSheet.tsv
Optional tsv file containing mappings between ensembl_gene_ids and gene_names.
string
https://raw.githubusercontent.com/nf-core/test-datasets/scflow/assets/ensembl_mappings.tsv
Cell-type annotations reference file path
string
https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/28033407/ctd_v1.zip
This is a zip file containing cell-type annotation reference files for the EWCE package.
Optional tsv file specifying manual revisions of cell-type annotations.
string
./conf/celltype_mappings.tsv
Optional list of genes of interest in YML format for plotting of gene expression.
string
./conf/reddim_genes.yml
Input sample species.
string
human
Currently, "human" and "mouse" are supported.
Outputs directory.
string
./results
Parameters for quality-control and thresholding.
The sample sheet column name with unique sample identifiers.
string
manifest
The sample sheet variables to treat as factors.
string
seqdate
All sample sheet columns with numbers which should be treated as factors should be specified here separated by commas. Examples include columns with dates, numeric sample identifiers, etc.
Minimum library size (counts) per cell.
integer
250
Maximum library size (counts) per cell.
string
adaptive
Minimum features (expressive genes) per cell.
integer
100
Maximum features (expressive genes) per cell.
string
adaptive
Minimum proportion of counts mapping to ribosomal genes.
number
Maximum proportion of counts mapping to ribosomal genes.
number
1
Maximum proportion of counts mapping to mitochondrial genes.
string
adaptive
Minimum counts for gene expressivity.
integer
2
Expressive genes must have >=min_counts in >=min_cells
Minimum cells for gene expressivity.
integer
2
Expressive genes must have >=min_counts in >=min_cells
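The expressivity rule above (a gene needs at least min_counts in at least min_cells cells) can be sketched as follows. This is a minimal illustration and not the pipeline's own code: the function name, the dictionary layout and the toy gene names are all hypothetical.

```python
def expressive_genes(counts, min_counts=2, min_cells=2):
    """Return the genes that have >= min_counts in >= min_cells cells.

    counts maps a gene name to its list of per-cell counts
    (a hypothetical layout chosen only for this sketch).
    """
    keep = []
    for gene, per_cell in counts.items():
        # Number of cells in which this gene reaches the count threshold.
        n_cells = sum(1 for c in per_cell if c >= min_counts)
        if n_cells >= min_cells:
            keep.append(gene)
    return keep


toy = {
    "GENE_A": [0, 2, 5, 0],  # two cells reach 2 counts: kept
    "GENE_B": [1, 1, 1, 1],  # no cell reaches 2 counts: dropped
    "GENE_C": [9, 0, 0, 0],  # only one cell reaches 2 counts: dropped
}
print(expressive_genes(toy))
```

Note that a gene with a high count in a single cell (GENE_C) is still dropped, because both conditions must hold.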
Option to drop unmapped genes.
string
True
Option to drop mitochondrial genes.
string
True
Option to drop ribosomal genes.
string
false
The number of MADs for outlier detection.
number
4
The number of median absolute deviations (MADs) used to define outliers for adaptive thresholding.
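To illustrate how a MAD-based "adaptive" upper threshold might be derived, here is a minimal sketch. It assumes the threshold is median + nmads × MAD using the raw (unscaled) MAD. The pipeline's exact computation is not stated in this document and may differ, for example by applying a scaling constant to the MAD. The function name and the toy values are hypothetical.

```python
from statistics import median


def adaptive_upper_threshold(values, nmads=4.0):
    """Upper outlier cut-off as median + nmads * MAD (unscaled MAD; an assumption)."""
    med = median(values)
    # Median absolute deviation from the median.
    mad = median([abs(v - med) for v in values])
    return med + nmads * mad


library_sizes = [900, 1000, 1100, 1050, 950, 20000]
threshold = adaptive_upper_threshold(library_sizes, nmads=4)
outliers = [v for v in library_sizes if v > threshold]
print(threshold, outliers)
```

Because both the median and the MAD are robust statistics, the single extreme value barely moves the threshold, which is why it is still flagged.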
Options for profiling ambient RNA/empty droplets.
Enable ambient RNA / empty droplet profiling.
string
true
Upper UMI counts threshold for true cell annotation.
string
auto
A numeric scalar specifying the threshold for the total UMI count above which all barcodes are assumed to contain cells, or "auto" for automated estimation based on the data.
This parameter must be either a numeric value or "auto".
Lower UMI counts threshold for empty droplet annotation.
integer
100
A numeric scalar specifying the lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets.
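The roles of the two UMI thresholds described above can be sketched as a three-way split of barcodes. This is only an illustration of the thresholds as worded here, not the emptyDrops implementation. The function name and labels are hypothetical, and `retain=None` stands in for the "auto" case, where the upper threshold would be estimated from the data elsewhere.

```python
def classify_barcode(total_umi, lower=100, retain=None):
    """Label a barcode from its total UMI count.

    "empty": at or below `lower`, assumed to be an empty droplet.
    "cell":  above `retain`, assumed to contain a cell.
    "test":  in between, left to the statistical test.
    """
    if total_umi <= lower:
        return "empty"
    if retain is not None and total_umi > retain:
        return "cell"
    return "test"


for umi in (50, 100, 101, 5000):
    print(umi, classify_barcode(umi, lower=100, retain=1000))
```

Only barcodes in the "test" band need a p-value, which is where the FDR and Monte Carlo iteration settings apply.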
The maximum FDR for the emptyDrops algorithm.
number
0.001
Number of Monte Carlo p-value iterations.
integer
10000
An integer scalar specifying the number of iterations to use for the Monte Carlo p-value calculations for the emptyDrops algorithm.
Expected number of cells per sample.
integer
3000
If the "retain" parameter is set to "auto" (recommended), then this parameter is used to identify the optimal value for "retain" for the emptyDrops algorithm.
Parameters for identifying singlets/doublets/multiplets.
Enable doublet/multiplet identification.
string
true
Algorithm to use for doublet/multiplet identification.
string
doubletfinder
Variables to regress out for dimensionality reduction.
string
nCount_RNA,pc_mito
Number of PCA dimensions to use.
integer
10
The top n most variable features to use.
integer
2000
A fixed doublet rate.
number
Use a fixed default rate (e.g. 0.075 to specify that 7.5% of all cells should be marked as doublets), or set to 0 to use the "dpk" method (recommended).
Doublets per thousand cells increment.
integer
8
The doublets per thousand cells (dpk) increment specifies the expected doublet rate based on the number of cells, i.e. with a dpk of 8 (recommended by 10X), a dataset with 1000 cells is expected to contain 8 doublets per thousand cells, a dataset with 2000 cells is expected to contain 16 doublets per thousand cells, and a dataset with 10000 cells is expected to contain 80 doublets per thousand cells (or 800 doublets in total). If the "doublet_rate" parameter is manually specified, this recommended incremental behaviour is overridden.
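The dpk arithmetic above can be reproduced with a short sketch. The function name is hypothetical. It assumes, as the text describes, that a non-zero fixed doublet rate takes precedence over the dpk-based rate.

```python
def expected_doublet_rate(n_cells, dpk=8, doublet_rate=0.0):
    """Expected fraction of doublets for a dataset of n_cells cells."""
    if doublet_rate > 0:
        # A manually specified fixed rate overrides the dpk method.
        return doublet_rate
    # dpk doublets per thousand cells, for each thousand cells present.
    return (dpk / 1000) * (n_cells / 1000)


for n in (1000, 2000, 10000):
    rate = expected_doublet_rate(n)
    print(n, round(rate, 4), round(rate * n))
```

The rate scales linearly with the cell count, so the expected number of doublets grows with its square: 8 for 1000 cells but 800 for 10000 cells.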
Specify a pK value instead of parameter sweep.
number
0.02
The optimal pK value used by the doubletFinder algorithm is determined following a compute-intensive parameter sweep. The parameter sweep can be overridden by manually specifying a pK value.
Parameters used in the merged quality-control report.
Numeric variables for inter-sample metrics.
string
total_features_by_counts,total_counts,pc_mito,pc_ribo
A comma-separated list of numeric variables which differ between individual cells of each sample. The merged sample report will include plots facilitating between-sample comparisons for each of these numeric variables.
Categorical variables for further sub-setting of plots
string
NULL
A comma-separated list of categorical variables. The merged sample report will include additional plots of sample metrics subset by each of these variables (e.g. sex, diagnosis).
Numeric variables for outlier identification.
string
total_features_by_counts,total_counts
The merged report will include tables highlighting samples that are putative outliers for each of these numeric variables.
Parameters for integrating datasets and batch correction.
Choice of integration method.
string
Liger
Unique sample identifier variable.
string
manifest
Fill out matrices with union of genes.
string
false
See rliger::createLiger(). Whether to fill out raw.data matrices with union of genes across all datasets (filling in 0 for missing data) (requires make.sparse = TRUE) (default FALSE).
Remove non-expressing cells/genes.
string
true
See rliger::createLiger(). Whether to remove cells not expressing any measured genes, and genes not expressed in any cells (if take.gene.union = TRUE, removes only genes not expressed in any dataset) (default TRUE).
Number of genes to find for each dataset.
integer
3000
See rliger::selectGenes(). Number of genes to find for each dataset. Optimises the value of var.thresh for each dataset to get this number of genes.
How to combine variable genes across experiments.
string
union
See rliger::selectGenes(). Either "union" or "intersection".
Keep unique genes.
string
false
See rliger::selectGenes().
Capitalize gene names to match homologous genes.
string
false
See rliger::selectGenes().
Treat each column as a cell.
string
true
See rliger::removeMissingObs().
Inner dimension of factorization (n factors).
integer
30
See rliger::optimizeALS(). Inner dimension of factorization (number of factors). Run suggestK to determine appropriate value; a general rule of thumb is that a higher k will be needed for datasets with more sub-structure.
Regularization parameter.
number
5
See rliger::optimizeALS(). Regularization parameter. Larger values penalize dataset-specific effects more strongly (i.e. alignment should increase as lambda increases). Run suggestLambda to determine most appropriate value for balancing dataset alignment and agreement (default 5.0).
Convergence threshold.
number
0.0001
See rliger::optimizeALS().
Maximum number of block coordinate descent iterations.
integer
100
See rliger::optimizeALS().
Number of restarts to perform.
integer
1
See rliger::optimizeALS().
Random seed for reproducible results.
integer
1
Number of nearest neighbours for within-dataset knn graph.
integer
20
See rliger::quantile_norm().
Horizon parameter for shared nearest factor graph.
integer
500
See rliger::quantileAlignSNF(). Distances to all but the k2 nearest neighbors are set to 0 (cuts down on memory usage for very large graphs).
Minimum allowed edge weight.
number
0.2
See rliger::quantileAlignSNF().
Name of dataset to use as a reference.
string
NULL
See rliger::quantile_norm(). Name of dataset to use as a "reference" for normalization. By default, the dataset with the largest number of cells is used.
Minimum number of cells to consider a cluster shared across datasets.
integer
2
See rliger::quantile_norm().
Number of quantiles to use for normalization.
integer
50
See rliger::quantile_norm().
Number of times to perform Louvain community detection.
integer
10
See rliger::quantileAlignSNF(). Number of times to perform Louvain community detection with different random starts (default 10).
Controls the number of communities detected.
integer
1
See rliger::quantileAlignSNF().
Indices of factors to use for shared nearest factor determination.
string
NULL
See rliger::quantile_norm().
Distance metric to use in calculating nearest neighbour.
string
CR
See rliger::quantileAlignSNF(). Default "CR".
Center the data when scaling factors.
string
false
See rliger::quantile_norm().
Small cluster extraction cells threshold.
integer
See rliger::quantileAlignSNF(). Extracts small clusters loading highly on single factor with fewer cells than this before regular alignment (default 0 – no small cluster extraction).
Categorical variables for integration report metrics.
string
individual,diagnosis,region,sex
The integration report will provide plots and integration metrics for these categorical variables.
Reduced dimension embedding for the integration report.
string
UMAP
The integration report will provide plots with and without integration using this embedding.
Settings for dimensionality reduction algorithms.
Input matrix for dimension reduction.
string
PCA,Liger
Dimension reduction outputs to generate.
string
tSNE,UMAP,UMAP3D
Typically 'UMAP,UMAP3D' or 'tSNE'.
Variables to regress out before dimension reduction.
string
nCount_RNA,pc_mito
Number of PCA dimensions.
integer
30
See uwot::umap().
Number of nearest neighbours to use.
integer
35
See uwot::umap().
The dimension of the space to embed into.
integer
2
See uwot::umap(). The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.
Type of initialization for the coordinates.
string
See uwot::umap().
Distance metric for finding nearest neighbours.
string
See uwot::umap().
Number of epochs to use during optimization of embedded coordinates.
integer
200
See uwot::umap().
Initial learning rate used in optimization of coordinates.
integer
1
See uwot::umap().
Effective minimum distance between embedded points.
number
0.4
See uwot::umap(). Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
Effective scale of embedded points.
number
0.85
See uwot::umap(). In combination with min_dist, this determines how clustered/clumped the embedded points are.
Interpolation to combine local fuzzy sets.
number
1
See uwot::umap(). The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
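As an illustration of what interpolating between a fuzzy union and a fuzzy intersection means, the sketch below combines two membership strengths for the same edge. It assumes the probabilistic forms used in the UMAP reference implementation (union a + b − ab, intersection ab). The function name is hypothetical and this is not the uwot source.

```python
def mix_memberships(a, b, set_op_mix_ratio=1.0):
    """Combine two directed edge memberships (each in [0, 1]) into one value."""
    union = a + b - a * b   # fuzzy union (probabilistic sum)
    intersection = a * b    # fuzzy intersection (product)
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection


for ratio in (1.0, 0.5, 0.0):
    print(ratio, round(mix_memberships(0.8, 0.5, ratio), 3))
```

With a ratio of 1.0 an edge survives if either point considers the other a neighbour. With 0.0 both must agree, which tends to give a sparser graph.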
Local connectivity required.
integer
1
See uwot::umap(). The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally.
Weighting applied to negative samples in embedding optimization.
integer
1
See uwot::umap(). Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
Number of negative edge samples to use per positive edge sample.
integer
5
See uwot::umap(). The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding.
Use fast SGD.
string
false
See uwot::umap(). Setting this to TRUE will speed up the stochastic optimization phase, but gives a potentially less accurate embedding that will not be exactly reproducible even with a fixed seed. For visualization, fast_sgd = TRUE will give perfectly good results. For more generic dimensionality reduction, it's safer to leave fast_sgd = FALSE.
Output dimensionality.
integer
2
See Rtsne::Rtsne().
Number of dimensions retained in the initial PCA step.
integer
50
See Rtsne::Rtsne().
Perplexity parameter.
integer
150
See Rtsne::Rtsne().
Speed/accuracy trade-off.
number
0.5
See Rtsne::Rtsne(). Speed/accuracy trade-off (increase for less accuracy), set to 0.0 for exact TSNE (default: 0.5).
Iteration after which perplexities are no longer exaggerated.
integer
250
See Rtsne::Rtsne(). Iteration after which the perplexities are no longer exaggerated (default: 250, except when Y_init is used, then 0).
Iteration after which the final momentum is used.
integer
250
See Rtsne::Rtsne(). Iteration after which the final momentum is used (default: 250, except when Y_init is used, then 0).
Number of iterations.
integer
1000
See Rtsne::Rtsne().
Center data before PCA.
string
true
See Rtsne::Rtsne(). Should data be centered before pca is applied? (default: TRUE)
Scale data before PCA.
string
false
See Rtsne::Rtsne(). Should data be scaled before pca is applied? (default: FALSE).
Normalize data before distance calculations.
string
true
See Rtsne::Rtsne(). Should data be normalized internally prior to distance calculations with normalize_input? (default: TRUE)
Momentum used in the first part of optimization.
number
0.5
See Rtsne::Rtsne().
Momentum used in the final part of optimization.
number
0.8
See Rtsne::Rtsne().
Learning rate.
integer
1000
See Rtsne::Rtsne().
Exaggeration factor used in the first part of the optimization.
integer
12
See Rtsne::Rtsne(). Exaggeration factor used to multiply the P matrix in the first part of the optimization (default: 12.0).
Parameters used to tune louvain/leiden clustering.
Clustering method.
string
leiden
Specify "leiden" or "louvain".
Reduced dimension input(s) for clustering.
string
UMAP_Liger
One or more of "UMAP", "tSNE", "PCA", "LSI".
The resolution of clustering.
number
0.001
Integer number of nearest neighbours for clustering.
integer
50
Integer number of nearest neighbors to use when creating the k nearest neighbor graph for Louvain/Leiden clustering. k is related to the resolution of the clustering result, a bigger k will result in lower resolution and vice versa.
The number of iterations for clustering.
integer
1
Parameters used for cell-type annotation and the associated report.
SingleCellExperiment clusters colData variable name.
string
clusters
Max cells to sample.
integer
10000
A sample metadata unique sample ID.
string
individual
SingleCellExperiment cell-type colData variable name.
string
cluster_celltype
Cell-type metrics for categorical variables.
string
manifest,diagnosis,sex,capdate,prepdate,seqdate
Cell-type metrics for numeric variables.
string
pc_mito,pc_ribo,total_counts,total_features_by_counts
Number of top marker genes for plot/table generation.
integer
5
Parameters for differential gene expression.
Differential gene expression method.
string
MASTZLM
MAST method.
string
See MAST::zlm(). Either 'glm', 'glmer' or 'bayesglm'.
Expressive gene minimum counts.
integer
1
Only genes with at least min_counts in min_cells_pc will be tested for differential gene expression.
Expressive gene minimum cells fraction.
number
0.1
Only genes with at least min_counts in min_cells_pc will be tested for differential gene expression. Default 0.1 (i.e. 10% of cells).
Re-scale numeric covariates.
string
true
Re-scaling and centring numeric covariates in a model can improve model performance.
Pseudobulked differential gene expression.
string
false
Perform differential gene expression on a smaller matrix where counts are first summed across all cells within a sample (defined by dge_sample_var level).
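The pseudobulking step described above (summing counts across all cells within a sample) can be sketched as follows. The data layout, function name and toy identifiers are hypothetical and chosen only to show the aggregation. The pipeline itself operates on its own matrix objects.

```python
from collections import defaultdict


def pseudobulk(cell_counts, cell_to_sample):
    """Sum per-cell gene counts into one count vector per sample.

    cell_counts:    {cell_id: {gene: count}}
    cell_to_sample: {cell_id: sample_id}
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in cell_counts.items():
        sample = cell_to_sample[cell]
        for gene, n in gene_counts.items():
            totals[sample][gene] += n
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {sample: dict(genes) for sample, genes in totals.items()}


cells = {
    "cell1": {"GENE_A": 3, "GENE_B": 0},
    "cell2": {"GENE_A": 1, "GENE_B": 4},
    "cell3": {"GENE_A": 2, "GENE_B": 2},
}
samples = {"cell1": "S1", "cell2": "S1", "cell3": "S2"}
print(pseudobulk(cells, samples))
```

The result has one column per sample instead of one per cell, which is what makes the matrix "smaller".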
Cell-type annotation variable name.
string
cluster_celltype
Differential gene expression is performed separately for each cell-type of this colData variable.
Unique sample identifier variable.
string
manifest
Dependent variable of DGE model.
string
group
The dependent variable may be a categorical (e.g. diagnosis) or a numeric (e.g. histopathology score) variable.
Reference class of categorical dependent variable.
string
Control
If a categorical dependent variable is specified, then the reference class of the dependent variable is specified here (e.g. 'Control').
Confounding variables.
string
cngeneson,seqdate,pc_mito
A comma-separated list of confounding variables to account for in the DGE model.
Random effect confounding variable.
string
NULL
If specified, the term + (1 | x ) + is added to the model, where x is the specified random effects variable.
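To make the random-effect term concrete, the sketch below assembles a model formula string from a dependent variable, a list of confounders and an optional random-effect variable. The function name and the order of terms are assumptions for illustration only. The pipeline's actual formula construction is not shown in this document.

```python
def build_formula(dependent_var, confounders=(), random_effect=None):
    """Assemble a right-hand-side model formula as a string (term order is hypothetical)."""
    terms = [dependent_var]
    if random_effect is not None:
        # Random intercept for each level of the random-effect variable.
        terms.append(f"(1 | {random_effect})")
    terms.extend(confounders)
    return "~ " + " + ".join(terms)


print(build_formula("group", ["cngeneson", "seqdate", "pc_mito"]))
print(build_formula("group", ["cngeneson", "seqdate", "pc_mito"], "individual"))
```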
Fold-change threshold for plotting.
number
1.1
This absolute fold-change cut-off value is used in plots (e.g. volcano) and the DGE report.
Adjusted p-value cutoff.
number
0.05
The adjusted p-value cutoff value is used in plots (e.g. volcano) and the DGE report.
Force model fit for non-full rank.
string
false
A non-full rank model specification will return an error; to override this to return a warning only, set to TRUE.
Maximum CPU cores.
string
'null'
The default value of 'null' utilizes all available CPU cores. As each additional CPU core increases the number of genes simultaneously fit, the RAM/memory demand increases concomitantly. Manually overriding this parameter can reduce the memory demands of parallelization across multiple cores.
Parameters for impacted pathway analysis of differentially expressed genes.
Pathway enrichment tool(s) to use.
string
Enrichment method.
string
ORA
Database(s) to use for enrichment.
string
GO_Biological_Process
See scFlow::list_databases(). Name of the database(s) for enrichment. Examples include "GO_Biological_Process", "GO_Cellular_Component", "GO_Molecular_Function", "KEGG", "Reactome", "Wikipathway".
Parameters for dirichlet modeling of relative cell-type proportions.
Unique sample identifier.
string
individual
Cell-type annotation variable name.
string
cluster_celltype
Dependent variable of Dirichlet model.
string
group
Reference class of categorical dependent variable.
string
Control
Dependent variable classes order.
string
Control,Low,High
For plotting and reports, the order of classes for the dependent variable can be manually specified (e.g. 'Control,Low,High').
General parameters for plotting.
Preferred embedding for plots.
string
UMAP_Liger
Point size for reduced dimension plots.
number
0.1
To improve visualization the point size should be adjusted according to the total number of cells plotted.
Alpha (transparency) value for reduced dimension plots.
number
0.2
To improve visualization the alpha (transparency) value should be adjusted according to the total number of cells plotted.
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.
Institutional configs hostname.
string
Institutional config name.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string
Set the top limit for requested resources for any single job.
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
256.GB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
^(\d+\.?\s*(s|m|h|day)\s*)+$
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
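The three resource limits above can be checked before launching a run. The sketch below reuses the memory and time patterns listed in this section and prints the limits as JSON. The helper name is hypothetical and the check is a convenience only. It does not replace the pipeline's own schema validation.

```python
import json
import re

# Patterns copied from the max_memory and max_time entries above.
MEMORY_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")
TIME_PATTERN = re.compile(r"^(\d+\.?\s*(s|m|h|day)\s*)+$")


def check_limits(max_cpus, max_memory, max_time):
    """Return a list of problems found in the three resource limits."""
    problems = []
    if not isinstance(max_cpus, int) or max_cpus < 1:
        problems.append(f"max_cpus must be a positive integer, got {max_cpus!r}")
    if not MEMORY_PATTERN.match(max_memory):
        problems.append(f"max_memory has an unexpected format: {max_memory!r}")
    if not TIME_PATTERN.match(max_time):
        problems.append(f"max_time has an unexpected format: {max_time!r}")
    return problems


params = {"max_cpus": 16, "max_memory": "256.GB", "max_time": "240.h"}
problems = check_limits(**params)
if problems:
    print("\n".join(problems))
else:
    # JSON text that could be saved to a file and supplied with -params-file.
    print(json.dumps(params, indent=2))
```

If the JSON is saved as, say, params.json (a hypothetical file name), it can be supplied with Nextflow's -params-file option instead of passing each limit on the command line.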
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
Method used to save pipeline results to output directory.
string
The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
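The pattern listed for this parameter is stricter than general email syntax, so it can be worth testing an address against it before use. The sketch below applies the pattern exactly as given above. The function name and the example addresses are hypothetical.

```python
import re

# Pattern copied from the entry above.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)


def is_valid_email(address):
    """True if the address matches the pattern listed for this parameter."""
    return EMAIL_PATTERN.match(address) is not None


for address in ("user@example.com", "user@example", "first+tag@example.com"):
    print(address, is_valid_email(address))
```

Addresses containing characters outside the listed classes (such as "+"), or with a final domain label longer than five letters, do not match this pattern even though they may be deliverable.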
Do not use coloured log outputs.
boolean
Directory to keep pipeline Nextflow logs and reports.
string
${params.outdir}/pipeline_info
Boolean whether to validate parameters against the schema at runtime
boolean
true
Show all params when using --help
boolean
Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter.
boolean
Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead.
boolean
This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues.
E-mail address for optional workflow completion notification.
string
Send plain-text email instead of HTML.
boolean
NA
string