nf-core/deepmutscan
nf-core/deepmutscan is a reproducible, scalable, and community-curated pipeline for analyzing deep mutational scanning (DMS) data using shotgun DNA sequencing.
Introduction
nf-core/deepmutscan is a workflow designed for the analysis of deep mutational scanning (DMS) data. DMS enables researchers to experimentally measure the fitness effects of thousands of genes or gene variants simultaneously, helping to classify disease causing mutants in human and animal populations, to learn the fundamental rules of virus evolution, protein architecture, splicing, small-molecule interactions and many other phenotypes.
While DNA synthesis and sequencing technologies have advanced substantially, long open reading frame (ORF) targets still present major challenges for DMS studies. Shotgun DNA sequencing can be used to greatly speed up the inference of long ORF mutant fitness landscapes, theoretically at no expense in accuracy. We have designed the nf-core/deepmutscan pipeline to unlock the power of shotgun sequencing based DMS studies on long ORFs, to simplify and standardise the complex bioinformatics steps involved in data processing of such experiments – from read alignment to QC reporting and fitness landscape inferences.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.
Major features
- End-to-end analyses of various DMS data
- Modular, three-stage workflow: alignment → QC → error-aware fitness estimation
- Integration with popular statistical fitness estimation tools like DiMSum, Enrich2, rosace and mutscan
- Support of multiple mutagenesis strategies, e.g. by nicking with degenerate NNK and NNS codons
- Containerisation via Docker, Singularity and Apptainer
- Scalability across HPC and Cloud systems
- Monitoring of CPU, memory, and CO₂ usage
For more details on the pipeline and on potential future expansions, please consider reading our usage description.
Step-by-step pipeline summary
The pipeline processes deep mutational scanning (DMS) sequencing data in several stages:
- Alignment of reads to the reference open reading frame (ORF) (
BWA-mem) - Filtering of wildtype and erroneous reads (
samtools view) - Read merging for base error reduction (
vsearch merge,BWA-mem) - Mutation counting (
GATK AnalyzeSaturationMutagenesis) - DMS library quality control
- Data summarisation across samples
- Single nucleotide variant error correction (in development)
- Fitness estimation (in development)
Usage
If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
First, prepare a samplesheet with your input/output data in which each row represents a pair of fastq files (paired end). This should look as follows:
sample,type,replicate,file1,file2
ORF1,input,1,/reads/forward1.fastq.gz,/reads/reverse1.fastq.gz
ORF1,input,2,/reads/forward2.fastq.gz,/reads/reverse2.fastq.gz
ORF1,output,1,/reads/forward3.fastq.gz,/reads/reverse3.fastq.gz
ORF1,output,2,/reads/forward4.fastq.gz,/reads/reverse4.fastq.gzSecondly, specify the gene or gene region of interest using a reference FASTA file via --fasta. Provide the exact codon coordinates using --reading_frame.
Now, you can run the pipeline using:
nextflow run nf-core/deepmutscan \
-profile <docker/singularity/.../institute> \
--input ./samplesheet.csv \
--fasta ./ref.fa \
--reading_frame 1-300 \
--outdir ./resultsThere are several optional parameters, some of which are currently in development. For more details and further functionality, please refer to the usage documentation.
Pipeline output
After execution, the pipeline creates the following directory structure:
results/
├── fastqc/ # Individual HTML reports for specified fastq files, raw sequencing QC
├── fitness/ # Merged variant count tables, fitness and error estimates, replicate correlations and heatmaps
├── intermediate_files/ # Raw alignments, raw and pre-filtered variant count tables, QC reports
├── library_QC/ # Sample-specific PDF visualizations: position-wise sequencing coverage, count heatmaps, etc.
├── multiqc/ # Shared HTML reports for all fastq files, raw sequencing QC
├── pipelineinfo/ # Nextflow helper files for timeline and summary report generation
├── timeline.html # Nextflow timeline for all tasks
└── report.html # Nextflow summary report incl. detailed CPU and memory usage per for all tasksFor a full overview of the output file types, please refer to the specific documentation.
Contributing
We welcome contributions from the community!
For technical challenges and feedback on the pipeline, please use our Github repository. Please open an issue or pull request to:
- Report bugs or solve data incompatibilities when running
nf-core/deepmutscan - Suggest the implementation of new modules for custom DMS workflows
- Help improve this documentation
If you are interested in getting involved as a developer, please consider joining our interactive #deepmutscan Slack channel (via this invite).
Credits
nf-core/deepmutscan was originally written by Benjamin Wehnert and Max Stammnitz at the Centre for Genomic Regulation, Barcelona, with the generous support of an EMBO Long-term Postdoctoral Fellowship and a Marie Skłodowska-Curie grant by the European Union.
If you use nf-core/deepmutscan in your analyses, please cite:
📄 Wehnert et al., bioRxiv preprint (coming soon)
Please also cite the nf-core framework:
📄 Ewels et al., Nature Biotechnology, 2020 https://doi.org/10.1038/s41587-020-0439-x
Scientific contact
For scientific discussions around the use of this pipeline (e.g. on experimental design or sequencing data requirements), please feel free to get in touch with us directly:
- Benjamin Wehnert — wehnertbenjamin@gmail.com
- Maximilian Stammnitz — maximilian.stammnitz@crg.eu