Adding structure prediction modes to nf-core/proteinfold
This section provides guidance on adding new structure prediction modes, implemented via the --mode option, to nf-core/proteinfold.
Contributing
One of the great advantages of an nf-core pipeline is that the community can extend workflows to add new functionalities. In nf-core/proteinfold, this allows adding new protein structure prediction modules as they are released, while still leveraging the existing workflow infrastructure and reporting.
Please consider writing some code to become an nf-core contributor and expand the pipeline! Reach out to a maintainer or contributor for guidance.
We are all contactable on the #proteinfold_dev nf-core Slack channel. That’s the best place for person-to-person discussions about new additions to the pipeline.
Locating pipeline sections
- main.nf: This kicks off each --mode’s workflow once the databases have been prepared on the deployment infrastructure. Relevant parameters are passed from params.[mode_name] (largely populated from the global nextflow.config params, which inherit database locations from dbs.config) through to the [MODE_NAME]() workflow. The channels returned contain the relevant report_input metrics, the top_rank_model (i.e. the best structure from all inference runs), and standard software versioning info.
- subworkflows: largely used for smaller mode-specific set-up workflows, except for the post_processing subworkflow, which will be detailed later.
- workflows/[mode_name].nf: the --mode’s workflow handles input channels of relevant databases, passes them to the local module that does the prediction work (RUN_[MODE_NAME]()), and maps the output from the underlying structure prediction to emitted channels ingested by the reporting modules.
- modules/local/run_[MODE_NAME]: this is where the bulk of the compute work is done. Each underlying structure prediction module is bundled with its own Dockerfile to set up the software in a container, and a /modules/local/run_[MODE_NAME]/main.nf to execute the container from Nextflow.
  - input:
    - meta contains the metadata info of this sub-job, including the id column from the samplesheet.csv, accessed by {meta.id}.
    - path(fasta) (or the more flexible yaml or json) locates the biomolecular input sequence file, where fasta.baseName gives the underlying input file name (not the id label).
    - path(features) is used to pass through multiple sequence alignment (MSA) data, in line with AlphaFold2’s features.pkl file.
    - Other path()s largely locate the core AlphaFold sequence databases (or module-specific variants thereof).
  - output:
    - Outputs are structured as a bundled tuple of two objects: the first is always meta, containing the metadata labels, and the second is a path() to one of the various output data files useful to the end-user. The prediction module is called in a way that returns files to the process’s current directory (.).
  - """script block""":
    - program: the script block calls the program from the Nextflow shell with the program’s typical --flags, in whatever form (binary or script.py) the program is distributed from its codebase repository.
    - extract_metrics.py: accesses the canonical data output formats from the structure prediction program and returns a core set of plain text .tsv metric files.
- bin/extract_metrics.py: a globally accessible program to go from serialised data into .tsv plaintext. It currently applies format-specific extraction logic for .pkl, .json and .npz files. However, as the community adds more --modes to the pipeline, different programs could use the same serialised output format. In that case, extract_metrics.py should be refactored to select the extraction logic based on a --mode passed to extract_metrics.py.
- subworkflows/local/post_processing.nf: the POST_PROCESSING{} subworkflow sits after all possible [MODE_NAME]() workflows in main.nf. It passes along visualisation options, metrics data files, and report templates (single or comparison). Those reports are created with the GENERATE_REPORT() or COMPARE_STRUCTURES() modules in modules/local/, respectively.
- bin/generate_[comparison]_report.py takes the HTML templates at assets/[report|comparison]_template.html and populates them with plots created inside these Python scripts.
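The refactoring suggested for extract_metrics.py (choosing the extraction logic from --mode rather than from the file extension alone) could look roughly like the sketch below. It is an illustration only, not the pipeline's actual code: the function names, the two registries, and the fallback order are all assumptions made for this example.

```python
import json
import pickle
from pathlib import Path


def load_pkl(path):
    """Load a pickled result dictionary."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_json(path):
    """Load a JSON result file."""
    with open(path) as handle:
        return json.load(handle)


def load_npz(path):
    """Load every array stored in a NumPy .npz archive."""
    import numpy as np  # imported lazily so the other loaders work without NumPy

    with np.load(path) as archive:
        return {key: archive[key] for key in archive.files}


# Hypothetical registries: a default loader per file suffix, plus optional
# per-mode overrides for programs that share a suffix but differ in layout.
LOADERS_BY_SUFFIX = {".pkl": load_pkl, ".json": load_json, ".npz": load_npz}
LOADERS_BY_MODE = {}


def select_loader(path, mode=None):
    """Prefer a mode-specific loader, then fall back to the file suffix."""
    if mode in LOADERS_BY_MODE:
        return LOADERS_BY_MODE[mode]
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS_BY_SUFFIX:
        raise ValueError(f"No metrics loader registered for '{suffix}' files")
    return LOADERS_BY_SUFFIX[suffix]
```

With this shape, a new mode whose output shares a suffix with an existing program only needs one extra entry in the (hypothetical) LOADERS_BY_MODE registry, and existing modes keep their suffix-based behaviour.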
Process labelling
At the top of a module’s RUN_[MODE_NAME]{} process, a series of labels allows the nextflow.config to route the job to the appropriate resources on the compute cluster. label 'process_gpu' marks the AI inference stages that require GPU-intensive computation. Other processes can use the default labels that request CPU resources; once they finish, their outputs flow on to the GPU-enabled steps through Nextflow’s dataflow paradigm.
Processable structure prediction metrics
Metrics from AlphaFold-inspired protein structure prediction programs are structured in two ways: as a table, or as a matrix (predicted aligned error, PAE, values).
When contributing a new mode to proteinfold, functionality should be added to extract_metrics.py to access the canonical output files of the new program and extract the data into compliant .tsv files that can be easily processed by the downstream plotting and MultiQC functions.
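To make the two metric shapes concrete, here is a minimal sketch of writing a tabular metric (per-residue pLDDT) and a matrix metric (PAE) as tab-separated text. The function names, the column headers, and the two-decimal formatting are assumptions for illustration; a real contribution should match the layout of the .tsv files that the existing modes already produce.

```python
import csv


def write_plddt_tsv(plddt, out_path):
    """Tabular metric: one row per residue (hypothetical column names)."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["position", "plddt"])
        for position, score in enumerate(plddt, start=1):
            writer.writerow([position, f"{score:.2f}"])


def write_pae_tsv(pae, out_path):
    """Matrix metric: row i, column j holds the value for residue pair (i, j)."""
    size = len(pae)
    if any(len(row) != size for row in pae):
        raise ValueError("PAE matrix must be square")
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for row in pae:
            writer.writerow([f"{value:.2f}" for value in row])
```

Keeping both writers free of program-specific logic means a new mode only has to supply a list of per-residue scores and a square matrix, whatever format its own output files use.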