This project will add modules to the nf-core/proteinannotator pipeline, building towards a 1.0.0 release!

Vision

Build the best protein annotator in the world.

Protein fasta -> ??? -> Profit!

  • the ??? is nf-core/proteinannotator
  • We want to build the pipeline of choice for people sequencing the genomes of new creatures, so they can annotate protein FASTA files with function
  • Future options include using synteny of genes, but that is beyond the 1.0.0 release

In-progress annotation tools

So far, we have these PRs in progress that we could use your help on!

  • #9: InterProScan — Started by @olgabot, will work on during the hackathon — looking for contributors to update InterProScan on nf-core/modules
  • #14: Convert FASTA files to Parquet to compute amino acid composition stats, using FastaToParquet from heuermh/dishevelled-bio, started by @heuermh
  • #17: UniProt’s UniFire — Instructions, started at the March hackathon and looking for more contributors!
    • This may end up needing to be written as its own subworkflow because the UniFire container from EBI runs its own internal pipeline that duplicates work, e.g. it runs InterProScan internally and uses the output from that for further analysis
  • #18: DIAMOND-blastp, started at the March hackathon and looking for more contributors!
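
For contributors new to #14, the amino acid composition statistic itself can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how FastaToParquet or the pipeline module works: it does not write Parquet, and the function names read_fasta and amino_acid_composition and the example records are hypothetical.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


def amino_acid_composition(sequence):
    """Return {residue: fraction of the sequence} for one protein sequence."""
    # Drop a trailing stop symbol if present (an assumption about the input).
    counts = Counter(sequence.rstrip("*"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in sorted(counts.items())}


# Hypothetical two-record protein FASTA, held in memory for the example.
example = StringIO(">protA hypothetical example\nMKKA\n>protB\nAAGG*\n")
for record_id, seq in read_fasta(example):
    print(record_id, amino_acid_composition(seq))
```

Per-sequence fractions like these are what a columnar format such as Parquet makes cheap to aggregate across millions of proteins, which is the motivation for doing the conversion first.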

TODO annotation tools

  • proteinfold + FoldSeek — will require folding protein structures, e.g. with ESMFold or AlphaFold2 — looking for contributors!
  • … Another tool you suggest …?

Plus any of the below!

We welcome contributors of all experience levels.

Similar pipelines

Below are pipelines that also process protein fasta files and add either functional or structural information to them, but don’t have exactly the same purpose as proteinannotator. We will likely use their modules.

  • funcscan to search (meta)genomic nucleotide data for functional protein sequences, e.g. for biosynthetic gene clusters, antimicrobial peptide genes, and antimicrobial resistance genes
  • reportho to compare ortholog predictions across methods
  • proteinfamilies to cluster protein sequences into families and update existing families with new sequences
  • proteinfold to fold protein sequences with ESMFold or AlphaFold2