Version history

Changed

  • #170 - Updated local modules to versions topic output. (by @vagkaratzas)
  • #171
    • Updated nf-core modules and subworkflows to latest, removing all remaining ch_versions. (by @vagkaratzas)
    • Updated pipeline-level nf-schema to 2.7.2. (by @vagkaratzas)

Dependencies

Tool Previous version New version
multiqc 1.34 1.35

Added

  • #159 - Added functionality to generate a compressed HMM library file containing the final protein family HMMs for each input sample row. Family library files can be found at hmm/library. (by @juanfmx2) (Hackathon 2026)
  • #154 - Added optional save parameters for update_families mode: --save_update_families_pre_clipped_fasta and --save_update_families_clipped_fasta (with gaps removed). These save FASTA files from updated family MSAs at various stages of the subworkflow (update_families/fasta/pre_clipped/, update_families/fasta/pre_clipped_non_redundant_sequences/, and update_families/fasta/post_clipped/). (by @eparisis) (Hackathon 2026)

Changed

  • #162 - nf-core tools template update to 4.0.2. (by @vagkaratzas)
  • #154 - Moved update_families non-redundant sequence FASTA output from mmseqs/update_families/non_redundant_sequences/ to update_families/fasta/pre_clipped_non_redundant_sequences/${meta.id}/ (now controlled by --save_update_families_pre_clipped_fasta). (by @eparisis) (Hackathon 2026)
  • #153 - Typo fixes and unused boilerplate removal. (by @vagkaratzas)
  • #152
    • Removed an always-truthy if condition. (by @vagkaratzas)
    • Moved a wrongly placed .first() to now properly report all sample representative sequences on the MultiQC report. (by @vagkaratzas)
  • #150 - Pipeline logos and workflow maps updated. (by @vagkaratzas)
  • #149 - Updated the nf-core/proteinfamilies article citation to the GigaScience publication. (by @vagkaratzas)
  • #147 - Modules, subworkflows and pipeline code updates to fix linting warnings and errors. (by @vagkaratzas)

Dependencies

Tool Previous version New version
clipkit 2.4.1 2.11.4
multiqc 1.33 1.34

Added

  • #143
    • Added the cmaple module for optional phylogenetic tree inference for final family full MSAs. (by @vagkaratzas)
    • Added extra nf-tests for the REMOVE_REDUNDANCY subworkflow. (by @vagkaratzas)
  • #140 - Using the new workflow output syntax to publish the downstream nf-core/proteinannotator samplesheet. (by @vagkaratzas)

Changed

  • #143 - Updated metro-maps, citations, README.md and output.md to include cmaple phylogenetic trees. (by @vagkaratzas)
  • #141 - Aligning the default output directory of the new output syntax with the publishing directory of the pipeline. (by @vagkaratzas)
  • #140
    • Updated metro-map to also depict the optional creation of downstream samplesheets for nf-core/proteinfold and nf-core/proteinannotator. Also swapped seqkit/rmdup and seqkit/replace modules to their proper execution sequence. (by @vagkaratzas)
    • Updated the test_full profile time and memory requirements to avoid AWS failures on release. (by @vagkaratzas)

Fixed

  • #143
    • Fixed a bug in REMOVE_REDUNDANCY subworkflow, where the combination of these skip flags --skip_sequence_redundancy_removal true, --skip_additional_sequence_recruiting true and --skip_additional_sequence_recruiting false, would execute HHSUITE_REFORMAT_FILTERED to reformat Stockholm alignments while they were already in fasta (or clipkit) format. (by @vagkaratzas)
    • Fixed a bug in REMOVE_REDUNDANCY subworkflow, where unmerged similar families would be removed along with redundant ones, when --skip_family_merging was true and --skip_family_redundancy_removal was false. (by @vagkaratzas)

Dependencies

Tool Previous version New version
multiqc 1.32 1.33
cmaple - 1.1.0

Added

  • #133 - Using the new workflow output syntax to publish the downstream nf-core/proteinfold samplesheet. (by @vagkaratzas)
  • #132 - Added optimized memory and time resources for test and test_full profiles. (by @vagkaratzas)

Changed

  • #136 - Based on protein family reproducibility benchmarks, cluster is now the default MMseqs2 mode, due to its increased sensitivity compared to linclust. (by @vagkaratzas)
  • #135 - nf-core tools template update to 3.5.1. (by @vagkaratzas)

Dependencies

Tool Previous version New version
mmseqs 17.b804f 18.8cc5c

Special Thanks

To @jfy133 and @JoseEspinosa for their assistance with all queries regarding chaining nf-core/proteinfamilies to nf-core/proteinfold.

Added

  • #124
    • Added new subworkflow MERGE_FAMILIES that can optionally merge similar (but not redundant) generated protein families. (by @vagkaratzas)
    • Added new functionality to the local module IDENTIFY_REDUNDANT_FAMS which now also detects and outputs the identifiers of similar families that can optionally be merged downstream. These identifiers are written to “/remove_redundancy/<samplename>/similar_fam_ids.txt”, and the corresponding family pairwise similarity scores to “/remove_redundancy/<samplename>/similarities.csv”. (by @vagkaratzas)
    • Added new local module POOL_SIMILAR_COMPONENTS that generates family clusters, from a family-similarity edgelist. (by @vagkaratzas)
    • Added new local module MERGE_SEEDS that merges seed alignments of similar families, before restarting the family generation subworkflow. (by @vagkaratzas)
  • #118
    • Added preprint citation to the repo. (by @vagkaratzas)
    • Added separate metro map files for dark and light browser modes. (by @vagkaratzas)
    • Added new local module EXTRACT_FAMILY_MEMBERS which outputs a two-column TSV file containing the final family identifiers and their corresponding member sequence identifiers. The file is saved at “/family_reps/<samplename>/<samplename>.tsv”. (by @vagkaratzas)
  • #117
    • Added SEQKIT_SEQ for optional sequence preprocessing in the quality check subworkflow. (by @vagkaratzas)
    • Added SEQKIT_REPLACE for optional sequence name parsing in the quality check subworkflow. (by @vagkaratzas)
    • Added SEQKIT_RMDUP for optional removal of duplicate names and sequences in the quality check subworkflow. (by @vagkaratzas)
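
The #124 entry above says POOL_SIMILAR_COMPONENTS generates family clusters from a family-similarity edgelist. A minimal sketch of that idea, assuming the clusters are the connected components of the edge list, is shown below. The function name `pool_components` and the family ids are hypothetical, and this is not the module's actual code.

```python
from collections import defaultdict


def pool_components(edges):
    """Group family ids into clusters: the connected components of the edge list."""
    parent = {}

    def find(node):
        # Follow parent links to the root, shortening the path on the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical edge list: each pair is two families flagged as similar.
edges = [("fam1", "fam2"), ("fam2", "fam3"), ("fam7", "fam9")]
print(pool_components(edges))  # [['fam1', 'fam2', 'fam3'], ['fam7', 'fam9']]
```

Under this reading, fam1 and fam3 land in the same cluster even though they are only linked through fam2.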

Changed

  • #128 - nf-core tools template update to 3.4.1.
  • #124
    • Conditional workflow flags switched to their skip opposites; --trim_msa to --skip_msa_trimming, --recruit_sequences_with_models to --skip_additional_sequence_recruiting, --remove_family_redundancy to --skip_family_redundancy_removal, --remove_sequence_redundancy to --skip_sequence_redundancy_removal. (by @vagkaratzas)
  • #118
    • Swapped the local CHECK_QUALITY subworkflow with the new nf-core one FAA_SEQFU_SEQKIT. (by @vagkaratzas)
    • Based on protein family reproducibility benchmarks (i.e., computationally reproducing manually curated protein family resources), the cluster_seq_identity and cluster_coverage parameter default values have been updated to 0.3 and 0.5 (down from 0.5 and 0.9) respectively. (by @vagkaratzas)
  • #117 - Swapped the local SEQKIT_STATS and the local SEQKIT_STATS_TO_MQC modules with the SEQFU_STATS one, which runs a bit faster and produces a MultiQC-ready output without the need for manual parsing. (by @vagkaratzas)

Dependencies

Tool Previous version New version
seqfu - 1.20.3
multiqc 1.30 1.31

Deprecated

  • #124 - Deprecated --trim_msa, --recruit_sequences_with_models, --remove_family_redundancy and --remove_sequence_redundancy. (by @vagkaratzas)
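
For users migrating an existing parameter set, the renames listed under #124 can be sketched as a small mapping. The helper name `migrate_params` is hypothetical. The sketch also assumes that each old "do X" flag is the boolean opposite of its new "skip X" flag, which the entry implies but does not state outright.

```python
# Old flag -> new skip flag, as listed in the #124 entry.
RENAMED_FLAGS = {
    "trim_msa": "skip_msa_trimming",
    "recruit_sequences_with_models": "skip_additional_sequence_recruiting",
    "remove_family_redundancy": "skip_family_redundancy_removal",
    "remove_sequence_redundancy": "skip_sequence_redundancy_removal",
}


def migrate_params(params):
    """Return a copy of params with deprecated flags renamed and their booleans inverted."""
    migrated = {}
    for name, value in params.items():
        if name in RENAMED_FLAGS:
            # Assumption: "do X = true" corresponds to "skip X = false".
            migrated[RENAMED_FLAGS[name]] = not value
        else:
            migrated[name] = value
    return migrated


old = {"trim_msa": True, "remove_family_redundancy": False, "outdir": "results"}
print(migrate_params(old))
# {'skip_msa_trimming': False, 'skip_family_redundancy_removal': True, 'outdir': 'results'}
```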

Special Thanks

To @jfy133, @erikrikarddaniel and @chrisAta for this version’s PR code reviews.

Fixed

  • #112 - Fixed a bug in EXTRACT_FAMILY_REPS, where all sequences were pasted into the family representative one, and updated the relevant local nf-test. (by @vagkaratzas)

Changed

  • #106 - Swapped the local EXECUTE_CLUSTERING subworkflow with the new nf-core MMSEQS_FASTA_CLUSTER one. (by @vagkaratzas)

Dependencies

Tool Previous version New version
multiqc 1.29 1.30

Changed

  • #104 - Pulled params from local subworkflows into the main workflow.
  • #103 - Parallelized execution for the EXTRACT_FAMILY_REPS local module and changed its input from full_msa to fasta.
  • #100 - CAT_CAT module replaced with FIND_CONCATENATE to avoid "Argument list too long" errors on large-scale inputs.
  • #98 - nf-core tools template update to 3.3.2.

Added

  • #105 - CHECK_QUALITY subworkflow added at the start of the pipeline. It utilizes the seqkit/stats nf-core module to generate a MultiQC-ready report with statistics for the input amino acid sequences. The metro-map has been updated to reflect this change.

Added

  • #93
    • Added nf-test and meta.yml file for local subworkflow GENERATE_FAMILIES.
    • Added nf-test and meta.yml file for local subworkflow REMOVE_REDUNDANCY.
    • Added nf-test and meta.yml file for local subworkflow UPDATE_FAMILIES.
  • #88
    • Added nf-test and meta.yml file for local module BRANCH_HITS_FASTA.
    • Added nf-test and meta.yml file for local module FILTER_NON_REDUNDANT_FAMS.
    • Added nf-test and meta.yml file for local module IDENTIFY_REDUNDANT_FAMS.
    • Added nf-test and meta.yml file for local module EXTRACT_FAMILY_REPS.
    • Added the default pipeline end-to-end nf-test.

Changed

  • #81 - nf-core tools template update to 3.3.1.

Fixed

  • #80 - Fixed a bug where, due to a missing check for equal family sizes, non-redundant families were erroneously marked as redundant through transitive relationships and were removed.

Changed

  • #77 - Default branch changed from master to main.
  • #73 - Changed the fasta parsing library of the CHUNK_CLUSTERS local module, from pyfastx back to the latest version of biopython, and parallelized its writing mechanism, achieving decreased execution time.

Dependencies

Tool Previous version New version
biopython 1.84 1.85
pyfastx 2.2.0 -

Removed

  • #73 - Removed the pyfastx version of the CHUNK_CLUSTERS module, since it performed poorly on larger datasets.

Added

  • #69 - Added the hhsuite/reformat nf-core module to reformat .sto alignments to .fas when in-family sequence redundancy is not removed. Also added the option to save intermediate and final family fasta files throughout the workflow with various save parameters.
  • #58 - Added nf-test and meta.yml file for local module REMOVE_REDUNDANCY_SEQS (Hackathon 2025)
  • #56 - Added nf-test and meta.yml file for local module FILTER_RECRUITED (Hackathon 2025)
  • #55 - Added nf-test and meta.yml file for local module CHUNK_CLUSTERS (Hackathon 2025)
  • #54 - Added nf-test for local subworkflow ALIGN_SEQUENCES (Hackathon 2025)
  • #53 - Added nf-test for local subworkflow EXECUTE_CLUSTERING (Hackathon 2025)
  • #51 - Added nf-test and meta.yml file for local module CALCULATE_CLUSTER_DISTRIBUTION (Hackathon 2025)
  • #34 - Added the EXTRACT_UNIQUE_CLUSTER_REPS module, that calculates initial MMseqs clustering metadata, for each sample, to print with MultiQC (Id,Cluster Size,Number of Clusters)
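
The #34 entry mentions a Cluster Size / Number of Clusters table for MultiQC. One way such a table could be derived is sketched below. It assumes a two-column, tab-separated clustering file (representative id, member id), as produced by MMseqs2's `createtsv`. The function name `cluster_size_distribution` and the example ids are hypothetical, and this is not the module's actual script.

```python
import csv
import io
from collections import Counter


def cluster_size_distribution(tsv_handle):
    """Count how many clusters exist for each cluster size.

    Expects two tab-separated columns per row: representative id, member id.
    """
    members_per_cluster = Counter()
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if not row:
            continue
        members_per_cluster[row[0]] += 1
    # Map cluster size -> number of clusters of that size.
    return dict(sorted(Counter(members_per_cluster.values()).items()))


# Hypothetical clustering: repA and repC have two members each, repB is a singleton.
example = io.StringIO("repA\trepA\nrepA\tseq2\nrepB\trepB\nrepC\trepC\nrepC\tseq5\n")
print(cluster_size_distribution(example))  # {1: 1, 2: 2}
```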

Fixed

  • #69 - Fixed a bug where redundant family alignments were not published properly if the intra-family redundancy removal mechanism was switched off #68
  • #65 - Fixed a bug in CHUNK_CLUSTERS, where the pipeline would crash if the module filtered out all clusters due to a high membership threshold #64
  • #35 - Fixed a bug in remove_redundant_fams.py, where family sizes were compared as strings instead of integers when deciding which (larger) family to keep
  • #33 - Fixed an always-true condition in the filter_non_redundant_hmms.py script, by adding missing parentheses
  • #29 - Fixed hmmalign empty input crash error, by preventing the FILTER_RECRUITED module from creating an empty output .fasta.gz file, when there are no remaining sequences after filtering the hmmsearch results #28
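
The string-versus-integer comparison bug fixed in #35 is a common pitfall when sizes are read from text files. The sketch below shows the general effect in Python. The helper name `larger_family` is hypothetical and this is not the script's actual code.

```python
def larger_family(size_a, size_b):
    """Return the larger of two family sizes given as text fields, compared numerically."""
    return max(int(size_a), int(size_b))


# Compared as strings, "9" ranks above "10" because "9" > "1" character by character.
print("9" > "10")                # True
print(int("9") > int("10"))      # False
print(larger_family("9", "10"))  # 10
```

Converting to integers before comparing keeps the family with more members rather than the one whose size string sorts later.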

Changed

  • #69
    • Changed the publish directory architecture for HMMs, seed MSAs, full MSAs and family FASTA files, to make it more intuitive.
    • REMOVE_REDUNDANT_FAMS local module converted to IDENTIFY_REDUNDANT_FAMS to extract redundant family ids which will then be used downstream.
    • FILTER_NON_REDUNDANT_HMMS local module converted to FILTER_NON_REDUNDANT_FAMS and reused four times (HMM, seed MSA, full MSA, FASTA).
    • Changed the output format of the EXTRACT_FAMILY_REPS and REMOVE_REDUNDANT_SEQS local modules from .fa to .faa.
    • Metro map updated with new hhsuite/reformat module.
  • #57 - Slight improvements to nextflow_schema.json (Hackathon 2025)
  • #57 - Slight improvements to assets/schema_input.json (Hackathon 2025)
  • #34 - Swapped the SeqIO python library with pyfastx for the CHUNK_CLUSTERS module, quartering its duration
  • #32 - Updated ClipKIT from 2.4.0 to 2.4.1, which now also allows ends-only trimming and so completely replaces the custom CLIP_ENDS module. Users can now also define its output format by setting the --clipkit_out_format parameter (default: clipkit)

Dependencies

Tool Previous version New version
ClipKIT 2.4.0 2.4.1
pyfastx - 2.2.0
hhsuite - 3.3.0
multiqc 1.27 1.28

Deprecated

  • #32 - Deprecated the CLIP_ENDS module and the --clipping_tool parameter. ClipKIT is now the only option and covers both previous modes; ends-only trimming is selected by setting --trim_ends_only.

Initial release of nf-core/proteinfamilies, created with the nf-core template.

Added

  • Amino acid sequence clustering (mmseqs)
  • Multiple sequence alignment (famsa, mafft, clipkit)
  • Hidden Markov Model generation (hmmer)
  • Between families redundancy removal (hmmer)
  • In-family sequence redundancy removal (mmseqs)
  • Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
  • Family statistics presentation (multiqc)

By @vagkaratzas and @mberacochea.