nf-core/proteinfamilies
Generation and update of protein families
metagenomics, protein-families, proteomics
Version history
Changed
- #77 - Default branch changed from `master` to `main`.
- #73 - Changed the fasta parsing library of the `CHUNK_CLUSTERS` local module, from `pyfastx` back to the latest version of `biopython`, and parallelized its writing mechanism, achieving decreased execution time.
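The parallelized writing mentioned in #73 can be pictured with a minimal sketch. It is not the module's actual code: the function names, the `chunk_<i>.fasta` naming scheme and the use of a thread pool are assumptions made for illustration, and FASTA parsing (done with `biopython` in the module) is left out so the example stays standard-library only.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_chunk(path, records):
    """Write one chunk of (header, sequence) pairs as FASTA and return its path."""
    with open(path, "w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n{sequence}\n")
    return path


def write_chunks_parallel(chunks, out_dir, max_workers=4):
    """Write every chunk to its own file using a pool of worker threads.

    `chunks` is a list of lists of (header, sequence) pairs (hypothetical layout).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_chunk, out_dir / f"chunk_{i}.fasta", records)
            for i, records in enumerate(chunks)
        ]
        # result() re-raises any exception raised inside a worker.
        return [future.result() for future in futures]
```

Each chunk goes to a separate file, so the workers never share a file handle and need no locking.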
Dependencies
Tool | Previous version | New version |
---|---|---|
biopython | 1.84 | 1.85 |
pyfastx | 2.2.0 | |
Removed
- #73 - Deprecated `pyfastx` module version of `CHUNK_CLUSTERS`, since it was struggling performance-wise with larger datasets.
Added
- #69 - Added the `hhsuite/reformat` nf-core module to reformat `.sto` alignments to `.fas` when in-family sequence redundancy is not removed. Also added the option to save intermediate and final family fasta files throughout the workflow with various `save` parameters.
- #58 - Added nf-test and `meta.yml` file for local module `REMOVE_REDUNDANCY_SEQS` (Hackathon 2025)
- #56 - Added nf-test and `meta.yml` file for local module `FILTER_RECRUITED` (Hackathon 2025)
- #55 - Added nf-test and `meta.yml` file for local module `CHUNK_CLUSTERS` (Hackathon 2025)
- #54 - Added nf-test for local subworkflow `ALIGN_SEQUENCES` (Hackathon 2025)
- #53 - Added nf-test for local subworkflow `EXECUTE_CLUSTERING` (Hackathon 2025)
- #51 - Added nf-test and `meta.yml` file for local module `CALCULATE_CLUSTER_DISTRIBUTION` (Hackathon 2025)
- #34 - Added the `EXTRACT_UNIQUE_CLUSTER_REPS` module, that calculates initial `MMseqs` clustering metadata, for each sample, to print with `MultiQC` (Id, Cluster Size, Number of Clusters)
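As a rough illustration of the kind of clustering metadata described in #34, the sketch below counts how many clusters exist for each cluster size. It assumes the input is a list of (representative, member) pairs, as found in a two-column MMseqs2 cluster TSV; the function name and return shape are hypothetical and not taken from the module.

```python
from collections import Counter


def cluster_size_distribution(cluster_rows):
    """Return {cluster_size: number_of_clusters} from (representative, member) pairs."""
    # One row per member, so counting rows per representative gives the cluster size.
    members_per_rep = Counter(rep for rep, _member in cluster_rows)
    # Count how many clusters share each size, sorted by size.
    return dict(sorted(Counter(members_per_rep.values()).items()))


rows = [("A", "A"), ("A", "B"), ("C", "C"), ("D", "D"), ("D", "E")]
print(cluster_size_distribution(rows))  # {1: 1, 2: 2}
```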
Fixed
- #69 - Fixed a bug where redundant family alignments were not published properly, if intra-family redundancy removal mechanism was switched off #68
- #65 - Fixed a bug in `CHUNK_CLUSTERS`, where pipeline would crash if the module filtered out all clusters, due to a high membership threshold #64
- #35 - Fixed a bug in `remove_redundant_fams.py`, where comparison was between strings instead of integers to keep larger family
- #33 - Fixed an always-true condition at the `filter_non_redundant_hmms.py` script, by adding missing parentheses
- #29 - Fixed `hmmalign` empty input crash error, by preventing the `FILTER_RECRUITED` module from creating an empty output `.fasta.gz` file, when there are no remaining sequences after filtering the `hmmsearch` results #28
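The string-versus-integer pitfall behind #35 is easy to reproduce in isolation. The snippet below is a generic sketch, not the code of `remove_redundant_fams.py`; the `(name, size)` tuples and the `larger_family` helper are hypothetical.

```python
def larger_family(family_a, family_b):
    """Return the (name, size) pair with the larger size, comparing sizes as integers."""
    return family_a if int(family_a[1]) >= int(family_b[1]) else family_b


# Sizes read from a text file arrive as strings.
fam_small = ("fam_1", "9")
fam_large = ("fam_2", "10")

# Lexicographic comparison ranks "9" above "10", which picks the wrong family.
print(max(fam_small, fam_large, key=lambda fam: fam[1]))  # ('fam_1', '9')
# Converting to int first keeps the family with more members.
print(larger_family(fam_small, fam_large))  # ('fam_2', '10')
```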
Changed
- #69 - Changed the publish directory architecture for HMMs, seed MSAs, full MSAs and family FASTA files, to make it more intuitive. `REMOVE_REDUNDANT_FAMS` local module converted to `IDENTIFY_REDUNDANT_FAMS` to extract redundant family ids which will then be used downstream. `FILTER_NON_REDUNDANT_HMMS` local module converted to `FILTER_NON_REDUNDANT_FAMS` and reused four times (HMM, seed MSA, full MSA, FASTA). Changed the output format of the `EXTRACT_FAMILY_REPS` and `REMOVE_REDUNDANT_SEQS` local modules from `.fa` to `.faa`. Metro map updated with new `hhsuite/reformat` module.
- #57 - Slight improvements of `nextflow_schema.json` (Hackathon 2025)
- #57 - Slight improvements of `assets/schema_input.json` (Hackathon 2025)
- #34 - Swapped the `SeqIO` python library with `pyfastx` for the `CHUNK_CLUSTERS` module, quartering its duration
- #32 - Updated `ClipKIT` 2.4.0 -> 2.4.1, which now also allows ends-only trimming, to completely replace the custom `CLIP_ENDS` module. Users can now also define its output format by setting the `--clipkit_out_format` parameter (default: `clipkit`)
Dependencies
Tool | Previous version | New version |
---|---|---|
ClipKIT | 2.4.0 | 2.4.1 |
pyfastx | 2.2.0 | |
hhsuite | 3.3.0 | |
multiqc | 1.27 | 1.28 |
Deprecated
- #32 - Deprecated `CLIP_ENDS` module and `--clipping_tool` parameter. The only option now is `ClipKIT`, covering both previous modes, via setting `--trim_ends_only`
Initial release of nf-core/proteinfamilies, created with the nf-core template.
Added
- Amino acid sequence clustering (mmseqs)
- Multiple sequence alignment (famsa, mafft, clipkit)
- Hidden Markov Model generation (hmmer)
- Between families redundancy removal (hmmer)
- In-family sequence redundancy removal (mmseqs)
- Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
- Family statistics presentation (multiqc)
By @vagkaratzas and @mberacochea.