nf-core/genomeannotator
Pipeline for the identification of (coding) gene structures in draft genomes.
Define where the pipeline should find input data and save output data.
Path to the genome assembly.
string
^\S+\.fn?a(sta)?$
This is the assembly you wish to annotate.
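The pattern listed for this parameter restricts which file names are accepted. As a minimal sketch (the regular expression is copied from the entry above; the helper name and the example file names are made up for illustration), the following Python snippet shows which extensions would pass:

```python
import re

# Pattern copied verbatim from the schema entry above.
FASTA_PATTERN = re.compile(r"^\S+\.fn?a(sta)?$")


def is_accepted_fasta_path(path: str) -> bool:
    """Return True if the path satisfies the schema pattern (hypothetical helper)."""
    return FASTA_PATTERN.match(path) is not None


# Hypothetical file names, for illustration only.
for name in ["genome.fa", "genome.fna", "genome.fasta", "genome.fa.gz", "my genome.fa"]:
    print(name, is_accepted_fasta_path(name))
```

Under this pattern, gzip-compressed files and paths containing whitespace are rejected, so such inputs would likely need to be decompressed or renamed first.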
The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.
string
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.
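The e-mail pattern listed above is stricter than it may look. The sketch below (pattern copied from the entry; the helper name and the addresses are invented examples) can be used to check an address before launching a run:

```python
import re

# Pattern copied verbatim from the schema entry above.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)


def is_accepted_email(address: str) -> bool:
    """Return True if the address satisfies the schema pattern (hypothetical helper)."""
    return EMAIL_PATTERN.match(address) is not None


# Invented addresses, for illustration only.
for address in ["first.last@example.org", "first+tag@example.org", "user@example.museum"]:
    print(address, is_accepted_email(address))
```

Addresses with a "+" in the local part, or with a top-level domain longer than five letters, do not match this pattern.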
MultiQC report title. Printed as page header, used for filename if not otherwise specified.
string
Path to samplesheet for RNAseq data.
string
^\S+\.csv$
If you wish to include RNAseq data, you will need to create a samplesheet in CSV format. Use this parameter to specify its location. It has to be a comma-separated file with 4 columns, and a header row.
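The structural requirements stated above (comma-separated, a header row, a fixed number of columns) can be checked before a run. The following is only a sketch: the function name is hypothetical, and the column names in the example are placeholders, since the actual header names are not given in this text.

```python
import csv
import io


def check_samplesheet(text: str, expected_columns: int) -> int:
    """Check header and column count of a CSV samplesheet; return the number of data rows.

    Hypothetical helper: it only verifies the shape described in the text,
    not the meaning of any column.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        raise ValueError("samplesheet is empty; a header row is required")
    for index, row in enumerate(rows):
        if len(row) != expected_columns:
            raise ValueError(
                f"non-empty row {index + 1} has {len(row)} columns, expected {expected_columns}"
            )
    return len(rows) - 1  # the first row is the header


# Placeholder header and values; the real column names are not specified here.
example = "col_a,col_b,col_c,col_d\ns1,x,y,z\ns2,x,y,z\n"
print(check_samplesheet(example, expected_columns=4))
```

The column count is a parameter so that the same check could be reused for other samplesheets with a different number of columns.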
Path to a fasta file with proteins
string
^\S+\.fn?a(sta)?$
Specify a fasta-formatted file with proteins from related organisms. Typical sources are Uniprot, EnsEMBL or Refseq.
Path to a fasta file with proteins
string
^\S+\.fn?a(sta)?$
Specify a fasta-formatted file with proteins from your organism of interest. Typical sources are Uniprot, EnsEMBL or Refseq.
Path to a fasta file with transcripts/ESTs
string
^\S+\.fn?a(sta)?$
Specify a fasta-formatted file with transcripts/ESTs from your organism of interest. Typical sources are ENA and dbEST.
Path to a fasta file with known repeat sequences for this organism
string
^\S+\.fn?a(sta)?$
Specify a fasta-formatted file with repeat sequences for this organism. Typical sources are databases (NCBI, GRINST) or RepeatModeler.
Path to samplesheet for Reference genomes and annotations.
string
^\S+\.csv$
If you wish to include annotations from related species (lift-over), you will need to create a samplesheet in CSV format. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row.
Options that control pipeline behavior
Chunk size for splitting the assembly.
integer
200000000
The assembly will be split into pieces of this size, in bp, to increase parallelization.
Maximum length of expected introns in bp.
integer
This option specifies the longest expected intron in base pairs. Setting this too low will result in broken gene models. Conversely, setting this too high may create unreasonable gene models and increase run time.
Minimum size of contig to consider
integer
5000
Small contigs will typically not add anything to the annotation, but can increase run time or trigger crashes. This value determines the cutoff for contig inclusion.
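To illustrate the effect of this cutoff, here is a minimal sketch. It assumes that contigs with a length equal to or above the cutoff are kept (whether the pipeline treats the boundary as inclusive is an assumption), and the function and contig names are hypothetical:

```python
from typing import Dict


def filter_contigs(lengths: Dict[str, int], min_size: int = 5000) -> Dict[str, int]:
    """Keep contigs whose length is at least min_size (inclusive boundary is an assumption)."""
    return {name: length for name, length in lengths.items() if length >= min_size}


# Hypothetical contig lengths in bp.
contigs = {"contig_1": 1_200_000, "contig_2": 4_999, "contig_3": 5_000}
print(filter_contigs(contigs))
```

With the default of 5000, only `contig_2` is dropped in this example.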
Taxonomic group to guide repeat masking.
string
Use this taxonomic group or species to identify and mask repeats. Valid names can, in most cases, be guessed, and follow the nomenclature provided through the NCBI taxonomy. This option draws from available data included in DFam 3.2, which contains HMM profiles for over 273,000 repeat families from 347 species.
A database of curated repeats in h5 format.
string
https://www.dfam.org/releases/Dfam_3.5/families/Dfam_curatedonly.h5.gz
^\S+\.gz$
This option points to the DFam database (h5 format) of curated repeats for RepeatMasker. By default, the pipeline will get it on-the-fly from the DFam server. You can pre-download the file (.gz) and provide it via this option.
Name of a BUSCO taxonomic group to evaluate the completeness of annotated gene set(s).
string
Use this to provide the name of a BUSCO taxonomic group against which to evaluate the resulting gene builds. Format should be taxgroup_odb10 (i.e. without the date).
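A quick sanity check for the expected name format might look like the sketch below. The check itself is an assumption derived only from the "taxgroup_odb10" format stated above; it does not verify that the lineage actually exists in BUSCO, and the example names are for illustration.

```python
def looks_like_busco_lineage(name: str) -> bool:
    """Return True if name has the form <taxgroup>_odb10 with no date suffix.

    Hypothetical helper: it checks the shape of the name only.
    """
    suffix = "_odb10"
    return name.endswith(suffix) and len(name) > len(suffix) and "." not in name


# Example names; the second carries a date suffix and is therefore rejected.
for candidate in ["mammalia_odb10", "mammalia_odb10.2020-09-10", "mammalia"]:
    print(candidate, looks_like_busco_lineage(candidate))
```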
Path to the local BUSCO data.
string
Use this to provide the path to a local copy of the busco database (usually /path/to/busco_downloads). For details, see the BUSCO documentation.
A placeholder gff file to help trigger certain processes.
string
PIPELINE_BASE/assets/empty.gff3
Options that control gene finding with AUGUSTUS
AUGUSTUS species model to use.
string
Specify which model AUGUSTUS will run with. A full list is available here: https://github.com/Gaius-Augustus/Augustus/blob/master/docs/RUNNING-AUGUSTUS.md
Options to pass to AUGUSTUS.
string
--alternatives-from-evidence=on --minexonintronprob=0.08 --minmeanexonintronprob=0.4 --maxtracks=3
AUGUSTUS has many options that are not specifically available as pipeline options. Instead, you can pass them through this flag.
Location of the AUGUSTUS config directory within the docker container
string
/usr/local/config
This option specifies where to find the AUGUSTUS config directory inside the Docker container. Normally, you should not change this!
A custom config directory for AUGUSTUS
string
Use this to point to a custom AUGUSTUS config directory - for example if you have trained a new model outside of GENOMEANNOTATOR. Must be compatible with AUGUSTUS 3.4.
Custom AUGUSTUS extrinsic config file path
string
Provide a custom extrinsic config file to AUGUSTUS, specifying the weight of different types of evidence. We suggest you start with our built-in base version.
Length of annotation chunks in AUGUSTUS
integer
3000000
This value determines the length of a region worked on by each AUGUSTUS sub process. The overlap between neighboring chunks is 1/6 the chunk length. The default value should be adequate for most scenarios.
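The relationship between chunk length and overlap can be made concrete with a short sketch. It assumes 0-based, half-open coordinates and a simple sliding window whose step is the chunk length minus the overlap; the actual chunking performed by the pipeline or by AUGUSTUS helper scripts may differ in detail, and the function name is hypothetical.

```python
from typing import List, Tuple


def chunk_windows(seq_len: int, chunk_len: int = 3_000_000) -> List[Tuple[int, int]]:
    """Lay out overlapping windows over a sequence (illustrative sketch only).

    The overlap is 1/6 of the chunk length, as stated in the text above.
    Coordinates are 0-based and half-open, which is an assumption.
    """
    if seq_len <= 0:
        return []
    overlap = chunk_len // 6
    step = chunk_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, seq_len)
        windows.append((start, end))
        if end >= seq_len:
            break
        start += step
    return windows


# A hypothetical 7 Mbp sequence with the default chunk length.
print(chunk_windows(7_000_000))
```

With the default of 3,000,000 the overlap is 500,000 bp, so consecutive windows start 2,500,000 bp apart.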
Enable training of a new AUGUSTUS profile.
boolean
This option enables training of a new AUGUSTUS prediction profile. You must provide either a full (!) species-specific proteome via --proteins_targeted or a sufficiently comprehensive set of transcripts/RNA-seq data. When both are provided, proteins will be preferred.
Priority for protein-derived hints for gene building.
integer
3
This value determines the priority protein-derived hints are given during AUGUSTUS gene finding. The higher the value, the more important the hint (1-5).
Priority for targeted protein evidences
integer
5
A value to determine the weight of this type of evidence (1-5). A higher value means this type of evidence is given more consideration.
Priority for transcript evidences
integer
4
A value to determine the weight of this type of evidence (1-5). A higher value means this type of evidence is given more consideration.
Priority for RNAseq splice junction evidences
integer
4
A value to determine the weight of this type of evidence (1-5). A higher value means this type of evidence is given more consideration.
Priority for RNAseq exon coverage evidences
integer
2
A value to determine the weight of this type of evidence (1-5). A higher value means this type of evidence is given more consideration.
Priority for trans-mapped gene model evidences
integer
4
A value to determine the weight of this type of evidence (1-5). A higher value means this type of evidence is given more consideration.
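The default priorities listed above can be collected and ranked as in the following sketch. The dictionary keys are descriptive labels chosen for this example, not the pipeline's actual parameter names (which are not given in this text); the values are the defaults listed above.

```python
from typing import Dict, List

# Default values as listed above; keys are illustrative labels, not parameter names.
DEFAULT_PRIORITIES: Dict[str, int] = {
    "proteins": 3,
    "targeted proteins": 5,
    "transcripts": 4,
    "RNAseq splice junctions": 4,
    "RNAseq exon coverage": 2,
    "trans-mapped gene models": 4,
}


def validate_priorities(priorities: Dict[str, int]) -> None:
    """Raise ValueError if any priority is outside the documented 1-5 range."""
    for label, value in priorities.items():
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"priority for {label!r} must be an integer from 1 to 5")


def rank_evidence(priorities: Dict[str, int]) -> List[str]:
    """Return labels ordered from highest to lowest priority (ties sorted by label)."""
    return sorted(priorities, key=lambda label: (-priorities[label], label))


validate_priorities(DEFAULT_PRIORITIES)
print(rank_evidence(DEFAULT_PRIORITIES))
```

With these defaults, targeted protein evidence ranks highest and RNAseq exon coverage lowest.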
Evidence label for transcriptome data
string
E
A label for a given type of evidence - corresponds to labels in the AUGUSTUS extrinsic config file. Should not be changed.
Evidence label for protein data
string
P
A label for a given type of evidence - corresponds to labels in the AUGUSTUS extrinsic config file. Should not be changed.
Evidence label for RNAseq data
string
E
A label for a given type of evidence - corresponds to labels in the AUGUSTUS extrinsic config file. Should not be changed.
Options that control processing of protein evidences
Taxon model to use for SPALN protein alignments.
string
This option specifies which SPALN alignment model to use. For a full list of available models, see: https://github.com/ogotoh/spaln/blob/master/table/gnm2tab
SPALN custom options.
string
-M
Users can pass custom options to the SPALN alignment process. Normally, this will not be necessary!
SPALN id threshold for aligning.
integer
60
Users can pass a custom id threshold to the SPALN alignment process. Normally, this will not be necessary!
Minimum size of a protein sequence to be included.
integer
35
Protein databases often contain fragmented protein sequences. Use this option to filter out very small proteins from your evidence set.
Number of proteins per alignment job.
integer
200
Specifies the number of proteins per alignment job. This option controls parallelism - the higher this number, the fewer jobs are created and the longer the individual run times. Only increase if you have a very large number of proteins to process. The default value should be fine though.
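The trade-off described above can be estimated with simple arithmetic. This sketch assumes that proteins are split into consecutive batches and that each batch becomes exactly one alignment job; the function name and the protein counts are hypothetical.

```python
import math


def alignment_job_count(n_proteins: int, proteins_per_job: int = 200) -> int:
    """Estimate the number of alignment jobs, assuming one job per batch of proteins."""
    if proteins_per_job < 1:
        raise ValueError("proteins_per_job must be at least 1")
    return math.ceil(n_proteins / proteins_per_job)


# A hypothetical evidence set of 20,000 proteins.
print(alignment_job_count(20_000))        # default batch size of 200
print(alignment_job_count(20_000, 1000))  # larger batches, fewer but longer jobs
```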
Q value for the SPALN alignment algorithm.
integer
5
ID threshold for targeted protein alignments.
integer
90
Options that control the PASA transcriptome annotation pipeline
Number of PASA models to select for AUGUSTUS training.
integer
1000
Built-in config file for PASA.
string
PIPELINE_BASE/assets/pasa/alignAssembly.config
Options that control the EvidenceModeler pipeline
Weights file for EVM.
string
None
Number of EVM jobs per chunk.
integer
10
Options that control individual tool behavior
Activate the trinity assembly sub-pipeline
boolean
Assemble short reads into transcripts using Trinity.
Activate the PASA sub-pipeline
boolean
Assemble into gene models using PASA.
Activate the EvidenceModeler sub-pipeline
boolean
Perform consensus gene building using EvidenceModeler.
Activate search for ncRNAs with RFam/infernal
boolean
Perform prediction of non-coding RNAs using CM profiles from Rfam release 14.
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.
Institutional config name.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string
Set the top limit for requested resources for any single job.
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
128.GB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
^(\d+\.?\s*(s|m|h|day)\s*)+$
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
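The memory and time limits are validated against the patterns listed above. As a sketch (patterns copied from the entries above; helper names and test values are invented for illustration), values can be checked like this before they are passed on the command line:

```python
import re

# Patterns copied verbatim from the schema entries above.
MEMORY_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")
TIME_PATTERN = re.compile(r"^(\d+\.?\s*(s|m|h|day)\s*)+$")


def is_valid_memory(value: str) -> bool:
    """Return True if value satisfies the memory pattern (hypothetical helper)."""
    return MEMORY_PATTERN.match(value) is not None


def is_valid_time(value: str) -> bool:
    """Return True if value satisfies the time pattern (hypothetical helper)."""
    return TIME_PATTERN.match(value) is not None


# Invented example values.
for value in ["128.GB", "8 GB", "128", "128.gb"]:
    print("memory", repr(value), is_valid_memory(value))
for value in ["240.h", "1day 2h", "2 hours", "1.5h"]:
    print("time", repr(value), is_valid_time(value))
```

Note that the memory pattern requires an upper-case unit ending in "B", and the time pattern accepts only whole numbers followed by s, m, h or day.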
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
Method used to save pipeline results to output directory.
string
The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
Send plain-text email instead of HTML.
boolean
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Do not use coloured log outputs.
boolean
Custom config file to supply to MultiQC.
string
Directory to keep pipeline Nextflow logs and reports.
string
${params.outdir}/pipeline_info
Boolean whether to validate parameters against the schema at runtime
boolean
true
Show all params when using --help
boolean
Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter.
boolean