nf-core/fetchngs
Pipeline to fetch metadata and raw FastQ files from public databases
Version history
[1.12.0] - 2024-02-29
⚠️ Major enhancements
- The Aspera CLI was recently added to Bioconda and we have added it as another way of downloading FastQ files in addition to the existing FTP and sra-tools support. In our limited benchmarks on all public Clouds we found ~50% speed-up in download times compared to FTP! FTP downloads will still be the default download method (i.e. `--download_method ftp`) but you can choose to use sra-tools or Aspera using `--download_method sratools` or `--download_method aspera`, respectively. We would love to have your feedback!
- The `--force_sratools_download` parameter has been deprecated in favour of using `--download_method <method>` to explicitly specify the download method; available options are `ftp`, `sratools` or `aspera`.
- Support for Synapse ids has been dropped in this release. We haven't had any feedback from users whether it is being used or not. Users can run earlier versions of the pipeline if required.
- We have significantly refactored and standardised the way we are using nf-test within this pipeline. This pipeline is now the current, best-practice implementation for nf-test usage on nf-core. We required a number of features to be added to nf-test and a huge shoutout to Lukas Forer for entertaining our requests and implementing them within upstream ❤️!
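As a rough illustration of the `--download_method` choice described above, the following sketch assembles a command line for one of the three methods. The helper name `build_fetchngs_args` and the file names are hypothetical, and the sketch only builds the argument list; it is not part of the pipeline.

```python
# Download methods named in the 1.12.0 release notes.
DOWNLOAD_METHODS = ("ftp", "sratools", "aspera")


def build_fetchngs_args(input_file, outdir, download_method="ftp"):
    """Build an argument list for running the pipeline (hypothetical helper).

    "ftp" is used as the default because the release notes state that FTP
    remains the default download method.
    """
    if download_method not in DOWNLOAD_METHODS:
        raise ValueError(
            f"Unknown download method {download_method!r}; "
            f"expected one of {', '.join(DOWNLOAD_METHODS)}"
        )
    return [
        "nextflow", "run", "nf-core/fetchngs",
        "-r", "1.12.0",  # pin the release that introduced --download_method
        "--input", input_file,
        "--outdir", outdir,
        "--download_method", download_method,
    ]


# Example: print the command for an Aspera download (file names are placeholders).
print(" ".join(build_fetchngs_args("ids.csv", "results", "aspera")))
```

The resulting list could be passed to `subprocess.run` if you want to launch the pipeline from a script; rejecting unknown methods up front avoids starting a run with a mistyped value.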
Credits
Special thanks to the following for their contributions to the release:
- Adam Talbot
- Alexandru Mizeranschi
- Alexander Blaessle
- Lukas Forer
- Matt Niederhuber
- Maxime Garcia
- Sateesh Peri
- Sebastian Uhrig
Thank you to everyone else that has contributed by reporting bugs, enhancements or in any other way, shape or form.
Enhancements & fixes
- PR #238 - Resolved bug when prefetching large studies (#236)
- PR #241 - Use wget instead of curl to download files from FTP (#169, #194)
- PR #242 - Template update for nf-core/tools v2.11
- PR #243 - Fixes for PR #238
- PR #245 - Refactor nf-test CI and test and other pre-release fixes (#233)
- PR #246 - Handle dark/light mode for logo in GitHub README properly
- PR #248 - Update pipeline level test data path to use mirror on s3
- PR #249 - Update modules which include absolute paths for test data, making module-level tests compatible within the pipeline.
- PR #253 - Add implicit tags in nf-test files for simpler testing strategy
- PR #257 - Template update for nf-core/tools v2.12
- PR #258 - Fixes for PR #253
- PR #259 - Add Aspera CLI download support to pipeline (#68)
- PR #261 - Revert sratools fasterqdump version (#221)
- PR #262 - Use nf-test version v0.8.4 and remove implicit tags
- PR #263 - Refine tags used for workflows
- PR #264 - Remove synapse workflow from pipeline
- PR #265 - Use "+" syntax for profiles to accumulate profiles in nf-test
- PR #266 - Make .gitignore match template
- PR #268 - Add mermaid diagram
- PR #273 - Update utility subworkflows
- PR #283 - Template update for nf-core/tools v2.13
- PR #288 - Update GitHub Action to run full-sized test for all 3 download methods
- PR #290 - Remove mentions of deprecated Synapse functionality in pipeline
- PR #294 - Replace mermaid diagram with subway map
- PR #295 - Be less stringent with test expectations for CI
- PR #296 - Remove params.outdir from tests where required and update snapshots
- PR #298 - `export CONDA_PREFIX` into container when using Singularity and Apptainer
Software dependencies
Dependency | Old version | New version |
---|---|---|
wget | 1.20.1 |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if new version information isn’t present.
Parameters
Old parameter | New parameter |
---|---|
| | `--download_method` |
`--input_type` | |
`--force_sratools_download` | |
`--synapse_config` | |
NB: Parameter has been updated if both old and new parameter information is present.
NB: Parameter has been added if just the new parameter information is present.
NB: Parameter has been removed if new parameter information isn’t present.
What’s Changed
- remove public_aws_ecr by @maxulysse in #185
- Fix tests by @maxulysse in #187
- Adds emit statement for FASTQs and metadata to SRA workflow by @adamrtalbot in #184
- split up config files to be more modular by @maxulysse in #186
- Move out multiQC and versions by @maxulysse in #189
- tiny refactor by @maxulysse in #190
- Update SRA workflow tests by @maxulysse in #191
- update tests by @maxulysse in #192
- FEAT: add changes by @maxulysse in #193
- Recursively inherit configs by @adamrtalbot in #195
- remove all the nf-test logic from the refactor branch by @maxulysse in #198
- restore nf-test tests by @maxulysse in #200
- forgot this file by @maxulysse in #202
- fix path to file to include and update snapshots by @maxulysse in #203
- Trying out initialise by @maxulysse in #204
- Bump pipeline version to 1.11.0dev by @drpatelh in #211
- nf-test POC by @maxulysse in #201
- Per module/subworkflow tags.yml file by @adamrtalbot in #212
- Remove lib directory and replace with atomic subworkflows by @drpatelh in #213
- update modules and tests + fix linting by @maxulysse in #214
- FIX: custom/dumpsoftwareversions by @maxulysse in #215
- add pipeline level tests by @maxulysse in #216
- Tag and path updates for nf-test files by @drpatelh in #217
- Update workflows tests by @maxulysse in #218
- Update modules + tests by @maxulysse in #219
- Use nf-core nfvalidation subworkflow by @adamrtalbot in #222
- Update nextflowpipelineutils by @adamrtalbot in #224
- Use nf-core subworkflow: NFCORE_PIPELINE_UTILS by @adamrtalbot in #223
- Fix all by @maxulysse in #225
- fix sratools by @maxulysse in #227
- Replace CUSTOM_DUMPSOFTWAREVERSIONS with collectFile operator by @adamrtalbot in #226
- fix sratools and fewer ids by @maxulysse in #228
- Prepare 1.11.0 RC by @maxulysse in #230
- Refactor POC by @maxulysse in #188
- Release candidate 1.11.0 by @maxulysse in #231
Full Changelog: 1.10.1…1.11.0
[1.10.1] - 2023-10-08
Credits
Special thanks to the following for their contributions to the release:
Thank you to everyone else that has contributed by reporting bugs, enhancements or in any other way, shape or form.
Enhancements & fixes
- #173 - Add compatibility for sralite files
- PR #205 - Rename all local modules, workflows and remove `public_aws_ecr` profile
- PR #206 - CI improvements and code cleanup
- PR #208 - Template update with nf-core/tools 2.10
Software dependencies
Dependency | Old version | New version |
---|---|---|
sra-tools | 2.11.0 | 3.0.8 |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if new version information isn’t present.
[1.10.0] - 2023-05-16
Credits
Special thanks to the following for their contributions to the release:
Thank you to everyone else that has contributed by reporting bugs, enhancements or in any other way, shape or form.
Enhancements & fixes
- #85 - Not able to fetch metadata for ERR ids associated with ArrayExpress
- #104 - Add support back in for GEO IDs (removed in v1.7)
- #129 - Pipeline is working with SRA run ids but failing with corresponding Biosample ids
- #138 - Add support for downloading protected dbGAP data using a JWT file
- #144 - Add support to download 10X Genomics data
- PR #140 - Bumped modules version to allow for sratools download of sralite format files
- PR #147 - Updated pipeline template to nf-core/tools 2.8
- PR #148 - Fix default metadata fields for ENA API v2.0
- PR #150 - Add infrastructure and CI for multi-cloud full-sized tests run via Nextflow Tower
- PR #157 - Add `public_aws_ecr.config` to source mulled containers when using `public.ecr.aws` Docker Biocontainer registry
Software dependencies
Dependency | Old version | New version |
---|---|---|
synapseclient | 2.6.0 | 2.7.1 |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if new version information isn’t present.
[1.9] - 2022-12-21
Enhancements & fixes
- Bumped minimum Nextflow version from `21.10.3` -> `22.10.1`
- Updated pipeline template to nf-core/tools 2.7.2
- Added support for generating nf-core/atacseq compatible samplesheets
- Added `--nf_core_rnaseq_strandedness` parameter to specify value for `strandedness` entry added to samplesheet created when using `--nf_core_pipeline rnaseq`. The default is `auto` which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.
[1.8] - 2022-11-08
Enhancements & fixes
- #111 - Change input mimetype to csv
- #114 - Final samplesheet is not created when `--skip_fastq_download` is provided
- #118 - Allow input pattern validation for csv/tsv/txt
- #119 - `--force_sratools_download` results in different fastq names compared to FTP download
- #121 - Add `tower.yml` to render samplesheet as Report in Tower
- Fetch `SRR` and `DRR` metadata from ENA API instead of NCBI API to bypass frequent breaking changes
- Updated pipeline template to nf-core/tools 2.6
[1.7] - 2022-07-01
⚠️ Major enhancements
Support for GEO ids has been dropped in this release due to breaking changes introduced in the NCBI API. For more detailed information please see this PR.
As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline:
- Search for your GEO accession on GEO
- Click `SRA Run Selector` at the bottom of the GEO accession page
- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`
This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.
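Before passing a downloaded accession list to the pipeline, it can be handy to tidy it up. The sketch below assumes the file holds one id per line; the helper name `read_accession_list` is hypothetical, and the whitespace stripping and de-duplication are illustrative choices rather than anything the pipeline itself is documented to do.

```python
def read_accession_list(text):
    """Return unique, non-empty ids from one-per-line text, keeping first-seen order."""
    seen = set()
    ids = []
    for line in text.splitlines():
        accession = line.strip()  # drop surrounding whitespace and blank lines
        if accession and accession not in seen:
            seen.add(accession)
            ids.append(accession)
    return ids


# Example with inline text; in practice the text would be read from SRR_Acc_List.txt.
print(read_accession_list("SRR11605097\n\nSRR11605097\n"))
```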
Enhancements & fixes
[1.6] - 2022-05-17
- #57 - fetchngs fails if FTP is blocked
- #89 - Improve detection and usage of the NCBI user settings by using the standardized sra-tools modules from nf-core.
- #93 - Adjust modules configuration to respect the `publish_dir_mode` parameter.
- [nf-core/rnaseq#764] - Test fails when using GCP due to missing tools in the basic biocontainer
- Updated pipeline template to nf-core/tools 2.4.1
Software dependencies
Dependency | Old version | New version |
---|---|---|
synapseclient | 2.4.0 | 2.6.0 |
[1.5] - 2021-12-01
- Finish porting the pipeline to the updated Nextflow DSL2 syntax adopted on nf-core/modules
- Bump minimum Nextflow version from `21.04.0` -> `21.10.3`
- Removed `--publish_dir_mode` as it is no longer required for the new syntax
[1.4] - 2021-11-09
Enhancements & fixes
- Convert pipeline to updated Nextflow DSL2 syntax for future adoption across nf-core
- Added a workflow to download FastQ files and to create samplesheets for ids from the Synapse platform hosted by Sage Bionetworks.
- SRA identifiers not available for direct download via the ENA FTP will now be downloaded via `sra-tools`.
- Added `--force_sratools_download` parameter to preferentially download all FastQ files via `sra-tools` instead of ENA FTP.
- Correctly handle errors from SRA identifiers that do not return metadata, for example, due to being private.
- Retry an error in prefetch via bash script in order to allow it to resume interrupted downloads.
- Name output FastQ files by `{EXP_ACC}_{RUN_ACC}*fastq.gz` instead of `{EXP_ACC}_{T*}*fastq.gz` for run id provenance
- [#46] - Bug in sra_ids_to_runinfo.py
- Added support for DDBJ ids. See examples below:
DDBJ |
---|
PRJDB4176 |
SAMD00114846 |
DRA008156 |
DRP004793 |
DRR171822 |
DRS090921 |
DRX162434 |
[1.3] - 2021-09-15
Enhancements & fixes
- Replaced Python `requests` with `urllib` to fetch ENA metadata
Software dependencies
Dependency | Old version | New version |
---|---|---|
python | 3.8.3 | 3.9.5 |
[1.2] - 2021-07-28
Enhancements & fixes
- Updated pipeline template to nf-core/tools 2.1
- [#26] - Update broken EBI API URL
[1.1] - 2021-06-22
Enhancements & fixes
[1.0] - 2021-06-08
Initial release of nf-core/fetchngs, created with the nf-core template.
Pipeline summary
Via a single file of ids, provided one-per-line, the pipeline performs the following steps:
- Resolve database ids back to appropriate experiment-level ids and to be compatible with the ENA API
- Fetch extensive id metadata including direct download links to FastQ files via ENA API
- Download FastQ files in parallel via `curl` and perform `md5sum` check
- Collate id metadata and paths to FastQ files in a single samplesheet
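The checksum step in the list above can be illustrated with a small sketch. It assumes the expected MD5 value is already known (for example, from the fetched metadata); the function name `md5_matches` is hypothetical, and this is a Python equivalent of an `md5sum` comparison, not the pipeline's own code.

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1024 * 1024):
    """Return True if the MD5 hex digest of the file at `path` equals `expected_md5`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large FastQ files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.strip().lower()
```

A download wrapper could call `md5_matches` after each file is written and retry or fail the task when it returns `False`.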
Supported database ids
Currently, the following types of example identifiers are supported:
SRA | ENA | GEO |
---|---|---|
SRR11605097 | ERR4007730 | GSM4432381 |
SRX8171613 | ERX4009132 | GSE147507 |
SRS6531847 | ERS4399630 | |
SAMN14689442 | SAMEA6638373 | |
SRP256957 | ERP120836 | |
SRA1068758 | ERA2420837 | |
PRJNA625551 | PRJEB37513 | |
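To make the table above concrete, here is a minimal sketch that guesses the source database from an identifier's prefix. The prefixes are read off the example ids in the table; the function name `classify_id` and the "prefix followed by digits" pattern are assumptions for illustration and may not match the validation the pipeline actually performs.

```python
import re

# Prefixes taken from the example identifiers in the table above (illustrative only).
ID_PATTERNS = {
    "SRA": re.compile(r"^(SRR|SRX|SRS|SRP|SRA|SAMN|PRJNA)\d+$"),
    "ENA": re.compile(r"^(ERR|ERX|ERS|ERP|ERA|SAMEA|PRJEB)\d+$"),
    "GEO": re.compile(r"^(GSM|GSE)\d+$"),
}


def classify_id(identifier):
    """Return "SRA", "ENA" or "GEO" for a recognised id, otherwise None."""
    candidate = identifier.strip()
    for database, pattern in ID_PATTERNS.items():
        if pattern.match(candidate):
            return database
    return None


# Example: check a few ids before writing them to the input file.
for example in ("SRR11605097", "SAMEA6638373", "GSE147507", "not_an_id"):
    print(example, classify_id(example))
```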