nf-core/fetchngs
Pipeline to fetch metadata and raw FastQ files from public and private databases
Version history
What’s Changed
- remove public_aws_ecr by @maxulysse in #185
- Fix tests by @maxulysse in #187
- Adds emit statement for FASTQs and metadata to SRA workflow by @adamrtalbot in #184
- split up config files to be more modular by @maxulysse in #186
- Move out multiQC and versions by @maxulysse in #189
- tiny refactor by @maxulysse in #190
- Update SRA workflow tests by @maxulysse in #191
- update tests by @maxulysse in #192
- FEAT: add changes by @maxulysse in #193
- Recursively inherit configs by @adamrtalbot in #195
- remove all the nf-test logic from the refactor branch by @maxulysse in #198
- restore nf-test tests by @maxulysse in #200
- forgot this file by @maxulysse in #202
- fix path to file to include and update snapshots by @maxulysse in #203
- Trying out initialise by @maxulysse in #204
- Bump pipeline version to 1.11.0dev by @drpatelh in #211
- nf-test POC by @maxulysse in #201
- Per module/subworkflow tags.yml file by @adamrtalbot in #212
- Remove lib directory and replace with atomic subworkflows by @drpatelh in #213
- update modules and tests + fix linting by @maxulysse in #214
- FIX: custom/dumpsoftwareversions by @maxulysse in #215
- add pipeline level tests by @maxulysse in #216
- Tag and path updates for nf-test files by @drpatelh in #217
- Update workflows tests by @maxulysse in #218
- Update modules + tests by @maxulysse in #219
- Use nf-core nfvalidation subworkflow by @adamrtalbot in #222
- Update nextflowpipelineutils by @adamrtalbot in #224
- Use nf-core subworkflow: NFCORE_PIPELINE_UTILS by @adamrtalbot in #223
- Fix all by @maxulysse in #225
- fix sratools by @maxulysse in #227
- Replace CUSTOM_DUMPSOFTWAREVERSIONS with collectFile operator by @adamrtalbot in #226
- fix sratools and fewer ids by @maxulysse in #228
- Prepare 1.11.0 RC by @maxulysse in #230
- Refactor POC by @maxulysse in #188
- Release candidate 1.11.0 by @maxulysse in #231
Full Changelog: 1.10.1…1.11.0
[1.10.1] - 2023-10-08
Credits
Special thanks to the following for their contributions to the release:
Thank you to everyone else that has contributed by reporting bugs, enhancements or in any other way, shape or form.
Enhancements & fixes
- #173 - Add compatibility for sralite files
- PR #205 - Rename all local modules, workflows and remove `public_aws_ecr` profile
- PR #206 - CI improvements and code cleanup
- PR #208 - Template update with nf-core/tools 2.10
Software dependencies
Dependency | Old version | New version |
---|---|---|
sra-tools | 2.11.0 | 3.0.8 |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if new version information isn’t present.
[1.10.0] - 2023-05-16
Credits
Special thanks to the following for their contributions to the release:
Thank you to everyone else that has contributed by reporting bugs, enhancements or in any other way, shape or form.
Enhancements & fixes
- #85 - Not able to fetch metadata for ERR ids associated with ArrayExpress
- #104 - Add support back in for GEO IDs (removed in v1.7)
- #129 - Pipeline is working with SRA run ids but failing with corresponding Biosample ids
- #138 - Add support for downloading protected dbGAP data using a JWT file
- #144 - Add support to download 10X Genomics data
- PR #140 - Bumped modules version to allow for sratools download of sralite format files
- PR #147 - Updated pipeline template to nf-core/tools 2.8
- PR #148 - Fix default metadata fields for ENA API v2.0
- PR #150 - Add infrastructure and CI for multi-cloud full-sized tests run via Nextflow Tower
- PR #157 - Add `public_aws_ecr.config` to source mulled containers when using `public.ecr.aws` Docker Biocontainer registry
Software dependencies
Dependency | Old version | New version |
---|---|---|
synapseclient | 2.6.0 | 2.7.1 |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if new version information isn’t present.
[1.9] - 2022-12-21
Enhancements & fixes
- Bumped minimum Nextflow version from `21.10.3` -> `22.10.1`
- Updated pipeline template to nf-core/tools 2.7.2
- Added support for generating nf-core/atacseq compatible samplesheets
- Added `--nf_core_rnaseq_strandedness` parameter to specify value for `strandedness` entry added to samplesheet created when using `--nf_core_pipeline rnaseq`. The default is `auto`, which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.
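As a rough illustration of what a samplesheet with a `strandedness` entry could look like, the sketch below writes a small CSV in Python. The column set (`sample`, `fastq_1`, `fastq_2`, `strandedness`) and the helper name `write_rnaseq_samplesheet` are assumptions made for this example; the samplesheet produced by the pipeline itself may contain additional columns.

```python
import csv
import io


def write_rnaseq_samplesheet(rows, strandedness="auto"):
    """Return CSV text with one line per sample.

    `rows` is an iterable of (sample, fastq_1, fastq_2) tuples.
    The same strandedness value is written for every sample, mirroring
    a single pipeline-wide setting. Column names are an assumption.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    for sample, fastq_1, fastq_2 in rows:
        writer.writerow([sample, fastq_1, fastq_2, strandedness])
    return buffer.getvalue()


if __name__ == "__main__":
    # File names here are hypothetical placeholders.
    print(write_rnaseq_samplesheet([("sample1", "sample1_1.fastq.gz", "sample1_2.fastq.gz")]))
```

Leaving the value at `auto` defers the decision to the downstream pipeline, which is why it is a sensible default when the library type is not known at download time.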
[1.8] - 2022-11-08
Enhancements & fixes
- #111 - Change input mimetype to csv
- #114 - Final samplesheet is not created when `--skip_fastq_download` is provided
- #118 - Allow input pattern validation for csv/tsv/txt
- #119 - `--force_sratools_download` results in different fastq names compared to FTP download
- #121 - Add `tower.yml` to render samplesheet as Report in Tower
- Fetch `SRR` and `DRR` metadata from ENA API instead of NCBI API to bypass frequent breaking changes
- Updated pipeline template to nf-core/tools 2.6
[1.7] - 2022-07-01
⚠️ Major enhancements
Support for GEO ids has been dropped in this release due to breaking changes introduced in the NCBI API. For more detailed information please see this PR.
As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline:
- Search for your GEO accession on GEO
- Click `SRA Run Selector` at the bottom of the GEO accession page
- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`
This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.
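Before passing a downloaded accession list to the pipeline, it can be useful to check its contents. The following is a minimal sketch, assuming the file holds one id per line; the helper name `read_accession_list` is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def read_accession_list(path):
    """Return unique, non-empty ids from a one-per-line text file, in file order."""
    seen = set()
    ids = []
    for line in Path(path).read_text().splitlines():
        accession = line.strip()
        # Skip blank lines and ids that were already listed.
        if accession and accession not in seen:
            seen.add(accession)
            ids.append(accession)
    return ids
```

For example, `read_accession_list("SRR_Acc_List.txt")` would return the run ids in the order they appear in the file, which makes it easy to count them or spot accidental duplicates.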
Enhancements & fixes
[1.6] - 2022-05-17
- #57 - fetchngs fails if FTP is blocked
- #89 - Improve detection and usage of the NCBI user settings by using the standardized sra-tools modules from nf-core.
- #93 - Adjust modules configuration to respect the `publish_dir_mode` parameter.
- [nf-core/rnaseq#764] - Test fails when using GCP due to missing tools in the basic biocontainer
- Updated pipeline template to nf-core/tools 2.4.1
Software dependencies
Dependency | Old version | New version |
---|---|---|
synapseclient | 2.4.0 | 2.6.0 |
[1.5] - 2021-12-01
- Finish porting the pipeline to the updated Nextflow DSL2 syntax adopted on nf-core/modules
- Bump minimum Nextflow version from `21.04.0` -> `21.10.3`
- Removed `--publish_dir_mode` as it is no longer required for the new syntax
[1.4] - 2021-11-09
Enhancements & fixes
- Convert pipeline to updated Nextflow DSL2 syntax for future adoption across nf-core
- Added a workflow to download FastQ files and to create samplesheets for ids from the Synapse platform hosted by Sage Bionetworks.
- SRA identifiers not available for direct download via the ENA FTP will now be downloaded via `sra-tools`.
- Added `--force_sratools_download` parameter to preferentially download all FastQ files via `sra-tools` instead of ENA FTP.
- Correctly handle errors from SRA identifiers that do not return metadata, for example, due to being private.
- Retry an error in prefetch via bash script in order to allow it to resume interrupted downloads.
- Name output FastQ files by `{EXP_ACC}_{RUN_ACC}*fastq.gz` instead of `{EXP_ACC}_{T*}*fastq.gz` for run id provenance
- [#46] - Bug in sra_ids_to_runinfo.py
- Added support for DDBJ ids. See examples below:
DDBJ |
---|
PRJDB4176 |
SAMD00114846 |
DRA008156 |
DRP004793 |
DRR171822 |
DRS090921 |
DRX162434 |
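The `{EXP_ACC}_{RUN_ACC}*fastq.gz` naming pattern mentioned in this release can be sketched as follows. The `_1`/`_2` read suffixes for paired-end data and the function name `fastq_names` are assumptions for illustration; only the `{EXP_ACC}_{RUN_ACC}` prefix and the `fastq.gz` ending come from the changelog entry.

```python
from fnmatch import fnmatch


def fastq_names(exp_acc, run_acc, paired):
    """Build output FastQ names that start with the experiment and run accessions."""
    prefix = f"{exp_acc}_{run_acc}"
    if paired:
        # Assumed convention: one file per read mate.
        return [f"{prefix}_1.fastq.gz", f"{prefix}_2.fastq.gz"]
    return [f"{prefix}.fastq.gz"]


def matches_pattern(name, exp_acc, run_acc):
    """Check a file name against the {EXP_ACC}_{RUN_ACC}*fastq.gz glob."""
    return fnmatch(name, f"{exp_acc}_{run_acc}*fastq.gz")
```

Keeping the run accession in the file name means each FastQ file can be traced back to a specific run even when several runs belong to the same experiment.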
[1.3] - 2021-09-15
Enhancements & fixes
- Replaced Python `requests` with `urllib` to fetch ENA metadata
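For readers unfamiliar with the standard-library approach, here is a minimal sketch of fetching run metadata with `urllib` rather than `requests`. The endpoint URL, the `result=read_run` query value and the chosen field names are assumptions for this example and are not taken from the pipeline's own script; the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed ENA Portal API endpoint; verify against the current ENA documentation.
ENA_FILEREPORT_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_ena_url(accession, fields=("run_accession", "fastq_ftp", "fastq_md5")):
    """Compose a file report query URL for a single accession."""
    query = urlencode(
        {"accession": accession, "result": "read_run", "fields": ",".join(fields)}
    )
    return f"{ENA_FILEREPORT_URL}?{query}"


def fetch_ena_metadata(accession, timeout=30):
    """Download the tab-separated metadata text for an accession (needs network access)."""
    with urlopen(build_ena_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Relying on `urllib` removes a third-party dependency, at the cost of handling encoding and errors by hand.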
Software dependencies
Dependency | Old version | New version |
---|---|---|
python | 3.8.3 | 3.9.5 |
[1.2] - 2021-07-28
Enhancements & fixes
- Updated pipeline template to nf-core/tools 2.1
- [#26] - Update broken EBI API URL
[1.1] - 2021-06-22
Enhancements & fixes
[1.0] - 2021-06-08
Initial release of nf-core/fetchngs, created with the nf-core template.
Pipeline summary
Given a single file of ids, provided one per line, the pipeline performs the following steps:
- Resolve database ids back to the appropriate experiment-level ids so that they are compatible with the ENA API
- Fetch extensive id metadata including direct download links to FastQ files via ENA API
- Download FastQ files in parallel via `curl` and perform `md5sum` check
- Collate id metadata and paths to FastQ files in a single samplesheet
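The checksum step listed above can be sketched in Python as shown below. This is an illustration of the general technique (comparing a file's MD5 digest against an expected value), not the pipeline's implementation, which uses `md5sum`; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True when the file's digest equals the expected checksum."""
    return md5_of_file(path) == expected.strip().lower()
```

A mismatch usually indicates a truncated or corrupted download, in which case the file should be fetched again rather than used downstream.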
Supported database ids
Currently, the following types of identifiers are supported (examples shown):
SRA | ENA | GEO |
---|---|---|
SRR11605097 | ERR4007730 | GSM4432381 |
SRX8171613 | ERX4009132 | GSE147507 |
SRS6531847 | ERS4399630 | |
SAMN14689442 | SAMEA6638373 | |
SRP256957 | ERP120836 | |
SRA1068758 | ERA2420837 | |
PRJNA625551 | PRJEB37513 | |
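The table above suggests that the source database can be guessed from the identifier prefix. The sketch below encodes that idea with regular expressions inferred solely from the example ids in the table; the patterns and the `classify_id` helper are assumptions and may not match the validation rules the pipeline actually applies.

```python
import re

# Prefix patterns inferred from the example identifiers in the table.
ID_PATTERNS = {
    "SRA": re.compile(r"^(SR[RXSPA]\d+|SAMN\d+|PRJNA\d+)$"),
    "ENA": re.compile(r"^(ER[RXSPA]\d+|SAMEA\d+|PRJEB\d+)$"),
    "GEO": re.compile(r"^GS[EM]\d+$"),
}


def classify_id(identifier):
    """Return 'SRA', 'ENA' or 'GEO' for a recognised id, otherwise None."""
    for database, pattern in ID_PATTERNS.items():
        if pattern.match(identifier):
            return database
    return None
```

For instance, `classify_id("GSE147507")` yields `"GEO"`, while an unrecognised string yields `None`, which a caller could treat as a validation error.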