
I wanted to download FASTQ files associated with a particular BioProject (PRJEB21446) from the European Nucleotide Archive. There is a button to generate and download a shell script containing wget commands for all FASTQ files associated with the BioProject. Great! That gives me a script with the following commands:

wget -nc [ftp-link-to-sample1.fastq.gz]
wget -nc [ftp-link-to-sample2.fastq.gz]
...
wget -nc [ftp-link-to-sample40.fastq.gz]

EDIT: Here are the first 5 lines of the script from ENA:

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/004/ERR2014384/ERR2014384_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/006/ERR2014386/ERR2014386_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/001/ERR2014361/ERR2014361_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/009/ERR2014369/ERR2014369_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/007/ERR2014367/ERR2014367_1.fastq.gz

However, when I tried to run the script with sh script_from_ENA.sh, the first file downloaded without any problems, but every file after that got stuck at 0% for about 20 seconds and then showed the following:

2023-08-14 10:54:01 (0.00 B/s) - Data transfer aborted.
Retrying.

wget then attempts to download the same file over and over again with no success.

After spending all morning trying various workarounds, I eventually solved the problem by putting all the URLs into a single file and running wget in a for loop, like so:

sed 's/wget -nc //' script_from_ENA.sh > url-list
for i in `cat url-list` ; do wget -nc $i ; done
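
(For the record, a more defensive version of that loop would read the file line by line and quote the URL, so the shell can't word-split or expand anything in it. The quick version above is what I actually ran.)

# Read one URL per line; -r keeps backslashes literal, and quoting "$url"
# prevents word splitting and glob expansion.
while IFS= read -r url ; do wget -nc "$url" ; done < url-list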

The loop worked like a charm and the files downloaded without any problems, but I'm still curious why the script generated by ENA didn't work. Was it an issue with wget, or were the ENA servers cutting me off?

If anyone can offer insight or an explanation, I'd be very grateful. Thanks!

Sj1993
  • Add the first 5-10 lines of `script_from_ENA.sh` to your question. – Cyrus Aug 14 '23 at 17:36
  • ... that is, *verbatim*. With complete URLs. – John Bollinger Aug 14 '23 at 17:47
  • There is no particular reason in the shell or in `wget` why running multiple `wget` commands via a loop should have different behavior than running the same commands individually. However, depending on the form of the URLs, it might be that the shell interprets the lines of the file differently one way than it does the other. – John Bollinger Aug 14 '23 at 17:52
  • There's probably one or more URLs with a metacharacter like `$'"` that gets interpreted when the script is run, causing wget to see an invalid URL and to keep retrying as the server keeps rejecting it. Reading from a file means the metacharacters aren't interpreted, circumventing the problem. – that other guy Aug 14 '23 at 18:17
  • Appreciate all the comments: added the first five wget commands to the question! Copied and pasted straight from the file from ENA – Sj1993 Aug 14 '23 at 18:25
  • I had no problems downloading all 5 files with 108 MBytes/s. – Cyrus Aug 14 '23 at 18:42
  • I wonder if it's an issue on my end... I tried it on a server and local machine, but same result. In any case, I've managed to get the data one way or another! Thanks for the input, all. – Sj1993 Aug 14 '23 at 18:43
  • You are welcome. I suggest you delete this question, because it has nothing to do with software development. – Cyrus Aug 14 '23 at 18:50
  • It occurs to me that it might be a problem with how your router handles FTP connections. FTP's an ancient protocol, from before they really knew how to make a solid protocol. It uses separate TCP connections for control vs data, in ways that tend to cause trouble with firewalls and/or NAT routing. It may be your router will put up with a single FTP connection, but with multiple simultaneous connections it drops some of them. – Gordon Davisson Aug 15 '23 at 03:25
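
Following up on the comments about metacharacters and FTP handling, a few quick checks and workarounds (sketches only; the options are standard wget and coreutils features, but none of this is guaranteed to pinpoint the cause):

# Reveal non-printing characters in the ENA script (e.g. Windows carriage
# returns show up as ^M):
cat -A script_from_ENA.sh | head

# Flag lines containing characters the shell treats specially:
grep -n "[\"'\$\`]" script_from_ENA.sh

# If the router/firewall is the suspect, limit the retries and try toggling
# FTP passive mode (wget defaults to passive; --no-passive-ftp switches to active):
wget -nc --tries=3 --waitretry=10 --no-passive-ftp ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/004/ERR2014384/ERR2014384_1.fastq.gz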

1 Answer


Note that if you have a list of URLs you do not need to do

sed 's/wget -nc //' script_from_ENA.sh > url-list
for i in `cat url-list` ; do wget -nc $i ; done

as wget has an option for exactly that case, namely -i file or --input-file=file, which the wget man page describes as:

Read URLs from a local or external file.

In your case, if you have urls.txt like so:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/004/ERR2014384/ERR2014384_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/006/ERR2014386/ERR2014386_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/001/ERR2014361/ERR2014361_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/009/ERR2014369/ERR2014369_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR201/007/ERR2014367/ERR2014367_1.fastq.gz

you could just do

wget -i urls.txt
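
You would not even need the intermediate file: wget accepts - as the input file, meaning the URLs are read from standard input, so (assuming the ENA script contains only wget -nc lines) you could pipe the stripped URLs straight in:

# Strip the "wget -nc " prefix and feed the URLs to wget on standard input:
sed 's/^wget -nc //' script_from_ENA.sh | wget -nc -i -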
Daweo