import wget
with open('downloadhlt.txt') as file:
    urls = file.read()
    for line in urls.split('\n'):
        wget.download(line, 'localfolder')

For some reason the post formatting wouldn't work, so I put the code above. What I'm trying to do: I have a text file with ~2 million lines like these.

http://halitereplaybucket.s3.amazonaws.com/1475594084-2235734685.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594100-2251426701.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594119-2270812773.hlt

I want to grab each line and request it so that more than 10 downloads run at once. What I currently have downloads one item at a time, which is very time-consuming.

I tried looking at Ways to read/edit multiple lines in python, but that iteration seems to be for editing, while mine is for multiple concurrent executions of wget.

I have not tried other methods simply because this is the first time I have ever needed to make over 2 million download calls.

M R
  • Hi, welcome to Stack Overflow. Can you please explain what the error or problem is? Why does it not work? –  May 05 '20 at 01:40
  • @Kos It works in the sense that the code executes one line at a time, but there are 2 million lines; if they executed at one per second it would take ~23 days to complete. What I'm asking is: how can I speed this up? – M R May 05 '20 at 02:13

1 Answer


This should work fine. I'm a total newbie, so I can't really advise you on the number of threads to start, lol. These are my 2 cents anyway; hope they somehow help.

I tried timing yours and mine over 27 downloads:

(base) MBPdiFrancesco:stack francesco$ python3 old.py
Elapsed Time: 14.542160034179688
(base) MBPdiFrancesco:stack francesco$ python3 new.py
Elapsed Time: 1.9618661403656006

And here is the code; you have to create a "downloads" folder first:

import wget
from multiprocessing.pool import ThreadPool
from time import time as timer


s = timer()
thread_num = 8


def download(url):
    try:
        wget.download(url, 'downloads/')
    except Exception as e:
        print(e)


if __name__ == "__main__":
    with open('downloadhlt.txt') as file:
        # skip blank lines so wget is never called with an empty URL
        urls = [line for line in file.read().split("\n") if line.strip()]
    # imap_unordered yields results as each download finishes,
    # regardless of submission order
    results = ThreadPool(thread_num).imap_unordered(download, urls)
    c = 0
    for i in results:
        c += 1
        print("Downloaded {} file{} so far".format(c, "" if c == 1 else "s"))
    print("Elapsed Time: {} seconds\nDownloaded {} files".format(timer() - s, c))
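The same pattern can also be sketched with the standard-library concurrent.futures module, which makes the pool size an explicit parameter. This is just an illustration, not tested at 2-million-URL scale: the fetch helper below only returns the URL so the structure runs without network access; in the real script it would call wget.download(url, 'downloads/'), and THREADS is a placeholder you would tune to your bandwidth.

```python
from concurrent.futures import ThreadPoolExecutor

THREADS = 8  # assumed starting point; raise gradually rather than jumping to 1000


def fetch(url):
    # stand-in for wget.download(url, 'downloads/'); returning the URL
    # lets the example run without touching the network
    return url


def download_all(urls, threads=THREADS):
    # drop the blank lines that file.read().split("\n") leaves behind
    urls = [u for u in urls if u.strip()]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order and blocks until all work is done
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    lines = ["http://example.com/a.hlt", "", "http://example.com/b.hlt"]
    print(download_all(lines))
```

With I/O-bound work like downloads, more threads than CPU cores is fine, but a pool in the low hundreds usually saturates a home connection before a thousand threads would.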
    Great answer, this is amazing! It works great. I pushed it to 1000 threads because I have no idea whether that will crash it or not, but it's really pushing the download speeds of my internet connection! – M R May 05 '20 at 06:39