
I was trying to download 1700 company stock-market datasets from Yahoo Finance into CSV files, and they are being stored successfully. I did it with a while loop that runs 1700 times, and it took more than 2 hours. Can I use parallel programming in Python to save time and do this in parallel?

import datetime

import pandas as pd
import pandas_datareader as web

start = datetime.datetime(2019, 1, 1)  # example date range
end = datetime.datetime(2020, 1, 1)

count = 0
while count < 1700:
    # download one dataset, write it to a CSV file, and read it back
    df = web.DataReader("TCS.NS", 'yahoo', start, end)
    df.to_csv('csv_file.csv')
    df = pd.read_csv('csv_file.csv')
    # ... various operations, then store in MySQL ...
    count += 1

I am also performing various operations on the data and storing it in a MySQL database inside the while loop. Please help me with this problem.

user3666197

3 Answers


You can use threads:

from threading import Thread
import datetime

import pandas as pd
import pandas_datareader as web

start = datetime.datetime(2019, 1, 1)  # example date range
end = datetime.datetime(2020, 1, 1)

def process_data(count):
    # download one dataset in its own thread
    df = web.DataReader("TCS.NS", 'yahoo', start, end)
    # give each thread its own file; sharing one file name across
    # threads would make them overwrite each other's data
    df.to_csv(f'csv_file_{count}.csv')
    df = pd.read_csv(f'csv_file_{count}.csv')
    ...

for count in range(1700):
    Thread(target=process_data, args=(count,)).start()
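One caveat: starting 1700 threads at once can exhaust memory and trip Yahoo's rate limits. A bounded pool from the standard library does the same work with a fixed number of workers; here is a minimal sketch reusing the `process_data` function above (the worker count of 20 is an assumption to tune):

from concurrent.futures import ThreadPoolExecutor

# run at most 20 downloads at a time; list() forces completion and
# re-raises any exception from a worker (20 is a guess, tune it)
with ThreadPoolExecutor(max_workers=20) as executor:
    list(executor.map(process_data, range(1700)))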
Amin Guermazi

You can use multithreaded programming to achieve this. The basic idea is to create multiple threads, each downloading a subset of the whole dataset. For example, you can create 17 threads, and each of those threads will download 100 datasets, as sketched below.
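A rough sketch of that idea; the `download_one` helper and the placeholder `symbols` list are hypothetical stand-ins for your actual per-company download code and ticker names:

from threading import Thread

symbols = [f"SYM{i}.NS" for i in range(1700)]  # placeholder ticker list

def download_one(symbol):
    ...  # hypothetical: fetch one symbol's data and write its CSV

def worker(chunk):
    # each thread processes its own slice of the ticker list
    for symbol in chunk:
        download_one(symbol)

chunk_size = 100  # 1700 symbols split across 17 threads
threads = [Thread(target=worker, args=(symbols[i:i + chunk_size],))
           for i in range(0, len(symbols), chunk_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()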

Here's a good article on multithreading in Python.

Moosa Saadat

There are multiple ways to accomplish this, but to avoid the Global Interpreter Lock (GIL) you'll want to ensure you're using either multiprocessing or something like async I/O.

In Python, multithreading still only allows one thread to execute Python bytecode at a time. With multiprocessing, you can actually spawn multiple processes in parallel. I would recommend something like a multiprocessing pool; it's fairly straightforward to get started with.

Also, to be safe, I would move the database write OUTSIDE of the loop where you're spawning the file retrievals. You'll likely want to avoid multiple concurrent writes to the database while you're pulling that data, unless you know how to do so safely.
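A minimal sketch of that shape using multiprocessing.Pool; `fetch_one` and `save_to_mysql` are hypothetical stand-ins for the download and database steps:

from datetime import datetime
from multiprocessing import Pool

import pandas_datareader as web

START, END = datetime(2019, 1, 1), datetime(2020, 1, 1)  # example range
SYMBOLS = ["TCS.NS", "INFY.NS"]  # stand-in for the full 1700-ticker list

def fetch_one(symbol):
    # runs in a worker process, so downloads happen in parallel
    return symbol, web.DataReader(symbol, 'yahoo', START, END)

def save_to_mysql(symbol, df):
    ...  # hypothetical database step, kept out of the parallel section

if __name__ == '__main__':
    with Pool(processes=8) as pool:  # worker count is a guess; tune it
        results = pool.map(fetch_one, SYMBOLS)
    # single writer: store everything in MySQL after all downloads finish
    for symbol, df in results:
        save_to_mysql(symbol, df)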

Eddie
  • `asyncio` is also restricted by the GIL. Regardless, the GIL is not an issue for I/O-bound tasks; network requests are usually I/O-bound due to network latency. – MisterMiyagi Jun 13 '20 at 17:55
  • @AminGuermazi Due to the [GIL](https://docs.python.org/3/glossary.html#term-global-interpreter-lock), common Python implementations execute threads *concurrently*, not *in parallel*. Only a select few built-in operations, such as I/O, can be performed in parallel by threads. – MisterMiyagi Jun 13 '20 at 18:04
  • Thank you for sharing your knowledge @MisterMiyagi. I follow your responses often and continue to absorb what I can. I appreciate the added insight and constructive feedback. – Eddie Jun 14 '20 at 03:15