
I have the following query that returns data between a date range.

dates = ['20100101', '20100201', '20100301', '20100401']

query = "select date, company_name, total_amount from info_stats where date between '{start_date}' and '{end_date}'"

I obtain the date ranges from another process and construct a list so that I can iterate over it as follows:

pds = []
for idx in range(len(dates) - 1):
    # dates already holds '%Y%m%d' strings, so they can be passed straight through
    formated_query = self.get_formated_query(start_date=dates[idx],
                                              end_date=dates[idx + 1])
    pds.append(pd.read_sql(sql=formated_query, con=db_connect))

To the query I pass the date at the current index as the start and the date at index + 1 (the next, greater date) as the end.

These queries take a very long time and I want to run them in parallel so that the wait time is shorter. I went over joblib but I'm not sure whether it does multi-threading or multi-processing; it looks like the former. I'm also new to joblib. How can I parallelize the above code using joblib or another package?

add-semi-colons
    Did you see https://stackoverflow.com/questions/32136276/parallelizing-pandas-pyodbc-sql-database-calls? – balderman Apr 10 '19 at 11:02

1 Answer


The question is quite broad, but I can share my own experience with parallelising queries against databases.

What I found is that if I have many small jobs, I can use Python's built-in multithreading modules such as concurrent.futures and get speed-ups.
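
For example, here is a minimal sketch (assuming the dates list, db_connect connection and info_stats table from your question, plus a hypothetical fetch_range helper) that uses concurrent.futures.ThreadPoolExecutor to run the per-range queries concurrently:

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def fetch_range(start_date, end_date):
    # Table and column names come from the question; the f-string is only
    # for illustration, prefer parameterised queries in real code.
    query = (
        "select date, company_name, total_amount from info_stats "
        f"where date between '{start_date}' and '{end_date}'"
    )
    return pd.read_sql(sql=query, con=db_connect)

# Threads fit here because pd.read_sql mostly waits on the database (I/O bound).
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fetch_range, dates[idx], dates[idx + 1])
               for idx in range(len(dates) - 1)]
    pds = [f.result() for f in futures]

Note that, depending on the database driver, a single connection may not be thread-safe, so you may need to open one connection per thread.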

However, if I have big jobs that take a long time to run on the database, parallelising does not help. This is because the database engine itself (in my case SQL Server) already does a splendid job of parallelising the work. In that case, a single big job already maximises the number of processes the server can handle, so submitting more jobs won't help. Your situation seems to be this one.

Matthias