I have been experimenting with multithreading using the threading library, creating a separate thread for each of several functions. Each function takes a pandas DataFrame as its argument, runs an SQL query against AWS Redshift, and adds the retrieved data to the DataFrame as a new column. However, sometimes one of the columns is empty when I print the DataFrame after the threads have finished. This seems random: at other times all of the columns are added without any issues. I thought the purpose of .join() was to prevent this by waiting until each thread had finished before continuing, but that does not seem to be the case.

import pandas as pd
import threading

df = pd.DataFrame()

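def redshift_query1(df):
    # run query and store the output in query_results
    df["column_name1"] = query_results

def redshift_query2(df):
    # run query and store the output in query_results
    df["column_name2"] = query_results

def redshift_query3(df):
    # run query and store the output in query_results
    df["column_name3"] = query_results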

t1 = threading.Thread(target=redshift_query1, args = [df])
t2 = threading.Thread(target=redshift_query2, args = [df])
t3 = threading.Thread(target=redshift_query3, args = [df])

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()

print(df)

rdol500

1 Answer

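pandas is not thread-safe, so having several threads add columns to the same DataFrame at the same time can give inconsistent results. In CPython, however, a single operation on a built-in type, such as assigning to a dict key, is atomic. So you can have each thread store its result under its own key in a dict, then create the DataFrame once all the threads have been joined.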

import pandas as pd
import threading

result = {}

# Each worker writes only to its own key in the shared dict;
# the literal lists stand in for the query results.
def redshift_query1():
    result["column_name1"] = [3]

def redshift_query2():
    result["column_name2"] = [2]

def redshift_query3():
    result["column_name3"] = [1]

# No DataFrame is passed in: it does not exist until the threads are done.
t1 = threading.Thread(target=redshift_query1)
t2 = threading.Thread(target=redshift_query2)
t3 = threading.Thread(target=redshift_query3)

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()

df = pd.DataFrame(result)
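
If you would rather not share a dict at all, the same idea can probably be written with concurrent.futures, where each worker returns its data and only the main thread touches pandas. This is just a minimal sketch: fake_query and the jobs list are hypothetical stand-ins for your Redshift queries, not anything from your code.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fake_query(column_name, values):
    """Hypothetical stand-in for a Redshift query: returns (column name, data)."""
    return column_name, list(values)


# Hypothetical jobs, one per query: (column name, data the query would return).
jobs = [
    ("column_name1", [3]),
    ("column_name2", [2]),
    ("column_name3", [1]),
]

with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    futures = [pool.submit(fake_query, name, values) for name, values in jobs]
    # result() blocks until that worker has finished and re-raises any
    # exception it hit, so a failed query is not silently lost.
    columns = dict(future.result() for future in futures)

# Only the main thread builds the DataFrame.
df = pd.DataFrame(columns)
print(df)
```

Collecting the results in submission order keeps the column order predictable, and no state is shared between the workers.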
hasanyaman