Python Multithreading Implementation of to_datetime Function

Question

I need to implement multithreading to a Python job.

I have a dictionary and each key in that dictionary (out of about 40) is a timestamped pandas dataframe. Most of the dataframes have 100,000+ rows. Their timestamps are strings in "%Y-%m-%d %H:%M:%S" format.

To convert the timestamped strings I use the following function:

def to_dt(df):
    df['timestamp'] = df['timestamp'].map(lambda n: pd.to_datetime(n, format='%Y-%m-%d %H:%M:%S'))
    return df

So I would like to put each process to_dt(df) in a separate thread. How can I do that?

To simplify let's consider we have the following setup:

def to_dt(df):
    df['timestamp'] = df['timestamp'].map(lambda n: pd.to_datetime(n, format='%Y-%m-%d %H:%M:%S'))
    return df

# empty dictionary
d_test = {}

# dataframe with single string timestamp column
df = pd.DataFrame(columns=['st_dt'])

# populate dataframe with 1000 timestamp rows
for i in range(1000):
    df.loc[len(df)] = ['2018-10-02 10:00:00']

# add 20 instances of the dataframe to the dictionary with keys in format "a0" to 'a19'
for i in range(20):
    d_test['a'+str(i)] = df

Now how can we make each iteration of

for i in range(20): to_dt(d_test['a'+str(i)])

to run in a separate thread?

With the presence of [GIL](https://wiki.python.org/moin/GlobalInterpreterLock), multi-threading like this will make your process much slower. — Bubble Bubble Bubble Gut, Oct 02 '18 at 15:14
Basically you are saying I am better off without threading in that case: did I get that right? — Marin Draganov, Oct 02 '18 at 15:31

score 1 · Accepted Answer · answered Oct 02 '18 at 18:33

Due to the existence of GIL, one and only one thread is running at any time in Python, so multithreading in this case would only make the performance worse.

In order to use multiple cores, you need multiprocessing instead of multithreading, but the heavy overhead of spawning a new process will surely overtake the benefit, so it's better to use a single pd.to_datetime in your case.

Also this post explains GIL quite well.

score 0 · Answer 2 · answered Oct 02 '18 at 16:45

Just read that threading in Python should only be used when there is some kind of waiting during the execution of a process, for instance when connecting to a remote server or when port scanning, etc.

In the above case there isn't any waiting so no threadig is needed.

Python Multithreading Implementation of to_datetime Function

2 Answers2