
I have a dataframe of paths. The task is to get the last modification time for each folder, using something like datetime.fromtimestamp(os.path.getmtime('PATH_HERE')), and put it into a separate column.

import pandas as pd
import numpy as np
import os
from datetime import datetime


df1 = pd.DataFrame({'Path' : ['C:\\Path1' ,'C:\\Path2', 'C:\\Path3']})

# For an MVCE use the commented-out code below. WARNING: this WILL create directories on your machine.
#for path in df1['Path']:
#    os.mkdir(os.path.join(r'PUT_YOUR_PATH_HERE', os.path.basename(path)))

I can do the task with the loop below, but it is slow when I have many folders:

for each_path in df1['Path']:
    df1.loc[df1['Path'] == each_path, 'Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(each_path))

How would I go about vectorizing this process to improve speed? os.path.getmtime cannot accept a Series. I'm looking for something like:

df1['Last Modification Time'] = datetime.fromtimestamp(os.path.getmtime(df1['Path']))

MattR

2 Answers


I'm going to present 3 approaches, benchmarked on 100 paths. Approach 3 is strictly preferable, I think.

import os
from datetime import datetime

import pandas as pd

# Data initialisation: 100 placeholder paths (substitute real, existing paths).
paths100 = ['a_whatever_path_here'] * 100
df = pd.DataFrame(columns=['paths', 'time'])
df['paths'] = paths100


def fun1():
    # Naive for loop. High readability, slow.
    for path in df['paths']:
        mask = df['paths'] == path
        df.loc[mask, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))


def fun2():
    # Naive for loop optimised. Medium readability, medium speed.
    for i, path in enumerate(df['paths']):
        df.loc[i, 'time'] = datetime.fromtimestamp(os.path.getmtime(path))


def fun3():
    # List comprehension. High readability, high speed.
    df['time'] = [datetime.fromtimestamp(os.path.getmtime(path)) for path in df['paths']]


%timeit fun1()
>>> 164 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit fun2()
>>> 11.6 ms ± 67.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit fun3()
>>> 13.1 ns ± 0.0327 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
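
For comparison, an element-wise Series.map version would look roughly like the sketch below (not benchmarked above; it does the same per-path work as the list comprehension, just spelled as a method chain):

import os
from datetime import datetime

# df is the frame built above; assumes every entry in df['paths'] exists on disk.
df['time'] = df['paths'].map(os.path.getmtime).map(datetime.fromtimestamp)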
MLguy
  • #3 works for me. It is the fastest in my testing as well – MattR Oct 19 '17 at 20:09
  • What's interesting is that I used this same type of logic to test other functions of mine that are similar to this question. #3 is only faster for *this specific* scenario; the apply method that @Bharath shetty mentioned in the comments is the fastest in other scenarios – MattR Oct 19 '17 at 20:24

You can use a groupby transform (so that you are doing the expensive call only once per group):

g = df1.groupby("Path")["Path"]
s = pd.to_datetime(g.transform(lambda x: os.path.getmtime(x.name)))
df1["Last Modification Time"] = s  # putting this on two lines so it looks nicer...
Andy Hayden
  • I won't have duplicate paths, but this is certainly handy for other code issues. This method is also slower than the normal lambda expression. As a side note: add `datetime.fromtimestamp()` around `os.path.getmtime`, or else the values will be incorrect – MattR Oct 19 '17 at 15:57
  • @AndyHayden since OP has a unique path for every folder, I don't think groupby is necessary. – Bharath M Shetty Oct 19 '17 at 16:16