Vectorized method to format a column of integers into specified-length strings in both pandas dataframe and dask dataframe

Question

I have a pandas DataFrame:

       date    time               user_id
0  20160921    5947  13079492369730773513
1  20160921    5948  13079492369730773513
2  20160921  235949  13079492369730773513
3  20160921  235950  13079492369730773513
4  20160921  235951  13079492369730773513

I want to format the 'time' column into:

       date    time               user_id
0  20160921  005947  13079492369730773513
1  20160921  005948  13079492369730773513
2  20160921  235949  13079492369730773513
3  20160921  235950  13079492369730773513
4  20160921  235951  13079492369730773513

I know the list comprehension way:

df['time'] = ["%06d" % t for t in df['time'].tolist()]

Is there a vectorized way to do the same thing? And how would I do this with a Dask DataFrame?

Graipher · Answer 1 · 2018-03-12T14:36:03.433

Yes, there is a vectorized method to do the same thing. You can cast the column to strings and then use string methods on it:

df.time.astype(str).str.zfill(6)
0    005947
1    005948
2    235949
3    235950
4    235951

Afterwards assign it back:

df.time = df.time.astype(str).str.zfill(6)

This assumes that no time value has more than 6 digits: zfill only pads, so longer strings are left unchanged rather than truncated.
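As a self-contained illustration of the cast-then-pad approach above, here is a minimal sketch. The small frame is made up for the example, and its last value is deliberately 7 digits wide to show that zfill pads but never truncates:

```python
import pandas as pd

# Hypothetical sample data; the last 'time' value is deliberately 7 digits long.
df = pd.DataFrame({
    "date": [20160921, 20160921, 20160921],
    "time": [5947, 235949, 1235949],
})

# Cast the integers to strings, then left-pad with zeros to a minimum width of 6.
df["time"] = df["time"].astype(str).str.zfill(6)

print(df["time"].tolist())
# ['005947', '235949', '1235949']
```

The 7-digit value comes through untouched, so if over-long values are possible in your data you would need to check for them separately.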

Unfortunately, in my timings this is a lot slower than the list comprehension:

In [5]: %timeit df.time.astype(str).str.zfill(6)
228 µs ± 4.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit ["%06d" % t for t in df['time'].tolist()]
17.5 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
