I recently learned about modin, and am trying to convert some of my code from pandas to modin. My understanding is that modin has some operations that run faster and others that it has not optimized, so it defaults to pandas for those. Thus anything that runs in pandas should run in modin, but this does not seem to be the case.
The following code works as intended in pandas, but raises an error in modin:
#import modin.pandas as pd
import pandas as pd
dates = pd.date_range('20180101',periods=6)
pid=pd.Series(list(range(6)))
strings=pd.Series(['asdfjkl;','qwerty','zxcvbnm']*2)
frame={'id':pid,'date':dates,'strings':strings}
df=pd.DataFrame(frame)
x=2
df['first_x_string']=df['strings'].str[0:x]
print(df)
which returns:
id date strings first_x_string
0 0 2018-01-01 asdfjkl; as
1 1 2018-01-02 qwerty qw
2 2 2018-01-03 zxcvbnm zx
3 3 2018-01-04 asdfjkl; as
4 4 2018-01-05 qwerty qw
5 5 2018-01-06 zxcvbnm zx
but when I use modin.pandas (swapping which line is commented at the start), I get the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-e08362b2a4c0> in <module>
1 x=2
----> 2 df['first_x_string']=df['strings'].str[0:x]
3
4 print(df)
TypeError: 'StringMethods' object is not subscriptable
I also get additional user warnings that I did not get for pandas:
UserWarning: Distributing <class 'list'> object. This may take some time.
UserWarning: Distributing <class 'dict'> object. This may take some time.
My questions are:
- How do I fix this?
- As I look to convert code to modin, are there specific types of commands that will work in pandas but not in modin?
- Do the user warnings indicate that some operations are slower in modin than pandas, so that I should be selective about what I choose to use it for?
- Additionally, is it feasible (or desirable) to use modin for certain operations such as read_csv() to create a dataframe, then use pandas to run operations on that dataframe, and possibly use modin again to save it? For my current processes, loading (and to a lesser degree saving) are the most intensive tasks.
#========================================
Update:
#========================================
I have figured out fixes for the specific question I asked, but would still like the other (more general) questions answered. Below is code for alternative ways of capturing the first x characters of a string, with timing:
import time
x=2
tic = time.perf_counter()
# Enabled for the pandas run; commented out under modin, where it raises the TypeError above
#df['first_x_string']=df['strings'].str[0:x]
toc = time.perf_counter()
print(f'original completed in {toc-tic:0.4f} seconds')
tic = time.perf_counter()
df['first_x_string']=df['strings'].str.get(0)+df['strings'].str.get(1)
toc = time.perf_counter()
print(f'2x get() completed in {toc-tic:0.4f} seconds')
tic = time.perf_counter()
df['first_x_string']=[y[0:x] for y in df['strings']]
toc = time.perf_counter()
print(f'list comprehension completed in {toc-tic:0.4f} seconds')
print(df)
Running this on a dataframe 100 times the size of the example (600 rows) returns:
Pandas:
original completed in 0.0016 seconds
2x get() completed in 0.0020 seconds
list comprehension completed in 0.0009 seconds
id date strings first_x_string
0 0 2018-01-01 asdfjkl; as
1 1 2018-01-02 qwerty qw
2 2 2018-01-03 zxcvbnm zx
3 3 2018-01-04 asdfjkl; as
4 4 2018-01-05 qwerty qw
.. ... ... ... ...
595 595 2019-08-19 qwerty qw
596 596 2019-08-20 zxcvbnm zx
597 597 2019-08-21 asdfjkl; as
598 598 2019-08-22 qwerty qw
599 599 2019-08-23 zxcvbnm zx
[600 rows x 4 columns]
modin:
original completed in 0.0000 seconds
2x get() completed in 0.2152 seconds
list comprehension completed in 0.1667 seconds
id date strings first_x_string
0 0 2018-01-01 asdfjkl; as
1 1 2018-01-02 qwerty qw
2 2 2018-01-03 zxcvbnm zx
3 3 2018-01-04 asdfjkl; as
4 4 2018-01-05 qwerty qw
.. ... ... ... ...
595 595 2019-08-19 qwerty qw
596 596 2019-08-20 zxcvbnm zx
597 597 2019-08-21 asdfjkl; as
598 598 2019-08-22 qwerty qw
599 599 2019-08-23 zxcvbnm zx
[600 rows x 4 columns]
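Each figure above is a single measurement of a very short operation, so some of the difference may be noise or one-time startup cost. A minimal sketch of a more repeatable measurement using the standard-library timeit module follows. The helper name best_of and the plain-list workload are illustrative assumptions, not part of my original test; in practice the function being timed would be the dataframe operation itself.

```python
import timeit

def best_of(func, repeat=5, number=100):
    """Return the fastest average per-call time, in seconds, over `repeat` runs."""
    # timeit.repeat returns one total time per run; each run calls func `number` times
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number

# Stand-in workload: the same list-comprehension slice, on plain Python strings
strings = ['asdfjkl;', 'qwerty', 'zxcvbnm'] * 200
x = 2

def slice_with_comprehension():
    return [s[0:x] for s in strings]

elapsed = best_of(slice_with_comprehension)
print(f'list comprehension: {elapsed:0.6f} seconds per call')
```

Taking the minimum over several runs reduces the influence of background activity on the reported time.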
These comparisons seem to show that modin is not always faster, which brings me back to my questions about when to use modin and whether pandas and modin can be mixed (and, if that is not best practice, why not).
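One more spelling of the same slice is the method-call form, Series.str.slice, which in pandas gives the same result as .str[0:x]. The sketch below uses plain pandas only; whether modin handles this call without raising an error or falling back to pandas is an assumption I have not verified.

```python
import pandas as pd

strings = pd.Series(['asdfjkl;', 'qwerty', 'zxcvbnm'] * 2)
df = pd.DataFrame({'strings': strings})
x = 2

# .str.slice(0, x) is the method-call form of the subscript .str[0:x]
df['first_x_string'] = df['strings'].str.slice(0, x)
print(df)
```

Because it avoids subscripting the .str accessor, this form sidesteps the specific "not subscriptable" error, provided the accessor implements slice.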