Pandas string subscripting does not work in modin (and related questions about converting pandas code to modin)

Question

I recently learned about modin, and am trying to convert some of my code from pandas to modin. My understanding is that modin has some operations that run faster and others that it has not optimized, so it defaults to pandas for those. Thus anything that runs in pandas should run in modin, but this does not seem to be the case.

The following code works as intended in pandas, but I get an error in modin:

#import modin.pandas as pd
import pandas as pd

dates = pd.date_range('20180101',periods=6)
pid=pd.Series(list(range(6)))
strings=pd.Series(['asdfjkl;','qwerty','zxcvbnm']*2)
frame={'id':pid,'date':dates,'strings':strings}

df=pd.DataFrame(frame)

x=2
df['first_x_string']=df['strings'].str[0:x]

print(df)

which returns:

   id       date   strings first_x_string
0   0 2018-01-01  asdfjkl;             as
1   1 2018-01-02    qwerty             qw
2   2 2018-01-03   zxcvbnm             zx
3   3 2018-01-04  asdfjkl;             as
4   4 2018-01-05    qwerty             qw
5   5 2018-01-06   zxcvbnm             zx

but when I use modin.pandas (swapping which line is commented at the start), I get the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-e08362b2a4c0> in <module>
      1 x=2
----> 2 df['first_x_string']=df['strings'].str[0:x]
      3 
      4 print(df)

TypeError: 'StringMethods' object is not subscriptable

I also get additional user warnings that I did not get for pandas:

UserWarning: Distributing <class 'list'> object. This may take some time.
UserWarning: Distributing <class 'dict'> object. This may take some time.

My questions are:

~~How do I fix this?~~
As I look to convert code to modin, are there specific types of commands that will work in pandas but not in modin?
Do the user warnings indicate that some operations are slower in modin than pandas, so that I should be selective about what I choose to use it for?
Additionally, is it feasible (or desirable) to use modin to do certain operations like read_csv() to create a dataframe, then use pandas to run operations on that dataframe, and possibly use modin again to save the dataframe? For my current processes, loading (and to a lesser degree saving) are the most intensive tasks.

#========================================

Update:

#========================================

I have figured out fixes for the specific question I asked, but would like the other (more general) questions answered. Code for alternative methods of capturing the first x characters in a string, with timing functions:

import time
x=2

tic = time.perf_counter()
#df['first_x_string']=df['strings'].str[0:x]
toc = time.perf_counter()
print(f'original completed in {toc-tic:0.4f} seconds')

tic = time.perf_counter()
df['first_x_string']=df['strings'].str.get(0)+df['strings'].str.get(1)
toc = time.perf_counter()
print(f'2x get() completed in {toc-tic:0.4f} seconds')

tic = time.perf_counter()
df['first_x_string']=[y[0:x] for y in df['strings']]
toc = time.perf_counter()
print(f'list comprehension completed in {toc-tic:0.4f} seconds')

print(df)

Running this on a dataframe that is 100X the example one returns:

Pandas:

original completed in 0.0016 seconds
2x get() completed in 0.0020 seconds
list comprehension completed in 0.0009 seconds
      id       date   strings first_x_string
0      0 2018-01-01  asdfjkl;             as
1      1 2018-01-02    qwerty             qw
2      2 2018-01-03   zxcvbnm             zx
3      3 2018-01-04  asdfjkl;             as
4      4 2018-01-05    qwerty             qw
..   ...        ...       ...            ...
595  595 2019-08-19    qwerty             qw
596  596 2019-08-20   zxcvbnm             zx
597  597 2019-08-21  asdfjkl;             as
598  598 2019-08-22    qwerty             qw
599  599 2019-08-23   zxcvbnm             zx

[600 rows x 4 columns]

modin:

original completed in 0.0000 seconds
2x get() completed in 0.2152 seconds
list comprehension completed in 0.1667 seconds
      id       date   strings first_x_string
0      0 2018-01-01  asdfjkl;             as
1      1 2018-01-02    qwerty             qw
2      2 2018-01-03   zxcvbnm             zx
3      3 2018-01-04  asdfjkl;             as
4      4 2018-01-05    qwerty             qw
..   ...        ...       ...            ...
595  595 2019-08-19    qwerty             qw
596  596 2019-08-20   zxcvbnm             zx
597  597 2019-08-21  asdfjkl;             as
598  598 2019-08-22    qwerty             qw
599  599 2019-08-23   zxcvbnm             zx

[600 rows x 4 columns]

These comparisons seem to show that modin is not always faster, which brings me back to my questions about when to use modin and whether we can mix and match pandas and modin (or, if that is not best practice, why not).

score 1 · Answer 1 · answered Dec 07 '21 at 00:54

How do I fix this?

This pull request added that feature to Modin. Your code does not raise a warning on Modin version 0.12.0.
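
For reference, pandas also offers str.slice() as a method-call spelling of the same operation, and unlike the 2x get() workaround in the question it does not hard-code x. The sketch below only checks the equivalence in pandas; whether str.slice() avoids the TypeError on older Modin versions is an assumption I have not verified.

```python
import pandas as pd

# Same sample strings as in the question
strings = pd.Series(['asdfjkl;', 'qwerty', 'zxcvbnm'] * 2)
x = 2

# Subscript form from the question
subscripted = strings.str[0:x]

# Method-call form: slice(start, stop) takes the same bounds
sliced = strings.str.slice(0, x)

# Both forms give identical results in pandas
print(subscripted.equals(sliced))
print(sliced.tolist())
```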

As I look to convert code to modin, are there specific types of commands that will work in pandas but not in modin?

There are several such commands. Right now Modin covers over 88% of the API for both Dataframe and Series on both Ray and Dask, as stated here. You can view a detailed summary of API coverage for Dataframes here and for Series here.

Normally when Modin doesn't cover a Pandas method, it should default to Pandas, so that it can be used as a drop-in replacement for all Pandas code.

Do the user warnings indicate that some operations are slower in modin than pandas, so that I should be selective about what I choose to use it for?

Modin issues these warnings when you initialize a Modin Dataframe or Series out of a Python object that is not already distributed. They indicate that Modin must pay an up-front cost to distribute your data over multiple cores or nodes. A Pandas dataframe or series would not pay that cost, so it may initialize faster. However, ideally, future operations should be faster on a distributed Modin dataframe or series than on an equivalent Pandas object.

Additionally, is it feasible (or desirable) to use modin to do certain operations like read_csv() to create a dataframe, then use pandas to run operations on that dataframe, and possibly use modin again to save the dataframe?

This is possible. You can convert the Modin dataframe to Pandas at any point with _to_pandas(), and you can create a Modin dataframe out of a Pandas one by passing the Pandas dataframe to the constructor of modin.pandas.DataFrame. There's nothing wrong with doing this, but Modin aims to give better performance on its own.
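
A minimal sketch of that round trip is below. The helper names to_plain_pandas and to_modin are hypothetical, made up for illustration; they only wrap the two conversions described above. The demo at the bottom runs with pandas alone, and to_modin assumes Modin and one of its engines are installed.

```python
import pandas as pd


def to_plain_pandas(df):
    """Return a plain pandas DataFrame from either a Modin or a pandas one.

    Hypothetical helper: it relies on the _to_pandas() method mentioned
    above. A pandas DataFrame has no such method, so it is returned as-is.
    """
    convert = getattr(df, '_to_pandas', None)
    return convert() if callable(convert) else df


def to_modin(pandas_df):
    """Hypothetical helper: wrap a pandas DataFrame via the Modin constructor."""
    import modin.pandas as mpd  # imported lazily; assumes Modin is installed

    return mpd.DataFrame(pandas_df)


# Pandas-only demonstration. With Modin installed, the input could instead
# be the result of modin.pandas.read_csv(...).
plain = to_plain_pandas(pd.DataFrame({'strings': ['asdfjkl;', 'qwerty']}))
plain['first_x_string'] = plain['strings'].str[0:2]
print(plain)
```

Keep in mind that each conversion copies the data between the distributed and the in-memory representation, so it makes sense to convert once rather than inside a loop.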

Regarding the performance of this example, I find that, as the warnings suggest, the lines creating the series and the dataframe are much slower on Modin. The last line, df['first_x_string']=df['strings'].str[0:x], is slower for 600 dates, but once I go to 60000 dates, Modin takes 11.1 ms, while Pandas takes 15.4 ms. Generally, Modin is likely to perform better on larger datasets, while the overhead of splitting the data and coordinating parallel computation might make it slower on smaller ones.
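
One way to check where the crossover lies on your own machine is to time the same operation at several sizes while keeping dataframe construction out of the measurement. This is only a sketch: build_frame and time_slice are hypothetical helper names, the import is plain pandas, and swapping it for modin.pandas (as in the question) would time Modin instead. It prints whatever timings your machine produces, which will differ from the figures above.

```python
import timeit

import pandas as pd  # swap for `import modin.pandas as pd` to time Modin


def build_frame(n_rows):
    """Build a frame shaped like the question's example, with n_rows rows."""
    base = ['asdfjkl;', 'qwerty', 'zxcvbnm']
    strings = pd.Series([base[i % 3] for i in range(n_rows)])
    return pd.DataFrame({'id': list(range(n_rows)), 'strings': strings})


def time_slice(n_rows, x=2, repeats=5):
    """Best-of-`repeats` time in seconds for the string slice alone."""
    df = build_frame(n_rows)  # construction is excluded from the timing
    runs = timeit.repeat(lambda: df['strings'].str[0:x], number=1, repeat=repeats)
    return min(runs)


timings = {n: time_slice(n) for n in (600, 60_000)}
for n, seconds in timings.items():
    print(f'{n} rows: {seconds * 1000:.2f} ms')
```

Taking the minimum of several runs reduces noise from one-off costs such as caches warming up, which matters at the sub-millisecond scale seen with 600 rows.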

If you find that a particular operation is too slow on Modin, I suggest that you file a reproducible example on the Modin GitHub here.
