0

I have a modin dataframe having ~120k rows. I want coalesce some columns of it. Modin df iterrows is taking lot of time, so I tried with numpy.where. Numpy.where is on the equivalent pandas df does it in 5-10 minutes but same thing on modin df takes ~30 minutes. Any alternative to speed this task for modin dataframe?

[cols_to_be_coalesced] --> this list contains list of columns to be coalesced. It contains 10-15 columns.

Code:

for COL in [cols_to_be_coalesced]:
    df['COL'] = np.where(df['COL']!='', df['COL'], df['COL_X'])   

If df is pandas dataframe, it executes in ~10 minutes, but if its a modin dataframe, it takes ~30 minutes. So is there a any equivalent code for numpy.where for modin dataframes to speed up this operation?

James Z
  • 12,209
  • 10
  • 24
  • 44
Vikas Garud
  • 143
  • 2
  • 10
  • 1
    Try with - np.where(df['COL'] .values!='', df['COL'] .values, df['COL_X'] .values) – Nk03 Apr 23 '21 at 06:10
  • @Nk03 I tried with your suggestion, but it didn't make any difference, it took 1 hour to do it. numpy-pandas does it in 4 minutes. – Vikas Garud Apr 23 '21 at 08:40
  • what's the length of [cols_to_be_coalesced]. If it's large, then you should consider vectorizing it. – Nk03 Apr 23 '21 at 10:34
  • @Nk03 - The list contains 15-20 columns. Let me explain - I'm merging 5 datasets one by one. After EACH merge this above operation happens. Total records after 5 merges is ~120k. So after each merge, around 15 to 20 columns needs to coalesced using above code. Numpy-pandas takes just 5 minutes. But with modin, its taking 50 minutes. Can you please show me how to vectorize it for modin ? – Vikas Garud Apr 23 '21 at 10:50
  • So, that means for loop is running for 15-20 iterations?. If you use multithreading/ multiprocessing then you can speed up this for loop by 15 times approx. – Nk03 Apr 23 '21 at 10:58
  • @Nk03 - For loop really is not the problem because with this for loop numpy-pandas with np.where completes it in 5 minutes. np.where doesn't work for modin df as expected. I need np.where eqvivalent function for modin which will do this in similar time. – Vikas Garud Apr 23 '21 at 11:37

1 Answers1

1

I think that your np.where is slow because np.where converts the Modin dataframe to a numpy array, and converting Modin dataframes to numpy is slow. Is this version using pandas.Series.where (not the Modin where implementation, because that hasn't been added yet) faster for you?

for COL in [cols_to_be_coalesced]:
    df['COL'] = df['COL'].where(df['COL'] != '', df['COL_X']) 

I find that that approach takes 1.58 seconds versus 70 seconds for the original approach in this example:

import modin.pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8))).add_prefix("col")
# setting column with np.where takes 70 seconds
df['col1'] = np.where(df['col1'] % 2 == 0, df['col1'], df['col2'])
# setting column with pandas.Series.where takes 1.58 seconds
df['col1'] = df['col1'].where(df['col1'] % 2 == 0, df['col2'])