0

I use the Modin library for multiprocessing. While the library is great for faster processing, it fails at merge, and I would like to revert to default pandas partway through the code.

I understand that, per PEP 8 (E402), imports should be declared once at the top of the code; however, my case needs otherwise.

import pandas as pd
import modin.pandas as mpd    
import os
import ray

ray.init()
os.environ["MODIN_ENGINE"] = "ray"

df = mpd.read_csv()
# do stuff

Then I would like to revert to default pandas within the same code. How would I run the lines below in pandas? There does not seem to be a clear way to switch between pd and mpd for them, and unfortunately Modin seems to take precedence over pandas.

df = df.loc[:, df.columns.intersection(['col1', 'col2'])]
df = df.drop_duplicates()
df = df.sort_values(['col1', 'col2'], ascending=[True, True])

Is it possible? If so, how?

Rander

4 Answers

2

You can simply do the following:

import modin.pandas as mpd

import pandas as pd

This way you have both Modin and the original pandas available, and you can switch between them as needed.

  • I have added code to explain. How do I call pandas specifically for some functions like `drop_duplicates`, `loc` and `sort_values`? – Rander Jun 04 '22 at 04:00
  • Thanks for updating the question. I assume you cannot switch to the original pandas in this case, or vice versa. The important factor here is the module you read the CSV with. From the code sample above I see that you read it with mpd (Modin), so all the methods and functions associated with that dataframe will be based on Modin. – Syntax Error Jun 04 '22 at 11:12
  • As @Nin17 pointed out, you can convert to pandas with `df = df._to_pandas()`, which changes the distributed dataframe into a pandas dataframe that runs on a single core. – Rander Jun 05 '22 at 12:04
  • Oh alright. I wasn't aware of it. Thanks for updating. – Syntax Error Jun 05 '22 at 12:31
1

Several answers have been posted, but for this particular case the approach pointed out by @Nin17 and in this comment from the Modin GitHub applies. To convert from Modin to pandas for single-core processing of operations such as df.merge, you can use:

import pandas as pd
import modin.pandas as mpd
import os
import ray

ray.init()
os.environ["MODIN_ENGINE"] = "ray"

# Read the dataframe into Modin for parallel processing
df_modin = mpd.read_csv()
# Convert the Modin dataframe into pandas for single-core processing
df_pandas = df_modin._to_pandas()

If you would like to convert the dataframe back to a Modin dataframe for parallel processing:

df_modin = mpd.DataFrame(df_pandas)
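
To tie this back to the three lines in the question, here is a minimal sketch of how the conversion could be wrapped so the pandas-only steps run on a plain pandas dataframe. The helper names `as_pandas` and `dedupe_and_sort` are hypothetical (they are not part of Modin or pandas), and the sketch assumes the private `_to_pandas()` method quoted above is available on the Modin dataframe; as a private method, it may change between Modin versions.

```python
def as_pandas(df):
    """Return a plain pandas object for `df` (hypothetical helper).

    Assumption: Modin dataframes expose the private `_to_pandas()` method
    quoted in this answer. Anything without that method (for example an
    object that is already plain pandas) is returned unchanged.
    """
    convert = getattr(df, "_to_pandas", None)
    return convert() if callable(convert) else df


def dedupe_and_sort(df):
    """Run the question's three steps on a plain pandas dataframe."""
    pdf = as_pandas(df)  # leave Modin before the pandas-only steps
    pdf = pdf.loc[:, pdf.columns.intersection(["col1", "col2"])]
    pdf = pdf.drop_duplicates()
    return pdf.sort_values(["col1", "col2"], ascending=[True, True])
```

With the names from the code above, `df_modin = mpd.DataFrame(dedupe_and_sort(df_modin))` would run those steps in pandas and hand the result back to Modin for further parallel work.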
Rander
0

You can try the pandarallel package instead of Modin. It is based on a similar concept: https://pypi.org/project/pandarallel/#description

Pandarallel benchmarks: https://libraries.io/pypi/pandarallel

  • This is great; however, I am running a Windows environment, and `pandarallel` works only on Mac, WSL or Linux. – Rander Jun 05 '22 at 12:06
  • It is a Python library, so all it needs is a Python environment. Windows / Mac / Linux shouldn't be an issue. – Syntax Error Jun 05 '22 at 12:29
0

As @Nin17 said in a comment on the question, this comment from the Modin GitHub describes how to convert a Modin dataframe to pandas. Once you have a pandas dataframe, you can call any pandas method on it. This other comment from the same issue describes how to convert the pandas dataframe back to a Modin dataframe.