What can Modin be used for?

Question

I have been looking at parallelization options and found Ray and Modin. After some tests I am unsure which workloads actually benefit from Modin. Two examples:

df = pd.read_csv() on a 180 MB file: pandas 5.2 s vs. modin.pandas 2.7 s

but df.groupby(): pandas 0.59 s vs. modin.pandas 5.46 s

What kind of applications will benefit from using Modin? Is there a general rule here, or does everything have to be tested separately?

I found this benchmark very informative: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html It reports speed-ups for read_csv() 2.6x; pd.concat() 86.83x; df.fillna() 8.57x; df.count() 23.70x; df.isnull() 83.17x, and slowdowns for df.groupby(), df.dropna(), df.drop_duplicates(), df.describe(), df.max(). - Tomasz Turowski, Jan 27 '21 at 11:00

lytseeker · Answer 1 · 2021-01-08T14:21:05.817

From https://modin.readthedocs.io/en/latest/

Modin uses Ray or Dask to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.

Two main features that stand out are:

Using multiple CPU cores with the same pandas API:

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine.

Support for very big datasets

With Modin, because of its light-weight, robust, and scalable nature, you get a fast DataFrame at 1MB and 1TB+

Specifically for the slow groupby() part of the question, there is a GitHub discussion pointing out that plain pandas can outperform modin.pandas for it: https://github.com/modin-project/modin/issues/895
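Since the results differ per operation, it may be easiest to time the operations you care about on your own data. Below is a minimal sketch of such a comparison, not an official Modin benchmark. The helper names (time_call, load_backends, compare, demo) and the toy frame are made up for illustration. It imports whichever of pandas and modin.pandas are installed and runs the same operations against each. One caveat: Modin may execute some operations asynchronously, so a timing may be optimistic unless the result is actually materialized.

```python
import importlib
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls to func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def load_backends(names=("pandas", "modin.pandas")):
    """Import whichever of the named DataFrame modules are installed."""
    backends = {}
    for name in names:
        try:
            backends[name] = importlib.import_module(name)
        except ImportError:
            pass  # backend not installed, skip it
    return backends


def compare(make_frame, operations, backends, repeats=3):
    """Time every operation on a frame built once per backend.

    make_frame: callable taking the backend module and returning a frame.
    operations: dict mapping a label to a callable that takes the frame.
    Returns {operation label: {backend name: seconds}}.
    """
    results = {}
    for backend_name, module in backends.items():
        frame = make_frame(module)  # built outside the timed region
        for op_name, op in operations.items():
            seconds = time_call(lambda: op(frame), repeats)
            results.setdefault(op_name, {})[backend_name] = seconds
    return results


def demo(rows=200_000):
    """Hypothetical example: compare a few operations from this thread."""
    def make_frame(pd):
        return pd.DataFrame({
            "key": [i % 100 for i in range(rows)],
            "value": list(range(rows)),
        })

    operations = {
        "groupby().sum()": lambda df: df.groupby("key").sum(),
        "isnull()": lambda df: df.isnull(),
        "count()": lambda df: df.count(),
    }
    results = compare(make_frame, operations, load_backends())
    for op_name, timings in results.items():
        print(op_name, timings)
```

Calling demo() prints one line per operation with the timings for each installed backend. With Ray or Dask as the engine it is safer to call it from under an `if __name__ == "__main__":` guard. Replace make_frame with something that loads your real data to get numbers that mean anything for your workload.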

Modin is still under active development. The README.md in their GitHub repo (https://github.com/modin-project/modin) tabulates pandas API coverage, mentioning these functions:

score 0 · Answer 2 · edited Jan 21 '21 at 07:33

As a rule of thumb, column-wise transformations such as the aggregate functions (groupby(), sum(), count()) will generally be faster in Modin.

The simple reason is that Modin uses multiple cores of your machine, so these operations can run faster than they would in pandas.

Typically, if you are using .transform() or .apply() on any of the columns, Modin will be able to do it faster.

However, there are a few cases in which Modin will be slower than pandas. Example:

.append()

thanks!
