To understand how Modin speeds up Pandas operations, a few words about its architecture are needed. A Modin Frame is a 2D array of partitions, where each partition is a Pandas DataFrame (link to doc with explanatory images). Usually a DataFrame is split into N_cores partitions, so an operation on a Modin Frame runs in parallel on every partition; that is how Modin speeds up Pandas computations.
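The "2D array of partitions" idea can be sketched with plain Pandas. This is only an illustration, not Modin's implementation: the helper names (split_into_grid, map_grid, assemble) are hypothetical, and a thread pool stands in for whatever execution engine actually runs the partitions, so it shows the structure rather than a real speedup.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor


def split_into_grid(df, n_row_parts, n_col_parts):
    """Split df into a 2D list of pandas DataFrames (a toy partition grid)."""
    row_chunks = np.array_split(np.arange(df.shape[0]), n_row_parts)
    col_chunks = np.array_split(np.arange(df.shape[1]), n_col_parts)
    return [[df.iloc[rows, cols] for cols in col_chunks] for rows in row_chunks]


def map_grid(grid, fn):
    """Run fn on every partition; the pool is a stand-in for a real engine."""
    with ThreadPoolExecutor() as pool:
        return [list(pool.map(fn, grid_row)) for grid_row in grid]


def assemble(grid):
    """Glue the partitions back together: columns first, then rows."""
    return pd.concat([pd.concat(grid_row, axis=1) for grid_row in grid], axis=0)


df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("abcd"))
grid = split_into_grid(df, n_row_parts=2, n_col_parts=2)

# An element-wise operation needs no data from other partitions,
# so each block can be processed independently.
result = assemble(map_grid(grid, lambda part: part * 2))
print(result.shape)
```

Each of the four blocks here is an ordinary Pandas DataFrame of shape (3, 2), which is the property the paragraph above relies on.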
Modin has a flexible partitioning mechanism: it can repartition a frame on the fly depending on the operation. For example, when an operation needs the whole row (as in df.apply(fn, axis=1), where fn expects to receive a complete row), the Modin Frame is repartitioned into row partitions only, so modin_df.apply(fn, axis=1) will perform something like this (explanatory img).
As we see from the image, if we have a frame with shape (100000, 64) and apply a function, we get N parallel executions of .apply(), each on a frame of shape (100000/N, 64), which gives a decent speedup.
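A rough sketch of the row-partitioned case, again in plain Pandas and under stated assumptions: row_partitioned_apply is a hypothetical helper, not a Modin function, and the thread pool only mimics the shape of the computation. A smaller frame of shape (1000, 64) is used so the example runs quickly.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor


def row_partitioned_apply(df, fn, n_parts):
    """Apply fn to every row, with one task per block of complete rows."""
    # Block boundaries along the row axis; every block keeps all columns,
    # so fn always sees a whole row.
    bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
    parts = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(lambda part: part.apply(fn, axis=1), parts))
    return parts, pd.concat(results)


df = pd.DataFrame(np.ones((1000, 64)))
parts, out = row_partitioned_apply(df, lambda row: row.sum(), n_parts=4)

print([part.shape for part in parts])  # four blocks of shape (250, 64)
print(len(out))                        # one result per original row
```

With N = 4, each task works on a (1000/4, 64) block, mirroring the (100000/N, 64) blocks described above.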