I would like to give multiple worker processes created by multiprocessing.Pool.map() read-only access to a shared DataFrame.

I would like to avoid copying and pickling.

I understand that pyarrow can be used for this, but I find its documentation quite cumbersome. Can anybody provide an example of how it can be done?

Konstantin
  • pyarrow does not look like it lets you share dataframes across processes. It provides an interface *similar* to dataframes that can be shared across processes. If you want to avoid copying/pickling, you'll need to use `multiprocessing.sharedctypes`; there's even an example of making a shared struct array for sharing structured data (see the sketch after these comments). – Aaron Feb 07 '19 at 22:07
  • From what I've understood from this talk: https://youtu.be/Hqi_Bw_0y8Q, pyarrow is supposed to provide a pandas integration that allows sharing a DataFrame, without copying or serialization, not only across processes but even across languages, frameworks, and operating systems. – Konstantin Feb 07 '19 at 22:37
  • It would seem we are both at least partially correct. At 21:20 of the video you shared, Wes talks about the actual process of sharing data between multiple Python processes. It would seem that you need to set up an Apache Spark instance to actually hold the data, and pyarrow streams the data in (read: serializes and copies it) as needed. This is very similar to Python multiprocessing shared objects, but the serialization format has much lower overhead and is much more performant. – Aaron Feb 08 '19 at 16:47
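
A minimal sketch of the `multiprocessing.sharedctypes` approach mentioned above, adapted from the pattern in the standard library documentation (the `Point` struct here is just a stand-in for whatever record type the real data has):

from multiprocessing import Process
from multiprocessing.sharedctypes import Array
from ctypes import Structure, c_double

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double)]

def read_points(points):
    # The worker reads the shared array in place; the elements are
    # not copied or pickled into the child process.
    print([(p.x, p.y) for p in points])

if __name__ == '__main__':
    # lock=False is acceptable here because access is read-only.
    points = Array(Point, [(1.0, 2.0), (3.0, 4.0)], lock=False)
    p = Process(target=read_points, args=(points,))
    p.start()
    p.join()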

1 Answer

The example at https://github.com/apache/arrow/blob/master/python/examples/plasma/sorting/sort_df.py is a working example that shares a Pandas dataframe between multiple workers using Python multiprocessing (note that running it requires building a small Cython library).

The dataframe is shared via Arrow's Plasma object store.
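
To give a sense of the Plasma API without the sorting boilerplate, here is a minimal sketch; it assumes a plasma store has been started separately (e.g. with plasma_store -m 1000000000 -s /tmp/plasma) and a reasonably recent pyarrow:

import numpy as np
import pandas as pd
import pyarrow.plasma as plasma

# Connect to the running plasma store.
client = plasma.connect("/tmp/plasma")

df = pd.DataFrame(np.random.normal(size=(1000, 10)))

# put() serializes the dataframe with pyarrow into shared memory and
# returns an ObjectID that other processes can use to retrieve it.
object_id = client.put(df)

# In a worker process connected to the same store:
df_copy = client.get(object_id)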

If you are not tied to Python multiprocessing, you can use Ray to do what you want with simpler syntax.

To give multiple workers read-only access to a Pandas dataframe, you can do the following.

import numpy as np
import pandas
import ray

ray.init()

df = pandas.DataFrame(np.random.normal(size=(1000, 10)))

@ray.remote
def f(df):
    # This task will run on a worker and have read-only access to the
    # dataframe. For example, "df.iloc[0][0] = 1" raises a ValueError.
    try:
        df.iloc[0][0] = 1
    except ValueError:
        pass
    return df.iloc[0][0]

# Serialize the dataframe with pyarrow and store it in shared memory.
df_id = ray.put(df)

# Run four tasks that have access to the dataframe.
result_ids = [f.remote(df_id) for _ in range(4)]

# Get the results.
results = ray.get(result_ids)

Note that the line `df_id = ray.put(df)` can be omitted (you can call `f.remote(df)` directly). In that case, `df` will still be stored in shared memory and shared with the workers, but it will be stored four times (once for each call to `f.remote(df)`), which is less efficient.
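
For comparison, the direct form of that call looks like this:

# Passing df directly: each call puts its own copy of the dataframe
# into shared memory, so the data ends up stored four times.
result_ids = [f.remote(df) for _ in range(4)]
results = ray.get(result_ids)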

Robert Nishihara
  • Wow. That's actually quite a lot of boilerplate code in the Plasma example :( I wonder if `put_df()` creates a copy or performs some transformation on the data, as it says "serializing" in the comments. – Konstantin Feb 08 '19 at 08:31
  • You're right, `put_df` and `get_dfs` can actually be made much more concise now. E.g., now instead of `put_df(df)` you can use `client.put(df)` and instead of `get_df(object_ids)` you can do `client.get(object_ids)`. – Robert Nishihara Feb 09 '19 at 02:05
  • I have only now checked out Ray, and it looks like an awesome alternative to multiprocessing and "raw" Arrow. In fact, it uses Arrow underneath, which I wasn't aware of. So it looks like this answer provides exactly what I want. – Konstantin Feb 09 '19 at 07:31
  • I assume that `ray.put(df)` copies the data in the DataFrame to shared memory. Is it possible to create the DataFrame in such a way that this copying can be avoided completely? If I understood correctly, that's the promise of Arrow :) – Konstantin Feb 09 '19 at 07:36
  • You're right that `ray.put(df)` copies the data to shared memory. Assuming the dataframe was initially created by pandas, we need at least one copy to convert it to the Arrow format (in Ray, this serialization step is combined with the copy into shared memory). On the other hand, if you are able to create the dataframe in the Arrow format to begin with (without going through Pandas), then it should be possible to get rid of this copy. – Robert Nishihara Feb 11 '19 at 18:25
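
As an illustration of that last point, here is one way to build the data as an Arrow table directly rather than through pandas (a sketch only; whether this actually eliminates the copy depends on how Ray stores Arrow data internally):

import numpy as np
import pyarrow as pa

# Construct the columns as Arrow arrays from the start, so no
# pandas-to-Arrow conversion is ever needed.
table = pa.Table.from_arrays(
    [pa.array(np.random.normal(size=1000)) for _ in range(10)],
    names=[str(i) for i in range(10)],
)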