
I am trying to have a "vector" column in a Dask DataFrame, built from a large np.array of vectors (currently 500k × 1536).

With a Pandas DataFrame, the code would look something like this:

import pandas as pd
import numpy as np

vectors = np.array([
    np.array([1, 2, 3]), 
    np.array([4, 5, 6]), 
    np.array([7, 8, 9])
])

df = pd.DataFrame({
    "vector": vectors.tolist()
})

df

The resulting df structure looks good. However, it takes 34 GB of memory just to load.

      vector
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]

I tried a few options:

Option #1

import dask.dataframe as dd
import dask.array as da
import numpy as np

vectors = np.array([
    np.array([1, 2, 3]), 
    np.array([4, 5, 6]), 
    np.array([7, 8, 9])
])


vectors = da.from_array(vectors)
df = dd.from_dask_array(vectors)

df

This one results in a df where each element of the vector gets its own column.

Option #2

import dask.dataframe as dd
import dask.array as da
import numpy as np

# vectors = np.load(dataset_path / "vectors.npy")

vectors = np.array([
    np.array([1, 2, 3]), 
    np.array([4, 5, 6]), 
    np.array([7, 8, 9])
])

df = dd.from_dask_array(da.from_array(vectors))
columns_to_drop = df.columns.tolist()
df["vector"] = df.apply(lambda row: tuple(row), axis=1, meta=(None, 'object'))
df = df.drop(columns=columns_to_drop)

df

This one produces the correct result, but it looks cumbersome and is probably not efficient.

Mike Chaliy

3 Answers


One possible workaround is using dd.from_dict:

vectors = np.array([
    np.array([1, 2, 3]),
    np.array([4, 5, 6]),
    np.array([7, 8, 9])
])

df = dd.from_dict({'vector': vectors.tolist()}, vectors.shape[0]).compute()
print(df.head())

      vector
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]
RomanPerekhrest
  • Under the hood this uses pandas; while it works, it loads everything into memory, and in my case that is exactly what I am trying to avoid. – Mike Chaliy Aug 31 '23 at 18:41

This is only a pseudo-code solution, but judging from the code snippets in the original question, it should be feasible for you to transform it into working code (assuming the solution looks promising). The main steps are:

  1. Identify the npy dimensions (by inspecting the file or from elsewhere).

  2. Create lazy pandas dataframes by using delayed on chunks of data (where chunking and number of chunks are determined from step 1).

  3. Create a dask dataframe via dd.from_delayed.

The function in step 2 could look like this:

from dask import delayed
import numpy as np
import pandas as pd

@delayed
def lazy_pandas_df(index_start, index_end):
    # mmap_mode="r" keeps the file on disk; only the requested slice is read
    # https://numpy.org/doc/stable/reference/generated/numpy.load.html
    full_array = np.load(my_npy, mmap_mode="r")
    vectors = full_array[index_start:index_end, :]
    return pd.DataFrame({"vector": vectors.tolist()})
SultanOrazbayev

Before posting a real solution, here is a little advice. A numpy array like

x = np.arange(1536)

takes up 12400 bytes (12288 of data and 112 for the array object), but

y = x.tolist()

takes up 49208 bytes (12344 for the list object and its pointers, plus 24 bytes per boxed integer).

If you also consider that operations on lists/dicts will not vectorize: do not use Python objects for mathematical data!
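A quick way to verify these numbers yourself (the exact per-object constants vary by Python version, so only rough assertions are sensible):

```python
import sys
import numpy as np

x = np.arange(1536, dtype=np.int64)
print(x.nbytes)              # 12288 bytes of raw data
print(sys.getsizeof(x))      # raw data plus ~100 bytes of array-object overhead

y = x.tolist()
list_bytes = sys.getsizeof(y)                    # list header and pointer table
int_bytes = sum(sys.getsizeof(v) for v in y)     # one boxed Python int per element
print(list_bytes + int_bytes)                    # several times the raw data size
```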

mdurant
  • This is a really good point. However, I was not able to find a way to get data from a np.array of np.arrays directly into a DataFrame. – Mike Chaliy Sep 01 '23 at 14:56