
I am trying to have a "vector" column in a Dask DataFrame, built from a large np.array of vectors (currently 500k × 1536).

With a Pandas DataFrame, the code would look something like this:

import pandas as pd
import numpy as np

vectors = np.array([
    np.array([1, 2, 3]), 
    np.array([4, 5, 6]), 
    np.array([7, 8, 9])
])

df = pd.DataFrame({
    "vector": vectors.tolist()
})

df

The resulting df structure looks good. However, it takes 34 GB of memory just to load.

      vector
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]

I tried a few options:

Option #1

import dask.dataframe as dd
import dask.array as da
import numpy as np

vectors = np.array([
    np.array([1, 2, 3]), 
    np.array([4, 5, 6]), 
    np.array([7, 8, 9])
])


vectors = da.from_array(vectors)
df = dd.from_dask_array(vectors)

df

This one results in a df where each element of the vector gets its own column.

Option #2

import dask.dataframe as dd
import dask.array as da
import numpy as np

# vectors = np.load(dataset_path / "vectors.npy")

vectors = np.array([
    np.array([1, 2, 3]), 
    np.array([4, 5, 6]), 
    np.array([7, 8, 9])
])

df = dd.from_dask_array(da.from_array(vectors))
columns_to_drop = df.columns.tolist()
df["vector"] = df.apply(lambda row: tuple(row), axis=1, meta=(None, 'object'))
df = df.drop(columns=columns_to_drop)

df

This one produces the correct result, but it looks cumbersome and is probably not efficient.

Mike Chaliy

3 Answers


One possible workaround is using dd.from_dict:

vectors = np.array([
    np.array([1, 2, 3]),
    np.array([4, 5, 6]),
    np.array([7, 8, 9])
])

df = dd.from_dict({'vector': vectors.tolist()}, vectors.shape[0]).compute()
print(df.head())

      vector
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]
RomanPerekhrest
  • Under the hood this uses pandas; while it works, it loads everything into memory, and in my case that is exactly what I am trying to avoid. – Mike Chaliy Aug 31 '23 at 18:41

This is only a pseudo-code solution, but judging from the code snippets in the original question, it should be feasible for you to transform it into working code (assuming the solution looks promising). The main steps are:

  1. Identify the npy dimensions (by inspecting the file or from elsewhere).

  2. Create lazy pandas dataframes by using delayed on chunks of data (where chunking and number of chunks are determined from step 1).

  3. Create a dask dataframe via dd.from_delayed.

The function in step 2 could look like this:

from dask import delayed
import numpy as np
import pandas as pd

@delayed
def lazy_pandas_df(index_start, index_end):
    # mmap_mode="r" keeps the file on disk; only the requested slice is read
    # https://numpy.org/doc/stable/reference/generated/numpy.load.html
    full_array = np.load(my_npy, mmap_mode="r")
    vectors = full_array[index_start:index_end, :]
    return pd.DataFrame({"vector": vectors.tolist()})
SultanOrazbayev

Before posting a real solution, here is a little advice. A numpy array like

x = np.arange(1536)

takes up 12400 bytes (12288 of data and 112 for the array object), but

y = x.tolist()

takes up 49208 bytes (12344 for the list object and its pointers, plus 24 bytes per boxed integer).

If you also consider that operations on lists/dicts will not vectorize: do not use Python objects for mathematical data!
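A quick way to verify these numbers yourself (the exact per-object constants vary by Python version, so only rough assertions are sensible):

```python
import sys
import numpy as np

x = np.arange(1536, dtype=np.int64)
print(x.nbytes)              # 12288 bytes of raw data
print(sys.getsizeof(x))      # raw data plus ~100 bytes of array-object overhead

y = x.tolist()
list_bytes = sys.getsizeof(y)                    # list header and pointer table
int_bytes = sum(sys.getsizeof(v) for v in y)     # one boxed Python int per element
print(list_bytes + int_bytes)                    # several times the raw data size
```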

mdurant
  • This is a really good point. However, I was not able to find a way to get data from a np.array of np.arrays directly into a DataFrame. – Mike Chaliy Sep 01 '23 at 14:56