I am trying to create a "vector" column in a Dask DataFrame from a large np.array of vectors (currently 500k vectors of dimension 1536).
With a pandas DataFrame, the code would look something like this:
import pandas as pd
import numpy as np
vectors = np.array([
    np.array([1, 2, 3]),
    np.array([4, 5, 6]),
    np.array([7, 8, 9])
])
df = pd.DataFrame({
    "vector": vectors.tolist()
})
df
The resulting df has the structure I want. However, it takes 34 GB of memory just to load.
| | vector |
|---|---|
| 0 | [1, 2, 3] |
| 1 | [4, 5, 6] |
| 2 | [7, 8, 9] |
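For context on the memory figure, here is a rough back-of-the-envelope estimate of what `.tolist()` costs compared with the dense array. It assumes the values are float64 and a 64-bit CPython build, neither of which is stated above, and it ignores per-list and pandas overhead, so treat the numbers as an order-of-magnitude sketch only.

```python
import sys
import numpy as np

n_rows, dim = 500_000, 1536  # shape mentioned above

# Dense NumPy storage: 8 bytes per element (float64 is an assumption).
dense_bytes = n_rows * dim * np.dtype(np.float64).itemsize

# After .tolist(), every element is a separate Python float object, and each
# row list also holds one 8-byte pointer per element (64-bit build assumed).
per_element = sys.getsizeof(1.0) + 8
list_bytes = n_rows * dim * per_element

print(f"dense array : {dense_bytes / 1e9:.1f} GB")
print(f"list column : {list_bytes / 1e9:.1f} GB (rough estimate)")
```

Under these assumptions the list-of-lists form needs several times the memory of the array itself, which is in the same range as the 34 GB observed.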
I tried a few options:
Option #1
import dask.dataframe as dd
import dask.array as da
import numpy as np
vectors = np.array([
    np.array([1, 2, 3]),
    np.array([4, 5, 6]),
    np.array([7, 8, 9])
])
vectors = da.from_array(vectors)
df = dd.from_dask_array(vectors)
df
This results in a df where each element of a vector gets its own column, rather than a single "vector" column.
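The same one-column-per-element layout shows up in plain pandas when a 2-D array is passed directly, which suggests it follows from the array's shape rather than from anything specific to Dask. The small sketch below uses only pandas and NumPy to contrast the two layouts; the names `wide` and `packed` are illustrative.

```python
import numpy as np
import pandas as pd

vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# A 2-D array of shape (n_rows, dim) is read as n_rows rows by dim columns.
wide = pd.DataFrame(vectors)
print(wide.shape)  # (3, 3)

# Passing each row as a single object gives one "vector" column instead.
# list(vectors) yields 1-D NumPy arrays, not Python lists.
packed = pd.DataFrame({"vector": list(vectors)})
print(packed.shape)  # (3, 1)
```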
Option #2
import dask.dataframe as dd
import dask.array as da
import numpy as np
# vectors = np.load(dataset_path / "vectors.npy")
vectors = np.array([
    np.array([1, 2, 3]),
    np.array([4, 5, 6]),
    np.array([7, 8, 9])
])
df = dd.from_dask_array(da.from_array(vectors))
columns_to_drop = df.columns.tolist()
df["vector"] = df.apply(lambda row: tuple(row), axis=1, meta=(None, 'object'))
df = df.drop(columns=columns_to_drop)
df
This produces the correct result, but it looks cumbersome and is probably inefficient.