
I have a data frame with a column of comma-separated values stored as quoted strings (i.e., string objects). For example:

  df['a']
'1,2,3,4,5'
'2,3,4,5,6'

I can convert each string-formatted list of values to a NumPy array and run my operation on it successfully.

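import ast
import pandas as pd

# Placeholder for the forecasting function (invoked as arima(values, 1) below); body elided.
def func(x):
    return something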

for t_df in pd.read_csv("testset.csv",chunksize=2000):
    t_df['predicted'] = t_df['prev'].parallel_apply(lambda x : arima(ast.literal_eval(x),1))
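
For reference, here is a minimal sketch of what the parsing step inside the lambda does to a single cell. The sample string comes from the example at the top; the plain-split alternative is an assumption that every field is an integer, not something from my actual code.

```python
import ast

raw = "1,2,3,4,5"  # one cell of the column, as in the example above

# literal_eval parses a bare comma-separated literal as a tuple
parsed = ast.literal_eval(raw)

# A plain split avoids evaluating the string as a Python expression
# (assumes every field is an integer)
values = [int(v) for v in raw.split(",")]

print(parsed, values)
```

Either result can be passed to np.array; the split version gives a list rather than a tuple.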

So far I have had no issues. But func runs forecasting models, which is quite time-consuming, and the data frame has 2 million records.

So I tried the cudf package for Python, which brings GPU acceleration to pandas-like data frames. This is where the problem arises:

for t_df in pd.read_csv("testset.csv",chunksize=2):
    t_df['prev'] = t_df['prev'].apply(lambda x : np.array(ast.literal_eval(x)))
    t_df = cudf.DataFrame.from_pandas(t_df)

When I apply the same operation, it fails with an error that basically says it cannot convert the string-like object to a NumPy array. The error is as follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-e7866d751352> in <module>
     12     t_df['prev'] = t_df['prev'].apply(lambda x : np.array(ast.literal_eval(x)))
     13     st = time.time()
---> 14     t_df = cudf.DataFrame.from_pandas(t_df)
     15     t_df['predicted'] = 10
     16     res.append(t_df)

/opt/conda/lib/python3.7/site-packages/cudf/core/dataframe.py in from_pandas(cls, dataframe, nan_as_null)
   3109             # columns for a single key
   3110             if len(vals.shape) == 1:
-> 3111                 df[i] = Series(vals, nan_as_null=nan_as_null)
   3112             else:
   3113                 vals = vals.T

/opt/conda/lib/python3.7/site-packages/cudf/core/series.py in __init__(self, data, index, name, nan_as_null, dtype)
    128 
    129         if not isinstance(data, column.ColumnBase):
--> 130             data = column.as_column(data, nan_as_null=nan_as_null, dtype=dtype)
    131 
    132         if index is not None and not isinstance(index, Index):

/opt/conda/lib/python3.7/site-packages/cudf/core/column/column.py in as_column(arbitrary, nan_as_null, dtype, length)
   1353         elif arb_dtype.kind in ("O", "U"):
   1354             data = as_column(
-> 1355                 pa.Array.from_pandas(arbitrary), dtype=arbitrary.dtype
   1356             )
   1357         else:

/opt/conda/lib/python3.7/site-packages/cudf/core/column/column.py in as_column(arbitrary, nan_as_null, dtype, length)
   1265                 mask=pamask,
   1266                 size=pa_size,
-> 1267                 offset=pa_offset,
   1268             )
   1269 

/opt/conda/lib/python3.7/site-packages/cudf/core/column/numerical.py in __init__(self, data, dtype, mask, size, offset)
     30         dtype = np.dtype(dtype)
     31         if data.size % dtype.itemsize:
---> 32             raise ValueError("Buffer size must be divisible by element size")
     33         if size is None:
     34             size = data.size // dtype.itemsize

ValueError: Buffer size must be divisible by element size

What could be a possible solution?

Jack Daniel

1 Answer


As in your other question, I believe you're trying to force cudf to do something in a way it isn't designed for. While RAPIDS strives for API familiarity, it seems that:

  1. You're not currently following cudf or cuml best practices. Your goal is achievable, and we have resources that show how to get there.
  2. Although RAPIDS can read what is in your CSV, your preprocessing tries to push an np.array into a single column, and cudf can't read that format (hence your error). You need to change the output to something RAPIDS can read, such as one column for each element of the array (code below). This may be a feature gap between pandas and RAPIDS, and we encourage you to file a feature request.

If you haven't already, I'd encourage you to go through our docs and the notebook examples for cuml and cudf on GitHub. We have an ARIMA example notebook that runs on GPU. These are quick reads and will get you on your way. cudf can handle strings natively with .str, but our apply doesn't work well with strings yet. If your GPU memory is too small to hold all the data, use dask-cudf.

The trickiest part here is reading a dataset whose CSV fields contain comma-separated strings. You want each element in its own column along the row, not an array. RAPIDS applies don't work well on strings yet, but what you want to accomplish is very similar to the example code below. Sadly, RAPIDS can take longer than pandas on this one. However, the code works for both cudf and pandas, and its output is more usable throughout the RAPIDS ecosystem. Once your vectors are in columns, see where that gets you with cuml's ARIMA (linked above).

import cudf

df = cudf.read_csv('testset.csv')
vecnum_cols = ['a']  # columns holding comma-separated values
df_vecnum = cudf.DataFrame(index=df.index)

for vec in vecnum_cols:
    # Split each string on commas into one column per element
    v = df[vec].str.split(",", expand=True).reset_index(drop=True)
    # Name the new columns a_0, a_1, ...
    v.columns = [vec + '_' + str(i) for i in range(v.shape[1])]
    df_vecnum = df_vecnum.join(v)
print(df_vecnum.head())
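
One thing to watch: the columns produced by the split hold strings, and a forecasting model will want numbers. Below is a minimal pandas-only sketch of the extra conversion step, assuming every row has the same number of integer values. The helper name expand_vector_column is hypothetical, the sample data is taken from the question, and I have not verified the numeric conversion on cudf.

```python
import pandas as pd

def expand_vector_column(df, col):
    """Split a comma-separated string column into one numeric column per element."""
    parts = df[col].str.split(",", expand=True)
    parts.columns = [f"{col}_{i}" for i in range(parts.shape[1])]
    # str.split yields strings; convert each new column to a numeric dtype
    return parts.apply(pd.to_numeric)

df = pd.DataFrame({"a": ["1,2,3,4,5", "2,3,4,5,6"]})
wide = expand_vector_column(df, "a")
print(wide)
```

Each row of wide is then one series, with a_0 through a_4 as its successive values.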

Hope this helps. I can't guarantee it will get you all the way there, but based on what I saw above, it should point you in the right direction.

TaureanDyerNV