I am using Python 2.7 with Dask.

I have a dataframe with one column of tuples that I created like this:

table[col] = table.apply(lambda x: (x[col1],x[col2]), axis = 1, meta = pd.Dataframe) 

I want to convert this tuple column back into two separate columns. In pandas I would do it like this:

table[[col1,col2]] = table[col].apply(pd.Series) 

The reason for doing this is that Dask DataFrame does not support a multi-index, and I want to group by multiple columns. I therefore create a column of tuples that gives me a single index containing all the values I need. (Please ignore efficiency compared with a multi-index, since Dask DataFrame does not yet fully support it.)

When I try to unpack the tuple column with Dask using this code:

rxTable[["a","b"]] = rxTable["tup"].apply(lambda x: s(x), meta = pd.DataFrame, axis = 1)

I get this error:

AttributeError: 'Series' object has no attribute 'columns'

When I try

rxTable[["a","b"]] = rxTable["tup"].apply(dd.Series, axis = 1, meta = pd.DataFrame)

I get the same error.

How can I take a column of tuples and convert it to two columns, as I can in pandas with no problem?

Thanks

thebeancounter

2 Answers

The best approach I have found so far is to convert to a pandas DataFrame, split the column there, and then go back to Dask:

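import pandas as pd
import dask.dataframe as dd

df1 = df.compute()  # materialize the whole Dask DataFrame in memory
df1[["a","b"]] = df1["c"].apply(pd.Series)  # expand each tuple into two columns
df = dd.from_pandas(df1,npartitions=1)  # convert back to a Dask DataFrame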

This works well as long as the df fits in memory. If the df is too big for memory, you can either:
1. Compute only the wanted column, convert it into two columns, and then use merge to get the split results back into the original df.
2. Split the df into chunks, convert each chunk and append it to an HDF5 file, then use Dask to read the entire HDF5 file into a Dask DataFrame.

thebeancounter

I found this methodology works well and avoids converting the Dask DataFrame to Pandas:

df['a'] = df['tup'].str.partition(sep)[0]
df['b'] = df['tup'].str.partition(sep)[2]

where sep is whatever delimiter you were using in the column to separate the two elements.

Dirigo