I have a dask Series with a known categorical dtype. I want to create a small DataFrame that shows the category-to-code mapping without having to compute the entire series. How do I achieve this?

import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import Categorizer

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df = dd.from_pandas(df, npartitions=2)
df = Categorizer().fit_transform(df)

test = df['species']

The code above creates a categorical series in dask. Using test.cat.codes, I can convert the categories into codes, as shown below:


> test.compute()
Out[5]: 
0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: category
Categories (3, object): [setosa, versicolor, virginica]

> test.cat.codes.compute()
Out[6]: 
0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Length: 150, dtype: int8

The desired outcome is a mapping from the categories to the codes, as shown below, without calling compute until the very end.

Desired output:

Category      Code
setosa        0
versicolor    1
virginica     2

I have tried many approaches, but they all require converting the series into a pandas Series or DataFrame, which defeats the purpose of using dask. I haven't found anything in dask that does this without repartitioning, which I want to avoid. Also note that while the example uses a DataFrame for setup, I do not actually have access to the original DataFrame, so the solution needs to start from the series "test".

WolVes

1 Answer

How about the following:

category_mapping = dd.concat([test, test.cat.codes], axis=1)
category_mapping.columns = ["Category", "Code"]
category_mapping = category_mapping.drop_duplicates()
print(category_mapping.compute())

which would give you:

       Category  Code
0        setosa     0
50   versicolor     1
100   virginica     2
BStadlbauer
  • Note above that I do not have access to the original dataframe and only the dd.Series itself. Your first line maintains a DataFrame structure with the double brackets. As a result, your second line of code does not work in the desired solution, as I only have the Series which cannot be appended to like the dataframe. – WolVes Nov 16 '20 at 19:03
  • @WolVes, that shouldn't actually matter, that was more of a convenience - I have edited my answer so that it works with the `test` (dask) series only – BStadlbauer Nov 16 '20 at 19:10
  • Haha, you say that, but the part I was struggling with was trying to get it back into a DF! IDK why I didn't think about concat. So simple! Thanks for your help @BStadlbauer – WolVes Nov 16 '20 at 19:20
  • No worries, you are welcome! You could also use `dd.Series.to_frame()` if you just want to convert a Series into a DataFrame – BStadlbauer Nov 16 '20 at 19:24