
I'm trying to use dask to process a dataset which does not fit into memory. It's time-series data for various "IDs". After reading the dask documentation, I chose the parquet file format, partitioned by "ID".

However, after reading back from parquet and setting the index, I ran into a "TypeError: to union ordered Categoricals, all categories must be the same", which I did not manage to solve by myself.

This code replicates the issue I'm having:

import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback

# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]

# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"])
df = df.reset_index().rename(columns={"index": "ID"})
# serialize to parquet, partitioned by ID
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)

try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())

At this point I get the following error:

~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
   1492         if not self.ordered:
   1493             raise TypeError(
-> 1494                 f"Categorical is not ordered for operation {op}\n"
   1495                 "you can use .as_ordered() to change the "
   1496                 "Categorical to an ordered one\n"

TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one

I then did:

# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")

And when I try df.compute(scheduler="processes"), I get the TypeError I mentioned before:

try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())

This gives:

Traceback (most recent call last):
  File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
    aa = df.compute(scheduler=schd_str)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
    return _concat(results)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
    else methods.concat(args2, uniform=True, ignore_index=ignore_index)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
    ignore_index=ignore_index,
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
    ind = concat([df.index for df in dfs])
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
    ignore_index=ignore_index,
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
    return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
  File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
    raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same

Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
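
For completeness, here is a minimal sketch of how I compared the three schedulers on the same indexed dataframe (only "processes" fails for me):

for schd_str in ("synchronous", "threads", "processes"):
    try:
        df.compute(scheduler=schd_str)
        print(f"{schd_str}: OK")
    except Exception:
        print(f"{schd_str}: KO")
        print(traceback.format_exc())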

However, skipping the index does not seem like a real option, since I'm actually trying to merge several of these datasets, and I thought that setting the index would result in a speed-up compared to not setting one. (I get the exact same error when trying to merge two dataframes indexed this way; see the sketch below.)
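
For reference, the merge I'm attempting looks roughly like this (the second file bar.pq is hypothetical here; it is written, read, and indexed exactly like the first one):

# hypothetical second dataset, prepared exactly like the first
df2 = dd.read_parquet(r"C:/temp/bar.pq")
df2.ID = df2.ID.cat.as_ordered()
df2 = df2.set_index("ID")

# merging on the categorical index and computing with the "processes"
# scheduler raises the same TypeError for me
merged = dd.merge(df, df2, left_index=True, right_index=True)
merged.compute(scheduler="processes")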

I inspected df._meta, and it turns out my categories are "known", as they should be (see the dask documentation on categoricals).
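
This is roughly how I checked (just printing the pandas metadata dask keeps for the index):

# inspect the metadata dask keeps for the (categorical) index
print(df._meta.index)             # should show a CategoricalIndex whose categories are the four IDs
print(df._meta.index.categories)  # categories are listed, i.e. they look "known"
print(df._meta.index.ordered)     # True after the .as_ordered() call above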

I also read a GitHub post about something that looked similar, but somehow did not find a solution there.

Thanks for your help,

Kalendil
