
I'm trying to join two DataFrames on categorical features using the lazy API. I tried to do it the way it is described in the user guide (https://pola-rs.github.io/polars-book/user-guide/performance/strings.html):

count = admin_df.groupby(['admin','EVENT_DATE']).pivot(pivot_column='FIVE_TYPE',values_column='count').first().lazy()
fatalities = admin_df.groupby(['admin','EVENT_DATE']).pivot(pivot_column='FIVE_TYPE',values_column='FATALITIES').first().lazy()
fatalities = fatalities.with_column(pl.col("admin").cast(pl.Categorical))
count = count.with_column(pl.col("admin").cast(pl.Categorical))
admin_df = fatalities.join(count,on=['admin','EVENT_DATE']).collect()

but I get the following error:

Traceback (most recent call last):
  File "country_level.py", line 33, in <module>
    country_level('/c/Users/Sebastian/feast/fluent_sunfish/data/ACLED_geocoded.parquet')
  File "country_level.py", line 10, in country_level
    country_df=aggregate_by_date(df)
  File "country_level.py", line 29, in aggregate_by_date
    admin_df = fatalities.join(count,on=['admin','EVENT_DATE']).collect()
  File "/home/sebastian/.local/lib/python3.8/site-packages/polars/internals/lazy_frame.py", line 293, in collect
    return pli.wrap_df(ldf.collect())
RuntimeError: Any(ValueError("joins on categorical dtypes can only happen if they are created under the same global string cache"))

With `with pl.StringCache():` everything works fine, although the user guide says it isn't needed if you use the lazy API. Am I missing something, or is this a bug?

seb2704
  • I was able to reproduce this using the example code you referred to (including a `.collect()`, which is not in the example but is needed to trigger the error), and it definitely seems like a bug. I have filed a report at https://github.com/pola-rs/polars/issues/1993. I would suggest using `with pl.StringCache()` for the time being. – jvz Dec 05 '21 at 15:30

1 Answer


The user guide is incorrect. You need to set a global string cache.

You can set a global string cache with `with pl.StringCache():` or with `pl.Config.set_global_string_cache`.


import polars as pl
pl.Config.set_global_string_cache()

lf1 = pl.DataFrame({
    "a": ["foo", "bar", "ham"], 
    "b": [1, 2, 3]
}).lazy()
lf2 = pl.DataFrame({
    "a": ["foo", "spam", "eggs"], 
    "c": [3, 2, 2]
}).lazy()

lf1 = lf1.with_column(pl.col("a").cast(pl.Categorical))
lf2 = lf2.with_column(pl.col("a").cast(pl.Categorical))

lf1.join(lf2, on="a", how="inner").collect()

Outputs:

shape: (1, 3)
┌───────┬─────┬─────┐
│ a     ┆ b   ┆ c   │
│ ---   ┆ --- ┆ --- │
│ cat   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ "foo" ┆ 1   ┆ 3   │
└───────┴─────┴─────┘

ritchie46