
I have a DataFrame that looks like this:

import polars as pl

df = pl.DataFrame(
    {
        "country": ["France", "France", "UK", "UK", "Spain"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)

df

shape: (5, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ France  ┆ 2020 ┆ 1     │
│ France  ┆ 2021 ┆ 2     │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2020 ┆ 4     │
│ Spain   ┆ 2022 ┆ 5     │
└─────────┴──────┴───────┘

I'd like to make a balanced panel by creating all country-year pairs. In R, I could use tidyr::complete() for this, but I didn't find a built-in way to do it in Polars. Is there an equivalent? If not, what would be the fastest way to mimic it?

Expected output:

shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ France  ┆ 2019 ┆ null  │
│ France  ┆ 2020 ┆ 1     │
│ France  ┆ 2021 ┆ 2     │
│ France  ┆ 2022 ┆ null  │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2020 ┆ 4     │
│ UK      ┆ 2021 ┆ null  │
│ UK      ┆ 2022 ┆ null  │
│ Spain   ┆ 2019 ┆ null  │
│ Spain   ┆ 2020 ┆ null  │
│ Spain   ┆ 2021 ┆ null  │
│ Spain   ┆ 2022 ┆ 5     │
└─────────┴──────┴───────┘
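
For this two-variable case, one workaround is a pivot() + melt() round trip. A rough sketch (note that the pivoted year columns come back as strings, so a cast is needed at the end):

(
    df.pivot(values="value", index="country", columns="year", aggregate_function="first")
    .melt(id_vars="country", variable_name="year", value_name="value")
    .with_columns(pl.col("year").cast(pl.Int64))
)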


Edit: the example above is quite simple because it only has 2 variables to complete, but things get trickier with 3 variables and I don't see how to adapt the pivot() + melt() approach:

import polars as pl

df = pl.DataFrame(
    {
        "orig": ["France", "France", "UK", "UK", "Spain"],
        "dest": ["Japan", "Vietnam", "Japan", "China", "China"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)
df

shape: (5, 4)
┌────────┬─────────┬──────┬───────┐
│ orig   ┆ dest    ┆ year ┆ value │
│ ---    ┆ ---     ┆ ---  ┆ ---   │
│ str    ┆ str     ┆ i64  ┆ i64   │
╞════════╪═════════╪══════╪═══════╡
│ France ┆ Japan   ┆ 2020 ┆ 1     │
│ France ┆ Vietnam ┆ 2021 ┆ 2     │
│ UK     ┆ Japan   ┆ 2019 ┆ 3     │
│ UK     ┆ China   ┆ 2020 ┆ 4     │
│ Spain  ┆ China   ┆ 2022 ┆ 5     │
└────────┴─────────┴──────┴───────┘

While the cross-join approach works, it is much slower than tidyr::complete() (66 ms for Polars vs. 1.8 ms for tidyr::complete()):

import time

tic = time.perf_counter()
(
    df
    .select("orig")
    .unique()
    .join(df.select("dest").unique(), how="cross")
    .join(df.select("year").unique(), how="cross")
    .join(df, how="left", on=["country", "year"])
)
toc = time.perf_counter()
print(f"Lazy eval: {toc - tic:0.4f} seconds")

shape: (36, 4)
┌───────┬─────────┬──────┬───────┐
│ orig  ┆ dest    ┆ year ┆ value │
│ ---   ┆ ---     ┆ ---  ┆ ---   │
│ str   ┆ str     ┆ i64  ┆ i64   │
╞═══════╪═════════╪══════╪═══════╡
│ Spain ┆ Japan   ┆ 2021 ┆ null  │
│ Spain ┆ Japan   ┆ 2022 ┆ null  │
│ Spain ┆ Japan   ┆ 2019 ┆ null  │
│ Spain ┆ Japan   ┆ 2020 ┆ null  │
│ …     ┆ …       ┆ …    ┆ …     │
│ UK    ┆ Vietnam ┆ 2021 ┆ null  │
│ UK    ┆ Vietnam ┆ 2022 ┆ null  │
│ UK    ┆ Vietnam ┆ 2019 ┆ null  │
│ UK    ┆ Vietnam ┆ 2020 ┆ null  │
└───────┴─────────┴──────┴───────┘
Lazy eval: 0.0669 seconds
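
For reference, the same query can also be expressed through the lazy API; a sketch (not re-timed):

(
    df.lazy()
    .select("orig")
    .unique()
    .join(df.lazy().select("dest").unique(), how="cross")
    .join(df.lazy().select("year").unique(), how="cross")
    .join(df.lazy(), how="left", on=["orig", "dest", "year"])
    .collect()
)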

In R:

test <- data.frame(
  orig = c("France", "France", "UK", "UK", "Spain"),
  dest = c("Japan", "Vietnam", "Japan", "China", "China"),
  year = c(2020, 2021, 2019, 2020, 2022),
  value = c(1, 2, 3, 4, 5)
)

bench::mark(
  test = tidyr::complete(test, orig, dest, year),
  iterations = 100
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 test         1.61ms   1.81ms      496.     4.6MB     10.1

2 Answers


Perhaps there is a simpler way, but the combinations are just a cross join of the unique values.

df.select('country').unique().join(
   df.select('year').unique(),
   how = 'cross'
)
shape: (12, 2)
┌─────────┬──────┐
│ country ┆ year │
│ ---     ┆ ---  │
│ str     ┆ i64  │
╞═════════╪══════╡
│ UK      ┆ 2021 │
│ UK      ┆ 2022 │
│ UK      ┆ 2019 │
│ UK      ┆ 2020 │
│ Spain   ┆ 2021 │
│ Spain   ┆ 2022 │
│ Spain   ┆ 2019 │
│ Spain   ┆ 2020 │
│ France  ┆ 2021 │
│ France  ┆ 2022 │
│ France  ┆ 2019 │
│ France  ┆ 2020 │
└─────────┴──────┘

Which you can left join with the original:

df.select('country').unique().join(
   df.select('year').unique(),
   how = 'cross'
).join(df, how='left', on=['country', 'year'])
shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ UK      ┆ 2021 ┆ null  │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2022 ┆ null  │
│ UK      ┆ 2020 ┆ 4     │
│ France  ┆ 2021 ┆ 2     │
│ France  ┆ 2019 ┆ null  │
│ France  ┆ 2022 ┆ null  │
│ France  ┆ 2020 ┆ 1     │
│ Spain   ┆ 2021 ┆ null  │
│ Spain   ┆ 2019 ┆ null  │
│ Spain   ┆ 2022 ┆ 5     │
│ Spain   ┆ 2020 ┆ null  │
└─────────┴──────┴───────┘
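
To handle any number of columns, the same idea can be chained; a sketch of a small helper (complete is just an illustrative name, not a Polars built-in):

from functools import reduce

def complete(df: pl.DataFrame, *cols: str) -> pl.DataFrame:
    # cross join the unique values of each column to build the full grid,
    # then left join the original frame back onto it
    grid = reduce(
        lambda left, right: left.join(right, how="cross"),
        (df.select(c).unique() for c in cols),
    )
    return grid.join(df, how="left", on=list(cols))

# e.g. complete(df, "orig", "dest", "year") for the edited example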
  • Thank you for the two alternatives. My example only had two variables to combine but it's a bit trickier to implement these two alternatives with 3 or more vars. For `pivot()` and `melt()`, I simply couldn't find how to do it, and for the cross join it works but it's quite slower than `tidyr::complete()`. I edited my post with these additional details – bretauv Jul 05 '23 at 11:05
  • Ah okay, I've removed reference to the pivot answer. Does the timing difference remain if you use a larger dataset? – jqurious Jul 05 '23 at 11:28
  • I just tried with a dataset where each of the 3 vars has 291 values, so it gives 291^3 ~ 24M combinations. Your code does this in about 1.5sec but doesn't sort the data, `tidyr::complete()` does it in 2.5sec and sorts the data based on the 3 vars. If I add a `sort()` after the `join()`, the timing of your code goes up to about 5 sec – bretauv Jul 05 '23 at 11:42
  • Perhaps it's worth taking up on the [issues tracker](https://github.com/pola-rs/polars/issues/) as a feature request. If a polars specific equivalent of `complete()` existed it could probably be optimized. – jqurious Jul 05 '23 at 11:54
  • That's what I had in mind, thanks – bretauv Jul 05 '23 at 12:00
  • https://github.com/pola-rs/polars/issues/9722 – bretauv Jul 05 '23 at 12:13
  • The [unique/implode/explode approach here](https://stackoverflow.com/a/76335253/19355181) seems to be faster than the cross joins. – jqurious Jul 06 '23 at 16:37

As pointed out by @jqurious in the comments of their answer, it is faster to use .implode() and .explode() (it isn't faster on the small example I gave, but the difference is clear on larger data):

(
    df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
    .explode("orig")
    .explode("dest")
    .explode("year")
    .join(df, how="left", on=["orig", "dest", "year"])
)
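
To also get the sorted output that tidyr::complete() produces (see the comments above), a sort() over the same columns can be appended. A sketch wrapping the whole thing for an arbitrary list of columns (complete_implode is just an illustrative name):

def complete_implode(df: pl.DataFrame, *cols: str, sort: bool = True) -> pl.DataFrame:
    # implode the unique values of each column, explode them one by one to
    # build the full grid, then left join the original data back
    out = df.select(pl.col(list(cols)).unique().sort().implode())
    for c in cols:
        out = out.explode(c)
    out = out.join(df, how="left", on=list(cols))
    return out.sort(list(cols)) if sort else out

# complete_implode(df, "orig", "dest", "year")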