
In this example I have three columns: 'DayOfWeek', 'Time' and 'Risk'.

I want to group by 'DayOfWeek', take only the first element of each group, and assign a 'High' risk to it. In other words, the earliest known hour within each day of the week gets the highest risk; all other rows are initialized to 'Low' risk.

In pandas I had an additional column for the index, but in polars I do not. I could create one artificially, but is that even necessary?

Can I do this in a smarter way with polars?

# pandas: default every row to "Low", then flag the earliest row per weekday as "High"
df['risk'] = "Low"

df = df.sort_values('Time')
df.loc[df.groupby("DayOfWeek").head(1).index, "risk"] = "High"

The index is unique in this case and runs over range(n).

Here is my solution, by the way (I don't really like it):

# add an artificial index column
df = df.with_column(pl.arange(0, df.shape[0]).alias('pseudo_index'))

# find the lowest time per day
indexes_df = df.sort('Time').groupby('DayOfWeek').head(1)

# set 'High' as the risk for every row returned by the groupby
indexes_df = indexes_df.select('pseudo_index').with_column(pl.lit('High').alias('risk'))

# a left join generates null values for every row whose 'pseudo_index' is not in indexes_df
df = df.join(indexes_df, how='left', on='pseudo_index').select([
    pl.all().exclude(['pseudo_index', 'risk']), pl.col('risk').fill_null(pl.lit('Low'))
])
– zacko

1 Answer


You can use window functions to find the rows where the first "index" value of the "DayOfWeek" group equals the "index" column.

For that we only need to set an "index" column. We can do that easily in either of two ways (sketched after this list):

  • A method: df.with_row_count(<name>)
  • An expression: pl.arange(0, pl.count()).alias(<name>)
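
For example, a quick sketch of both (assuming `df` is your DataFrame and the polars version used in this answer, where with_column is the method for adding a column):

# option 1: the dedicated method, which adds a row count column named "index"
df = df.with_row_count("index")

# option 2: the equivalent expression
df = df.with_column(pl.arange(0, pl.count()).alias("index"))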

After that we can use this predicate:

pl.first("index").over("DayOfWeek") == pl.col("index")
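
This works because `pl.first("index").over("DayOfWeek")` broadcasts the first "index" value of each "DayOfWeek" group onto every row of that group, so the comparison is true on exactly one row per group (the index is unique, so duplicate 'Time' values cannot produce multiple matches). A minimal sketch with made-up data:

toy = pl.DataFrame({
    "index":     [0, 1, 2, 3, 4],
    "DayOfWeek": [1, 2, 1, 2, 1],
})
print(toy.select(
    (pl.first("index").over("DayOfWeek") == pl.col("index")).alias("is_first")
))
# is_first: [true, true, false, false, false]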

Finally we use a when -> then -> otherwise expression to use that condition and create our new "Risk" column.
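
As a standalone sketch of that pattern (in the polars version used in this answer a bare string in then/otherwise is treated as a literal, as the output below shows; wrapping it in pl.lit makes that intent explicit and is safer in newer versions, where bare strings are parsed as column names):

pl.when(
    pl.first("index").over("DayOfWeek") == pl.col("index")
).then(
    pl.lit("High")
).otherwise(
    pl.lit("Low")
).alias("Risk")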

Example

Let's start with some data. In the snippet below I create an hourly date range, sample from it with replacement, and then derive the weekday from each timestamp.

Preparing data

from datetime import datetime

import polars as pl

df = pl.DataFrame({
    "Time": pl.date_range(datetime(2022, 6, 1), datetime(2022, 6, 30), "1h").sample(frac=1.5, with_replacement=True).sort(),
}).select([
    pl.arange(0, pl.count()).alias("index"),
    pl.all(),
    pl.col("Time").dt.weekday().alias("DayOfWeek"),
])

print(df)
shape: (1045, 3)
┌───────┬─────────────────────┬───────────┐
│ index ┆ Time                ┆ DayOfWeek │
│ ---   ┆ ---                 ┆ ---       │
│ i64   ┆ datetime[ns]        ┆ u32       │
╞═══════╪═════════════════════╪═══════════╡
│ 0     ┆ 2022-06-29 22:00:00 ┆ 3         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 2022-06-14 11:00:00 ┆ 2         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ 2022-06-11 21:00:00 ┆ 6         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 2022-06-27 20:00:00 ┆ 1         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...   ┆ ...                 ┆ ...       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1041  ┆ 2022-06-11 09:00:00 ┆ 6         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1042  ┆ 2022-06-18 22:00:00 ┆ 6         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1043  ┆ 2022-06-18 01:00:00 ┆ 6         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1044  ┆ 2022-06-23 18:00:00 ┆ 4         │
└───────┴─────────────────────┴───────────┘


Computing Risk values

df = df.with_column(
    pl.when(
        pl.first("index").over("DayOfWeek") == pl.col("index")
    ).then(
        "High"
    ).otherwise(
        "Low"
    ).alias("Risk")
).drop("index")

print(df)
shape: (1045, 3)
┌─────────────────────┬───────────┬──────┐
│ Time                ┆ DayOfWeek ┆ Risk │
│ ---                 ┆ ---       ┆ ---  │
│ datetime[ns]        ┆ u32       ┆ str  │
╞═════════════════════╪═══════════╪══════╡
│ 2022-06-29 22:00:00 ┆ 3         ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-14 11:00:00 ┆ 2         ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 21:00:00 ┆ 6         ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-27 20:00:00 ┆ 1         ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                 ┆ ...       ┆ ...  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 09:00:00 ┆ 6         ┆ Low  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 22:00:00 ┆ 6         ┆ Low  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 01:00:00 ┆ 6         ┆ Low  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-23 18:00:00 ┆ 4         ┆ Low  │
└─────────────────────┴───────────┴──────┘

– ritchie46
  • The problem is that `pl.first("Time").over("DayOfWeek") == pl.col("Time")` might match more than one row, since 'Time' contains duplicates (your example of course doesn't contain duplicates). Only one entry is allowed to be high risk. – zacko Jun 21 '22 at 09:10
  • Alright, I have updated my answer accordingly. Next time it helps to add some dummy data in the question itself that meets your constraints. – ritchie46 Jun 21 '22 at 09:40
  • My bad, will do. The example doesn't seem to work, though. The first thing that confuses me is that `pl.date_range(...).sample(...).sort()` isn't sorting at all. The result is `2022-06-17 02:00:00 2022-06-07 04:00:00 2022-06-06 23:00:00 2022-06-29 22:00:00` – zacko Jun 21 '22 at 10:10
  • I see that; it's a bug in polars. We should unset the `sorted` flag during a `sample` operation. Right now we don't sort, because we think the data is already sorted. This will be fixed in next week's release. To fix the example, use: `s = pl.date_range(datetime(2022, 6, 1), datetime(2022, 6, 30), "1h").sample(frac=1.5, with_replacement=True).set_sorted(False).sort()` – ritchie46 Jun 21 '22 at 10:21