3

I have a CSV file with a size of 28 GB, which I want to plot. Those are obviously far too many data points, so how can I reduce the data? I would like to merge about 1000 data points into one by calculating the mean. This is the structure of my DataFrame:

┌─────────────────┬────────────┐
│ Time in seconds ┆ Force in N │
│ ---             ┆ ---        │
│ f64             ┆ f64        │
╞═════════════════╪════════════╡
│ 0.0             ┆ 2310.18    │
│ 0.0005          ┆ 2313.23    │
│ 0.001           ┆ 2314.14    │
└─────────────────┴────────────┘

I thought about using groupby_dynamic and then calculating the mean of each group, but this only seems to work with datetimes? The time in seconds is given as a float, however.

Jan
  • 157
  • 9

2 Answers

1

You can also group by an integer column to create groups of size N:

In case of a groupby_dynamic on an integer column, the windows are defined by:

"1i"  # length 1

"10i" # length 10

We can use .int_range() to add an integer row count to group on:

import polars as pl

df = pl.DataFrame({"force": ["A", "B", "C", "D", "E", "F", "G"]})

(df.with_columns(row_nr = pl.int_range(0, pl.count()))  # add a 0..n-1 row index
   .groupby_dynamic(
      index_column = "row_nr",
      every = "2i"  # window length of 2 rows
   )
   .agg("force")
)
shape: (4, 2)
┌────────┬────────────┐
│ row_nr ┆ force      │
│ ---    ┆ ---        │
│ i64    ┆ list[str]  │
╞════════╪════════════╡
│ 0      ┆ ["A", "B"] │
│ 2      ┆ ["C", "D"] │
│ 4      ┆ ["E", "F"] │
│ 6      ┆ ["G"]      │
└────────┴────────────┘
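
For the question's data, a minimal sketch of the same idea with mean aggregation (the column names are taken from the sample above; every = "1000i" averages 1000 rows per group, and the helper row_nr column is dropped afterwards):

import polars as pl

downsampled = (
    df.with_columns(row_nr = pl.int_range(0, pl.count()))
      .groupby_dynamic(index_column = "row_nr", every = "1000i")
      .agg(
          pl.col("Time in seconds").mean(),
          pl.col("Force in N").mean(),
      )
      .drop("row_nr")
)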
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • What is `pl`? `pandas.DataFrame` has no attribute `with_columns` – jlgarcia Aug 23 '23 at 08:19
  • @jlgarcia [polars](https://github.com/pola-rs/polars) – jqurious Aug 23 '23 at 08:20
  • This worked really well, thank you! However, my kernel crashes when using the full dataset (probably due to memory?) using chunks of 1000i. Is there a way to do this more efficiently? – Jan Aug 23 '23 at 08:33
  • 1
    solved this by using `pl.read_csv_batched()`, which had a small learning curve itself :D Works great now though :) – Jan Aug 23 '23 at 13:09
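
For reference, a minimal sketch of the batched approach Jan mentions, assuming a placeholder file name data.csv and an arbitrary batch count of 10; it downsamples each batch as it arrives so the full file is never in memory. Note that the last group of a batch may average fewer than 1000 rows where a batch boundary falls mid-group:

import polars as pl

reader = pl.read_csv_batched("data.csv")  # placeholder path for the 28 GB file
chunks = []
while (batches := reader.next_batches(10)) is not None:
    for batch in batches:
        # downsample each batch before keeping it, so memory use stays bounded
        chunks.append(
            batch.with_columns(row_nr = pl.int_range(0, pl.count()))
                 .groupby_dynamic(index_column = "row_nr", every = "1000i")
                 .agg(pl.exclude("row_nr").mean())
                 .drop("row_nr")
        )
df_small = pl.concat(chunks)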
0

Another version, assuming a pandas DataFrame, that reduces the DataFrame's shape with an N-point mean:

# 1000 = how many points you want to aggregate into one
# (assumes the default integer RangeIndex 0, 1, 2, ...)
s = (df.index.to_series() / 1000).astype(int)  # group label: 0 for rows 0-999, 1 for rows 1000-1999, ...
df.groupby(s).mean()
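
For instance, a quick demo using the sample rows from the question with N = 2:

import pandas as pd

df = pd.DataFrame({"Time in seconds": [0.0, 0.0005, 0.001],
                   "Force in N": [2310.18, 2313.23, 2314.14]})

s = (df.index.to_series() / 2).astype(int)  # labels: 0, 0, 1
print(df.groupby(s).mean())  # row 0 averages the first two rows; row 1 is the last row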
jlgarcia
  • 333
  • 6