3

I have a CSV file with a size of 28 GB, which I want to plot. Those are obviously far too many data points, so how can I reduce the data? I would like to merge about 1000 data points into one by calculating the mean. This is the structure of my DataFrame:

┌─────────────────┬────────────┐
│ Time in seconds ┆ Force in N │
│ ---             ┆ ---        │
│ f64             ┆ f64        │
╞═════════════════╪════════════╡
│ 0.0             ┆ 2310.18    │
│ 0.0005          ┆ 2313.23    │
│ 0.001           ┆ 2314.14    │
└─────────────────┴────────────┘

I thought about using groupby_dynamic and then calculating the mean of each group, but this only seems to work with datetimes? The time in seconds is given as a float, however.

Jan
  • 157
  • 9

2 Answers

1

You can also group by an integer column to create groups of size N:

In case of a groupby_dynamic on an integer column, the windows are defined by:

"1i"  # length 1

"10i" # length 10

We can use .int_range() to add an integer row count to group on:

import polars as pl

df = pl.DataFrame({"force": ["A", "B", "C", "D", "E", "F", "G"]})

(df.with_columns(row_nr = pl.int_range(0, pl.count()))  # add a 0..n-1 row index
   .groupby_dynamic(
      index_column = "row_nr",
      every = "2i"  # window length of 2 rows
   )
   .agg("force")
)
shape: (4, 2)
┌────────┬────────────┐
│ row_nr ┆ force      │
│ ---    ┆ ---        │
│ i64    ┆ list[str]  │
╞════════╪════════════╡
│ 0      ┆ ["A", "B"] │
│ 2      ┆ ["C", "D"] │
│ 4      ┆ ["E", "F"] │
│ 6      ┆ ["G"]      │
└────────┴────────────┘
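
For the question's data, a minimal sketch of the same idea with mean aggregation (the column names are taken from the sample above; every = "1000i" averages 1000 rows per group, and the helper row_nr column is dropped afterwards):

import polars as pl

downsampled = (
    df.with_columns(row_nr = pl.int_range(0, pl.count()))
      .groupby_dynamic(index_column = "row_nr", every = "1000i")
      .agg(
          pl.col("Time in seconds").mean(),
          pl.col("Force in N").mean(),
      )
      .drop("row_nr")
)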
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • What is `pl`? `pandas.DataFrame` has no attribute `with_columns` – jlgarcia Aug 23 '23 at 08:19
  • @jlgarcia [polars](https://github.com/pola-rs/polars) – jqurious Aug 23 '23 at 08:20
  • This worked really well, thank you! However, my kernel crashes when using the full dataset (probably due to memory?) using chunks of 1000i. Is there a way to do this more efficiently? – Jan Aug 23 '23 at 08:33
  • 1
    solved this by using `pl.read_csv_batched()`, which had a small learning curve itself :D Works great now though :) – Jan Aug 23 '23 at 13:09
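
For reference, a minimal sketch of the batched approach Jan mentions, assuming a placeholder file name data.csv and an arbitrary batch count of 10; it downsamples each batch as it arrives so the full file is never in memory. Note that the last group of a batch may average fewer than 1000 rows where a batch boundary falls mid-group:

import polars as pl

reader = pl.read_csv_batched("data.csv")  # placeholder path for the 28 GB file
chunks = []
while (batches := reader.next_batches(10)) is not None:
    for batch in batches:
        # downsample each batch before keeping it, so memory use stays bounded
        chunks.append(
            batch.with_columns(row_nr = pl.int_range(0, pl.count()))
                 .groupby_dynamic(index_column = "row_nr", every = "1000i")
                 .agg(pl.exclude("row_nr").mean())
                 .drop("row_nr")
        )
df_small = pl.concat(chunks)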
0

Another version, assuming a pandas DataFrame, that reduces the DataFrame's shape with an N-point mean:

# 1000 = how many points you want to aggregate into one
# (assumes the default integer RangeIndex 0, 1, 2, ...)
s = (df.index.to_series() / 1000).astype(int)  # group label: 0 for rows 0-999, 1 for rows 1000-1999, ...
df.groupby(s).mean()
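
For instance, a quick demo using the sample rows from the question with N = 2:

import pandas as pd

df = pd.DataFrame({"Time in seconds": [0.0, 0.0005, 0.001],
                   "Force in N": [2310.18, 2313.23, 2314.14]})

s = (df.index.to_series() / 2).astype(int)  # labels: 0, 0, 1
print(df.groupby(s).mean())  # row 0 averages the first two rows; row 1 is the last row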
jlgarcia
  • 333
  • 6