In pandas I can add new rows by their index and forward-fill their values without filling any other nulls in the DataFrame:

import numpy as np
import pandas as pd


df = pd.DataFrame(data={"a": [1.0, 2.0, np.nan, 3.0]}, index=pd.date_range("2020", periods=4, freq="min"))
print(df)
# reindex with method="ffill" fills only the newly added index values
new_index = df.index.union(pd.date_range("2020-01-01 00:01:30", periods=2, freq="min"))
df = df.reindex(index=new_index, method="ffill")
print(df)

This gives the output:

                       a
2020-01-01 00:00:00  1.0
2020-01-01 00:01:00  2.0
2020-01-01 00:02:00  NaN
2020-01-01 00:03:00  3.0
                       a
2020-01-01 00:00:00  1.0
2020-01-01 00:01:00  2.0
2020-01-01 00:01:30  2.0
2020-01-01 00:02:00  NaN
2020-01-01 00:02:30  NaN
2020-01-01 00:03:00  3.0

Is it possible to achieve something similar using Polars? I am using Polars mainly because it has better performance for my data so far, so performance matters.

The only approach I can think of is concat -> sort -> forward fill, something like:

    let new_index_values = new_index_values.into_series().into_frame();
    let new_index_values_len = new_index_values.height();

    let mut cols = vec![new_index_values];
    let col_names = source.get_column_names();
    for col_name in col_names.clone() {
        if col_name != index_column {
            cols.push(
                Series::full_null(
                    col_name,
                    new_index_values_len,
                    source.column(col_name)?.dtype(),
                )
                .into_frame(),
            )
        }
    }

    let range_frame = hor_concat_df(&cols)?.select(col_names)?;

    concat([source.clone().lazy(), range_frame.lazy()], true, true)?
        .sort(
            index_column,
            SortOptions {
                descending: false,
                nulls_last: true,
            },
        )
        .collect()?
        .fill_null(FillNullStrategy::Forward(Some(1)))?
        .unique(Some(&[index_column.into()]), UniqueKeepStrategy::Last)

but this also fills nulls other than the ones in the newly added rows. I need to preserve the nulls in the original data, so this does not work for me.

1 Answer

I'm not familiar with Rust, so this is the Python way to do it (or at least how I would approach it).

Starting with:

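from datetime import datetime

import polars as pl

# Note: on recent Polars versions, datetime ranges are built with
# pl.datetime_range(..., eager=True) rather than pl.date_range(...)
pldf = pl.DataFrame({
    "dt": pl.date_range(datetime(2020,1,1), datetime(2020,1,1,0,3), "1m"),
    "a": [1.0, 2.0, None, 3.0]
})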

and then the rows you want to add:

new_rows = pl.DataFrame({
    "dt": pl.date_range(datetime(2020,1,1,0,1,30), datetime(2020,1,1,0,2,30), "1m")
})

All I've done is convert the pandas date_range syntax to the polars one.

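To put those together, use join_asof. With its default backward strategy, each new row takes the values from the last row of pldf whose dt is at or before it, so a null in that row stays a null. Since these frames were constructed with date_range, they're already sorted by dt, but if your real data is constructed a different way, make sure you sort both frames by dt first.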

new_rows = new_rows.join_asof(pldf, on='dt')

This gives you just the new rows with their filled values. You can then concat them with the original frame and sort by dt to get your final answer.

pldf = pl.concat([pldf, new_rows]).sort('dt')
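
As a rough illustration of what the backward as-of join is doing here, the same lookup can be sketched in plain Python with only the standard library. The helper name asof_backward and the list-based layout are made up for this sketch, and it is not meant to reflect how Polars implements the join internally. It only shows why the nulls in the original data survive: each new timestamp copies whatever sits at the last original timestamp at or before it, and nothing in the original rows is touched.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def asof_backward(index, values, new_index):
    """Hypothetical helper: for each new timestamp, return the value stored at
    the last original timestamp that is <= it (None if there is none).
    Assumes `index` is sorted ascending."""
    out = []
    for ts in new_index:
        pos = bisect_right(index, ts)  # count of original timestamps <= ts
        out.append(values[pos - 1] if pos else None)
    return out


start = datetime(2020, 1, 1)
index = [start + timedelta(minutes=i) for i in range(4)]
values = [1.0, 2.0, None, 3.0]  # the None at 00:02:00 must be preserved
new_index = [start + timedelta(seconds=90), start + timedelta(seconds=150)]

# Step 1: look up values for the new rows only (the "join_asof" step)
new_values = asof_backward(index, values, new_index)

# Step 2: append the new rows and sort by timestamp (the "concat + sort" step)
merged = sorted(zip(index + new_index, values + new_values), key=lambda row: row[0])

for ts, value in merged:
    print(ts, value)
```

The new row at 00:02:30 ends up null because the row it looks back to (00:02:00) is null, which matches the pandas reindex output in the question.
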
Dean MacGregor