I have a pandas dataframe, which for simplicity I'll represent as:

Column1 Column2 Column3
0 0 0
0 0 0
0 0 0

And I have a function that transforms the data, for example:

def mutation(df, idx):
    df.iloc[idx] += 1
    return df

I'd like to hold each possible mutation applied to this dataset in a variable. For example:

var1 =

Column1 Column2 Column3
1 0 0
0 0 0
0 0 0

var2 =

Column1 Column2 Column3
0 0 0
1 0 0
0 0 0

. . .

var9 =

Column1 Column2 Column3
0 0 0
0 0 0
0 0 1

And so on. I will hold (rows x columns)^n different variables (where n is the number of mutations I am applying), and the difference between each one is small. The problem is that my mutation is in-place: all the variables share the same data, so one mutation applies to all of them.
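To make the aliasing concrete, here is a minimal sketch of the in-place behaviour. It assumes idx is a (row, column) tuple, which the examples imply but the post does not state explicitly.

```python
import pandas as pd


def mutation(df, idx):
    # Modifies the caller's frame in place and returns the same object.
    df.iloc[idx] += 1
    return df


# 3x3 frame of zeros, matching the simplified example in the question.
df = pd.DataFrame(0, index=range(3), columns=["Column1", "Column2", "Column3"])

var1 = mutation(df, (0, 0))  # assumption: idx is a (row, column) tuple
var2 = mutation(df, (1, 0))

# All three names refer to one object, so both increments show up everywhere.
same_object = var1 is var2 is df
cell_00 = int(var2.iloc[0, 0])
print(same_object, cell_00)
```

Because every variable is just another name for the original frame, var2 also carries the increment that was meant only for var1.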

Instead, I can mutate with a deepcopy:

def immutable_mutation(df, idx):
    df = df.copy(deep=True)
    df.iloc[idx] += 1

    return df

The problem is that it creates (rows x columns)^n duplicates of my data instead of "just" holding the initial dataset and its mutations.

My question is: is there any way to apply these mutations in an immutable way that does not require a deepcopy?

I am willing to migrate to Polars/Spark or any other library for that matter.

  • Is the mutation always between 0 and 1, or do you have other numbers? – Corralien Mar 26 '23 at 19:17
  • This is a simplification, as the actual system is too complex to share here; the data itself is mostly (categorical) strings. It can be encoded using integers, but it's not binary. Also, there are many such mutations, which also means that n gets big quickly. – Null Terminator Mar 26 '23 at 19:50

1 Answer

With Polars, you can use .with_columns(), which returns a new dataframe instead of modifying the original.

The df[...] = assignment syntax does currently exist, but it mutates the dataframe in place.

frames = [
   df.with_columns(
      column.set_at_idx(idx, column[idx] + 1)
   )
   for idx in range(df.height)
   for column in df
]
>>> len(frames)
9
>>> frames[0]
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 │
│ ---     ┆ ---     ┆ ---     │
│ f32     ┆ f32     ┆ f32     │
╞═════════╪═════════╪═════════╡
│ 1.0     ┆ 0.0     ┆ 0.0     │
│ 0.0     ┆ 0.0     ┆ 0.0     │
│ 0.0     ┆ 0.0     ┆ 0.0     │
└─────────┴─────────┴─────────┘
>>> frames[-1]
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 │
│ ---     ┆ ---     ┆ ---     │
│ f32     ┆ f32     ┆ f32     │
╞═════════╪═════════╪═════════╡
│ 0.0     ┆ 0.0     ┆ 0.0     │
│ 0.0     ┆ 0.0     ┆ 0.0     │
│ 0.0     ┆ 0.0     ┆ 1.0     │
└─────────┴─────────┴─────────┘

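The .set_at_idx docs say its usage is an anti-pattern and that when/then should be preferred. (Note that later Polars releases renamed some of the calls used here, for example .set_at_idx() to .scatter() and .keep_name() to .name.keep().)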

frames = [
   df.with_columns(
      pl.when(pl.arange(0, pl.count()) == N)
        .then(pl.col(name) + 1)
        .otherwise(pl.col(name))
        .keep_name()
   )
   for N in range(df.height)
   for name in df.columns
]
jqurious
  • Thanks for the reply! Unfortunately my knowledge of Polars is exactly nil. How does it work under the hood? Does it copy the data, or does it have a more complicated model that shares most of the memory? – Null Terminator Mar 26 '23 at 21:21
  • @NullTerminator "Copy on write on steroids": https://stackoverflow.com/a/73934361 – jqurious Mar 26 '23 at 21:26
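The general idea of sharing untouched columns between frames can be illustrated with a small pure-Python sketch. This is only a conceptual model and not Polars' actual implementation: the "frame" here is a plain dict of immutable tuples, and with_cell_incremented is a hypothetical helper invented for the illustration.

```python
# Conceptual sketch only: a "frame" is a dict mapping column names to
# immutable tuples. This is an analogy, not how Polars works internally.


def with_cell_incremented(frame, column, row):
    """Return a new frame in which a single cell is incremented by 1."""
    old = frame[column]
    # Only the affected column is rebuilt.
    new_column = old[:row] + (old[row] + 1,) + old[row + 1:]
    # Shallow copy: the new dict holds references to the existing columns.
    new_frame = dict(frame)
    new_frame[column] = new_column
    return new_frame


base = {name: (0, 0, 0) for name in ("Column1", "Column2", "Column3")}

# One new frame per cell, in the same order as the answer's comprehension.
frames = [
    with_cell_incremented(base, name, row)
    for row in range(3)
    for name in base
]

# Untouched columns are the very same objects as in the base frame.
shares_untouched = frames[0]["Column2"] is base["Column2"]
print(len(frames), frames[0]["Column1"], shares_untouched)
```

In this model each mutation costs one rebuilt column plus a handful of references rather than a full copy of the data, and the base frame is never modified.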