
I have a dataframe like this:

import pandas as pd
df = pd.DataFrame({
    "time": [1, 2, 1, 2], 
    "site": ['a', 'a', 'b', 'b'], 
    "val1": [11, 12, 21, 22],
    "val2": [101, 102, 201, 202]
})
df.set_index(['time', 'site'], inplace=True, append=False)
df = df.unstack("site")
print(df)

     val1     val2     
site    a   b    a    b
time                   
1      11  21  101  201
2      12  22  102  202

I would like to change some values that match a boolean filter, e.g.:

ix = df.val1 > 20
print(ix)

site      a     b
time             
1     False  True
2     False  True

A natural thing to try would be df.val1[ix] = 50. This does the expected assignment, but gives a warning: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.

So now I am trying to achieve something similar using df.loc. But I can't find any way to use df.loc with this kind of boolean mask. This seems to be because I am using hierarchical columns: I don't have much trouble if I have only one set of values (val1). Unfortunately, assignments with boolean filters on hierarchical columns aren't covered very well in the docs.

I've tried referring to df.loc[:,'val1',ix], but that gives IndexingError: Too many indexers. I've tried df.loc[:,'val1'][ix] = 50, and that works but gives the SettingWithCopyWarning.

I can use df.val1 = df.val1.where(~ix, other=50), but this seems unintuitive, inefficient, and inflexible (e.g., it couldn't easily be extended to add 10 to the existing values).

Is there some other indexing method I should use to assign values to a hierarchical column of a dataframe, based on a boolean mask?

Edited to extend the question:

I didn't realize this would be an issue, but I'd actually like to filter based on values in both the val1 and val2 columns and change values in both sets of columns, something like this:

ix = (df.val1 > 20) | (df.val2 < 102)
df.val1[ix] = 50
df.val2[ix] = 150

Is there a simple indexing approach that can do this? It's pretty easy with numpy ndarrays, but seems to be much trickier with a pandas dataframe.
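
For reference, here is roughly what I mean by "easy with numpy ndarrays": a sketch using standalone arrays that mirror the val1 and val2 blocks above.

import numpy as np

# standalone arrays mirroring the val1 and val2 blocks in the dataframe
val1 = np.array([[11, 21], [12, 22]])
val2 = np.array([[101, 201], [102, 202]])

ix = (val1 > 20) | (val2 < 102)  # boolean mask, same shape as the arrays
val1[ix] = 50                    # masked assignment works directly on ndarrays
val2[ix] = 150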

Matthias Fripp

2 Answers


You can just use a list to select your column:

idx = df[['val1']] > 20

idx
Out[39]: 
       val1      
site      a     b
time             
1     False  True
2     False  True

df[idx] = 50

df
Out[41]: 
     val1     val2     
site    a   b    a    b
time                   
1      11  50  101  201
2      12  50  102  202
cncggvg
  • Thanks, this is a great answer to the question I asked. Unfortunately, I forgot to mention that I'd also like to change corresponding entries in the val2 column, something like `ix = (df.val1 > 20) | (df.val2 > 200); df.val1[ix] = 50; df.val2[ix] = 150`. Any idea how to do that? It's pretty simple with standard numpy ndarrays, but seems to be trickier in pandas. – Matthias Fripp Apr 01 '16 at 16:36

This problem arises when you first select a series (or sub-frame) from a dataframe by column name and then try to assign to it through a boolean mask. Specifically, the masked assignment gets converted internally to extracted_data.where(~mask, other=value, inplace=True), and that raises the SettingWithCopyWarning.
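
For example, my understanding of the chained sequence (a sketch of the idea, not the actual pandas internals):

s = df["val1"]   # extracts the val1 sub-frame; pandas can't tell if it's a view or a copy
s[ix] = 50       # roughly s.where(~ix, other=50, inplace=True) internally, hence the warning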

It would be very nice if pandas could just guarantee that this kind of operation will alter the original dataframe, rather than raising this warning. (And by the way, if the order of chained operations is reversed, df[ix]["val1"] = 500 or df[ix][["val1", "val2"]] = 500 give no warning but fail to update the original dataframe). Until this gets resolved, there are a couple of workarounds.

(1) Inspired by the answer from @cncggvg: construct a single index that specifies all the elements that need to be updated, rather than chaining two indexing operations together.

# create a partial index for the boolean operation
# note: this specifies the second-level columns it will act on, but not 
# the first level, since that was given unambiguously in the df[col] expression
ix = (df["val1"] > 20) | (df["val2"] < 102)
# build an index that specifies both the first and second-level columns
ix2 = pd.concat({"val1": ix}, axis=1)
# or, to do the same assignment on multiple first-level columns:
ix2 = pd.concat({"val1": ix, "val2": ix}, axis=1)
# do the assignment in one step, with no chaining
df[ix2] = 50
# or derive new values from current values
df[ix2] = df[ix2]+50
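
For reference, the combined mask ix2 (in its two-column form) covers both levels of the column index; values computed from the example data above:

print(ix2)
#        val1         val2       
# site      a     b      a     b
# time                          
# 1      True  True   True  True
# 2     False  True  False  True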

(2) Avoid the implicit series.where(..., inplace=True) by calling .where(..., inplace=False) (the default) explicitly:

ix = (df["val1"] > 20) | (df["val2"] < 102)
df["val1"] = df["val1"].where(~ix, other=50)
df["val2"] = df["val2"].where(~ix, other=50)

# or to assign both columns at once:
# note: this should work with df[["val1", "val2"]] = ..., but pandas 0.18
# doesn't realize that that gives the same set of columns as cols.columns
cols = df[["val1", "val2"]]
df[cols.columns] = cols.where(~ix, other=50)
# or with a calculation:
df[cols.columns] = cols.where(~ix, other=cols+50)
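
An equivalent spelling uses .mask(), which is the inverse of .where() and avoids the ~:

df["val1"] = df["val1"].mask(ix, other=50)
df["val2"] = df["val2"].mask(ix, other=50)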

These are both more cumbersome than I'd like, so I may just copy the relevant sections of my dataframe into numpy arrays, then work on them from there. That should have better performance anyway, according to http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/ .
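
For completeness, here is a sketch of that numpy round-trip, reusing the df and ix from above (depending on the pandas version, df.loc[:, cols.columns] = arr may be the safer spelling for the final write-back):

import numpy as np

cols = df[["val1", "val2"]]
arr = cols.values.copy()                    # detach a plain ndarray from the dataframe
mask = np.hstack([ix.values, ix.values])    # aligns with (val1,a),(val1,b),(val2,a),(val2,b)
arr[mask] = 50                              # ordinary ndarray masked assignment
df[cols.columns] = arr                      # write the whole block back in one step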

Matthias Fripp