1

I'm working on a data suppression script in python where I need to 1) suppress small values (between 1 and 5) and 2) make sure that there are at least 2 values suppressed at the smallest level of aggregation. I've done the first step, replacing small values with -1 (which I'll later recode to "s"). And I created a new helper column that counts how many suppressed values there are per row ('sup_cnt'). That yields something like this:

Subgroup   cat1    cat2    cat3    sup_cnt
Group1      0      -1       0       1
Group2     -1      22       6       1
Group3     -1      14      -1       2
Group4     -1      -1       0       2

data = {'group':['group1','group2','group3','group4'],'cat1':[0,-1,-1,-1],'cat2':[-1,22,14,-1],'cat3':[0,0,-1,0],'sup_cnt':[1,1,2,3]}
df = pd.DataFrame(data)

So for Group1 and Group2, which only have one value suppressed, I want a second value -- the lowest (including zeroes) -- to be replaced with -1. In Group1, one of the zeroes would be replaced; in Group2, 6 would be replaced. So the result would be like this:

Subgroup   cat1    cat2    cat3    sup_cnt
Group1     -1      -1       0       1
Group2     -1      22      -1       1
Group3     -1      14      -1       2
Group4     -1      -1       0       2

If there are more than one columns with the same lowest value (like with Group1, which has 2 zeroes), I only want one of those to be replaced (doesn't matter which).

Originally started this in R and switched to python/pandas (but I'm new to pandas). My idea was to write a function that takes the cat values as arguments, determines the minimum non-negative integer among those, loops through the data columns in a row and replaces the first instance of that min value in the row, then breaks. Not sure if that's the right approach though (or exactly how to carry it out). Any ideas?

s_bklyn
  • 77
  • 1
  • 6

1 Answers1

1

I hope I've understood your question right:

def fn(x):
    cols = x.filter(regex=r"^cat")
    x = cols[cols >= 0].sort_values()[: 2 - x["sup_cnt"]]
    df.loc[x.name, x.index] = -1


df[df.sup_cnt < 2].apply(fn, axis=1)
print(df)

Prints:

  Subgroup  cat1  cat2  cat3  sup_cnt
0   Group1    -1    -1     0        1
1   Group2    -1    22    -1        1
2   Group3    -1    14    -1        2
3   Group4    -1    -1     0        2
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Thank you @Andrej Kesely. This worked perfectly. The only change I made is `df[df.sup_cnt == 1]` so that it wouldn't run on rows with no suppression at all (that prob wasn't clear from me). The only part I don't quite grasp is: `[: 2 - x["sup_cnt"]]` – s_bklyn Nov 23 '22 at 22:17
  • @s_bklyn This will get first 1 or 2 elements from the sorted values. But if you changed the `df.sup_cnt == 1` this will get first element every time. So you can change it to `[:1]` – Andrej Kesely Nov 23 '22 at 22:22
  • 1
    Oh I see how it works now--if it was sup_cnt == 0, it would pull first two and if ==1, just the first. In my case I don't need any suppression if no values were <6. Thanks again. – s_bklyn Nov 23 '22 at 23:31