
Python newbie here. Imagine a csv file that looks something like this:

[Screenshot of a sample CSV: a Person column of names alongside a data column of values.]

(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)

What I want to do is randomly flag 10% of each Person's rows and mark this in a new column. I came up with a ridiculously convoluted way to do this--it involved creating a helper column of random numbers and all sorts of unnecessarily complicated jiggery-pokery. It worked, but was crazy. More recently, I came up with this:

import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''

names = list(df['Person'].unique())  # gets list of unique names

for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp) / 10)  # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!'  # a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how='left', on=['Person', 'data'])
    df['temp'] = [f"{a} {b}" for a, b in zip(df['selected_x'], df['selected_y'])]
    # Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    # df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person', 'data', 'temp']]
    df = df.rename(columns={'temp': 'selected'})

df['selected'] = df['selected'].str.replace('nan', '').str.strip()  # cleans up the column

As you can see, essentially I'm pulling out a temporary DataFrame for each Person, using DF.sample(number) to do the randomising, then using DF.merge to get the 'marked' rows back into the original DataFrame. And it involved iterating through a list to create each temporary DataFrame...and my understanding is that iterating is kind of lame.

There's got to be a more Pythonic, vectorising way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice much appreciated.

EDIT: Here's another way that avoids merge...but it's still pretty clunky:

import pandas as pd
import numpy as np
import math

# SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data': z})
df = df.sample(frac=1)  # shuffle (optional--just to show order doesn't matter)
percent = 10  # CHANGE AS NEEDED

# Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])

# CREATE A HELPER LIST of [person, row_count] pairs
helper = df.groupby('persons')['rand'].count().reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons', 'rand']]
    lim = math.ceil(len(df_temp) * percent * 0.01)
    row.append(df_temp.nlargest(lim, 'rand').iloc[-1]['rand'])  # per-person cutoff value

def flag(name, num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'

df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
P E

3 Answers


You could use groupby.sample, either to pick out a sample of the whole dataframe for further processing, or to identify rows of the dataframe to mark if that's more convenient.

import pandas as pd

percentage_to_flag = 0.5

# Toy data: 8 rows, persons A and B.
df = pd.DataFrame(data={'persons':['A']*4 + ['B']*4, 'data':range(8)})
#   persons  data
# 0       A     0
# 1       A     1
# 2       A     2
# 3       A     3
# 4       B     4
# 5       B     5
# 6       B     6
# 7       B     7

# Pick out random sample of dataframe.
random_state = 41  # Change to get different random values.
df_sample = df.groupby("persons").sample(frac=percentage_to_flag,
                                         random_state=random_state)
#   persons  data
# 1       A     1
# 2       A     2
# 7       B     7
# 6       B     6

# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
#   persons  data  marked
# 0       A     0   False
# 1       A     1    True
# 2       A     2    True
# 3       A     3   False
# 4       B     4   False
# 5       B     5   False
# 6       B     6    True
# 7       B     7    True

If you really do not want the sub-sampled dataframe df_sample you can go straight to marking a sample of the original dataframe:

# Mark random sample in original dataframe with minimal intermediate data.
df["marked2"] = False
df.loc[df.groupby("persons")["data"].sample(frac=percentage_to_flag,
                                            random_state=random_state).index,
       "marked2"] = True
#   persons  data  marked  marked2
# 0       A     0   False    False
# 1       A     1    True     True
# 2       A     2    True     True
# 3       A     3   False    False
# 4       B     4   False    False
# 5       B     5   False    False
# 6       B     6    True     True
# 7       B     7    True     True
TMBailey
  • Thanks, sounds promising but I get "AttributeError: 'DataFrameGroupBy' object has no attribute 'sample'". Maybe I can tweak to get past this... – P E Oct 19 '21 at 06:38
  • Might be a pandas version issue. Maybe see [this previous question](https://stackoverflow.com/questions/36390406/sample-each-group-after-pandas-groupby). – TMBailey Oct 19 '21 at 06:49
  • Yes, I had to change it to `df_sample = df.groupby("persons")` followed by `df_sample = df_sample.apply(lambda x: x.sample(frac=percentage_to_flag, random_state=random_state))`...and that now works. But now I'm getting stuck further down: `KeyError: "None of [MultiIndex([('A', 1), ('A', 2), ('B', 5), ('B', 6)], names=['persons', None])] are in the [index]"`, so I have to figure that out...but I think I can fix it by adding: `df_sample = df_sample.reset_index(level=0, drop=True)` – P E Oct 19 '21 at 08:35

If I understood you correctly, you can achieve this using:

import pandas as pd

df = pd.DataFrame(data={'persons': ['A']*10 + ['B']*10, 'col_1': [2]*20})
percentage_to_flag = 0.5
a = df.groupby(['persons'])['col_1'].apply(
    lambda x: pd.Series(x.index.isin(
        x.sample(frac=percentage_to_flag, random_state=5, replace=False).index))
).reset_index(drop=True)
df['flagged'] = a

Input:

       persons  col_1
    0        A      2
    1        A      2
    2        A      2
    3        A      2
    4        A      2
    5        A      2
    6        A      2
    7        A      2
    8        A      2
    9        A      2
    10       B      2
    11       B      2
    12       B      2
    13       B      2
    14       B      2
    15       B      2
    16       B      2
    17       B      2
    18       B      2
    19       B      2

Output with 50% flagged rows in each group:

     persons  col_1  flagged
0        A      2    False
1        A      2    False
2        A      2     True
3        A      2    False
4        A      2     True
5        A      2     True
6        A      2    False
7        A      2     True
8        A      2    False
9        A      2     True
10       B      2    False
11       B      2    False
12       B      2     True
13       B      2    False
14       B      2     True
15       B      2     True
16       B      2    False
17       B      2     True
18       B      2    False
19       B      2     True
Muhammad Hassan
  • Thanks...this seems very much on the right track, but that code doesn't give me the output shown above. Instead, 'a' is a series containing 2 ndarrays, and the df's new 'flagged' column contains 1 array from the series in the first row, the second row contains the 2nd array, and the remaining rows=nan. (For example, the first row in df column 'flagged' = [True False False False True False False False True False]--presumably this will change every time it's run because of the x.sample.) Not sure why it's behaving differently for me, but this does seem like it's getting very close... – P E Oct 17 '21 at 08:55
  • You can set `random_state = any_integer` inside `x.sample()` if you want to generate same random flags each time you run the code – Muhammad Hassan Oct 17 '21 at 09:18
  • As for the other issue, sorry there was a small mistake in the code, I have edited that as well. You can try it now, it will generate same output for you as I have set random state as well – Muhammad Hassan Oct 17 '21 at 09:23
  • Thanks...this seems to work for the sample data you created, but I don't think it quite works when I change the data. For example: y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281 /// z = ['xyz'] * len(y) /// df = pd.DataFrame({'persons': y, 'data' : z}) – P E Oct 19 '21 at 05:39
  • `a = df.groupby(['persons'])['data'].apply(lambda x: pd.Series(x.index.isin(x.sample(frac=percentage_to_flag, random_state= 5, replace=False).index))).reset_index(drop=True) ` and then `df['flagged'] = a`. Its working fine for me – Muhammad Hassan Oct 19 '21 at 07:00
  • It's close but not quite close enough--sometimes it undershoots or overshoots. Here's what I get when set at 10%: Alex 10.00%, Bob 9.56%, Chuck 11.38%, Doug 10.04%. I really need ALL to be at least 10% (but not overshoot either--one of these is >11%.) – P E Oct 19 '21 at 07:49

This is TMBailey's answer, tweaked so it works in my Python version. (Didn't want to edit someone else's answer but if I'm doing it wrong I'll take this down.) This works really great and really fast!

EDIT: I've updated this based on an additional suggestion by TMBailey to replace frac=percentage_to_flag with n=math.ceil(percentage_to_flag * len(x)). This ensures that rounding doesn't pull the sampled percentage under the 'percentage_to_flag' threshold. (For what it's worth, you can replace it with frac=(math.ceil(percentage_to_flag * len(x)))/len(x) too.)

import pandas as pd
import math

percentage_to_flag = .10

# Toy data:
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123  + ['Chuck'] * 2012 + ['Bob'] * 9281 
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #optional shuffle, just to show order doesn't matter

# Pick out random sample of dataframe.
random_state = 41  # Change to get different random values.
df_sample = df.groupby("persons").apply(
    lambda x: x.sample(n=math.ceil(percentage_to_flag * len(x)),
                       random_state=random_state))
# had to use lambda in the line above
df_sample = df_sample.reset_index(level=0, drop=True)  # had to add this to flatten the multi-index

# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True

And then to check:

pp = df.pivot_table(index="persons", columns="marked", values="data",
                    aggfunc='count', fill_value=0)
pp.columns = ['no', 'yes']
pp = pp.append(pp.sum().rename('Total')).assign(Total=lambda d: d.sum(1))
pp['% selected'] = 100 * pp.yes / pp.Total
print(pp)

OUTPUT:

            no   yes  Total  % selected
persons
Alex      2088   233   2321   10.038776
Bob       8352   929   9281   10.009697
Chuck     1810   202   2012   10.039761
Doug     30710  3413  34123   10.002051
Eddie      788    88    876   10.045662
Total    43748  4865  48613   10.007611

Works like a charm.
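A side note (mine, not from the thread): if your pandas version supports it, passing `group_keys=False` to `groupby` makes `apply` keep the original row index, so the `reset_index(level=0, drop=True)` step can be dropped. A minimal sketch, reusing `df`, `percentage_to_flag`, and `random_state` from the code above:

import math

# group_keys=False keeps the original index on the apply output,
# so no reset_index is needed before marking rows.
df_sample = (df.groupby("persons", group_keys=False)
               .apply(lambda x: x.sample(n=math.ceil(percentage_to_flag * len(x)),
                                         random_state=random_state)))
df["marked"] = False
df.loc[df_sample.index, "marked"] = True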

P E
  • When you call `x.sample` instead of passing `frac` you could pass something like `n=math.ceil(percentage_to_flag * len(x))`. – TMBailey Oct 19 '21 at 17:35
  • Yes, that's brilliant! Instead of `frac=percentage_to_flag` I can use `frac=(math.ceil(percentage_to_flag * len(x)))/len(x)` and every one of them comes in at >10%! Thanks--I'll edit/update the code ... – P E Oct 19 '21 at 20:47
  • When I looked at the help for `df.sample` it described a parameter `n` that can be given instead of `frac`. When I looked at the source code, when `frac` is given it is used to calculate `n` anyway (`n = round(frac * axis_length)`). So if you wanted you could call `x.sample(n=...)`. Unless there's another issue with pandas versions? – TMBailey Oct 20 '21 at 08:55
  • Oh yes, I see what you mean! I gather `n` is a number and `frac` is a proportion or fraction--so using `n` is simpler, at least in this situation. Thanks again! – P E Oct 20 '21 at 13:04
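To make the `round`-versus-`ceil` distinction in the comments above concrete, here is a small illustrative snippet (mine, not from the thread) using the group sizes from the test data:

import math

frac = 0.10
for size in (876, 2012, 2321, 9281, 34123):
    n_round = round(frac * size)     # what sample(frac=...) computes internally
    n_ceil = math.ceil(frac * size)  # the fix suggested in the comments
    print(f"{size:>6} rows: round -> {n_round} ({100 * n_round / size:.3f}%), "
          f"ceil -> {n_ceil} ({100 * n_ceil / size:.3f}%)")

With `round`, groups like 2012 and 2321 rows come in just under 10%; with `ceil`, every group lands at or above the target.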