I am fairly new to Python and would like to sample sets of data from the following dataframe by group, without selecting the same group twice. The code I have written samples the sets of data correctly, but it can select the same group twice.

Please note: the following is test data. The actual data I am running the code on is much larger, so using indexes will not be possible.

DATA:

import pandas as pd
d={'group': ['A','A','A','B','B','B','C','C','C','D','D','D','E','E','E'], 'number': [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],'weather':['hot','hot','hot','cold','cold','cold','hot','hot','hot','cold','cold','cold','hot','hot','hot']}
df = pd.DataFrame(data=d)
df
group   number  weather
A       1       hot
A       2       hot
A       3       hot
B       1       cold
B       2       cold
B       3       cold
C       1       hot
C       2       hot
C       3       hot
D       1       cold
D       2       cold
D       3       cold
E       1       hot
E       2       hot
E       3       hot

MY CODE

df_s=[]
for typ in df.group.sample(3,replace=False):
    df_s.append(df[df['group']==typ])
df_s=pd.concat(df_s)
df_s

OUTCOME

group   number  weather
E       1       hot
E       2       hot
E       3       hot
E       1       hot
E       2       hot
E       3       hot
D       1       cold
D       2       cold
D       3       cold

The outcome should contain data for 3 different groups. However, as can be seen, there are only 2 (E and D), meaning the code can select the same group more than once.

d.patel
  • You can use `np.random.seed(32)` to make the data reproducible. You can replace the value `32` with any value you would like. So if you share your code or if you re-run the code, you will generate the same random sample of data. – gernworm Jul 20 '21 at 13:43
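
A small sketch of the seeding idea in the comment above, using a made-up `labels` array rather than the question's dataframe (the names `labels`, `first` and `second` are only for illustration). Reseeding with the same value before each draw should reproduce the draw:

```python
import numpy as np

# Hypothetical stand-in for the unique group labels in the question
labels = np.array(["A", "B", "C", "D", "E"])

np.random.seed(32)  # any integer works; 32 is the value from the comment
first = np.random.choice(labels, 3, replace=False)

np.random.seed(32)  # reseed with the same value before drawing again
second = np.random.choice(labels, 3, replace=False)

# Both draws pick the same three labels in the same order
print(first, second)
```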

1 Answer

Calling sample with replace=False only guarantees that no row is drawn twice. It samples rows, not group labels, and each letter in your group column appears on several rows, so the sample can still contain the same letter more than once.
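
To see why, here is a minimal sketch on a cut-down copy of the example data (the variable name `picked` is mine, and the `random_state` value is arbitrary). With 5 group labels spread over 15 rows, any 6 rows drawn without replacement must repeat a label, even though no row is repeated:

```python
import pandas as pd

# Cut-down copy of the question's data: 5 groups, 3 rows each
df = pd.DataFrame({
    "group": list("AAABBBCCCDDDEEE"),
    "number": [1, 2, 3] * 5,
})

# Draw 6 rows without replacement; random_state is only here for repeatability
picked = df["group"].sample(6, replace=False, random_state=0)

# Row indices are all distinct, because replace=False applies to rows
print(picked.index.is_unique)

# Labels are not all distinct: 6 draws from 5 labels must repeat one
print(picked.is_unique)
```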

A quick fix for your code is to sample from the unique group labels instead of from the rows:

import numpy as np

# Draw 3 distinct group labels, then collect every row of each chosen group
df_s=[]
for typ in np.random.choice(df["group"].unique(), 3, replace=False):
    df_s.append(df[df['group']==typ])
df_s=pd.concat(df_s)
df_s
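
If you would rather avoid the loop, one possible alternative (a sketch only, not tried on your full data; `sample_whole_groups` is a hypothetical helper name of mine, not a pandas function) is to de-duplicate the labels first, sample from those, and filter with `isin`:

```python
import pandas as pd

def sample_whole_groups(df, column, n, seed=None):
    """Return all rows belonging to n distinct, randomly chosen groups."""
    # One entry per group label, so sampling here cannot repeat a group
    labels = df[column].drop_duplicates()
    chosen = labels.sample(n, replace=False, random_state=seed)
    # Keep every row whose label was chosen
    return df[df[column].isin(chosen)]

# Same shape as the question's test data
df = pd.DataFrame({
    "group": list("AAABBBCCCDDDEEE"),
    "number": [1, 2, 3] * 5,
    "weather": (["hot"] * 3 + ["cold"] * 3) * 2 + ["hot"] * 3,
})

df_s = sample_whole_groups(df, "group", 3, seed=32)
print(df_s)
```

Note that this keeps the rows in their original order rather than in the order the groups were drawn, which may or may not matter for your use case.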
LockeErasmus