
I have a DataFrame very similar to this one, but with thousands of values:

import numpy as np
import pandas as pd 

# Setup fake data.
np.random.seed([3, 1415])      
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short']*5 + ['long']*5) *2,
    'image name': (['image01']*2  + ['image02']*2)*5,
    'Value2': np.random.random(20)})

I was able to randomly sample 2 values per image, per Class, and per type with the following code:

df2 = df.groupby(['type', 'Class', 'image name'])[['Value2']].apply(lambda s: s.sample(min(len(s),2)))

I got the following result :

My table

I'm looking for a way to subset that table so that one image ('image name') is chosen at random per type and per Class, while keeping the 2 values for the randomly selected image.

Excel example of my desired output:

Desired output

rafaelc
Julien
  • The last part of your question is not clear... can you explain what it is you mean? – cs95 Apr 02 '18 at 23:20
  • In the example above (link "My table"), the table has 2 images containing 2 values each, per type and per Class. I would like to transform the table so that it randomly keeps 1 image containing 2 values (per type and per Class). In the example above, that would mean randomly removing one image for each condition. In my real dataset, I would like to be able to randomly pick "n" images for each condition. I hope that helps – Julien Apr 02 '18 at 23:35

1 Answer


IIUC, the issue is that you do not want to group by the column image name, but if that column is not included in the groupby, you will lose it.

You can first create the groupby object:

gb = df.groupby(['type', 'Class'])

Now you can iterate over the groupby blocks using a list comprehension:

blocks = [data.sample(n=1) for _,data in gb]

Now you can concatenate the blocks to reconstruct your randomly sampled DataFrame:

pd.concat(blocks)

Output

   Class    Value2 image name   type
7      A  0.817744    image02   long
17     B  0.199844    image01   long
4      A  0.462691    image01  short
11     B  0.831104    image02  short
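
As a side note, newer pandas versions (1.1 and later, as far as I know) expose a sample method directly on the groupby object, so the list comprehension and concat can be collapsed into one call. The snippet below is only a sketch under that version assumption; the seed values and the name one_per_group are illustrative and not part of the original question.

```python
import numpy as np
import pandas as pd

# Rebuild the question's fake data (the seed is arbitrary, chosen for this sketch).
np.random.seed(0)
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short'] * 5 + ['long'] * 5) * 2,
    'image name': (['image01'] * 2 + ['image02'] * 2) * 5,
    'Value2': np.random.random(20)})

# One random row per (type, Class) group; all original columns are kept.
# Assumes pandas >= 1.1, where DataFrameGroupBy.sample is available.
one_per_group = df.groupby(['type', 'Class']).sample(n=1, random_state=0)
print(one_per_group)
```

Because the rows are sampled from the original frame, image name survives without having to be one of the grouping keys.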

OR

You can modify your code to group by type and Class only, and add the column image name to the selected columns, like this:

df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))

                  Value2 image name
type  Class
long  A     8   0.777962    image01
            9   0.757983    image01
      B     19  0.100702    image02
            15  0.117642    image02
short A     3   0.465239    image02
            2   0.460148    image02
      B     10  0.934829    image02
            11  0.831104    image02

EDIT: Keeping the same image per group

I'm not sure you can avoid an iterative process for this problem. You could loop over the groupby blocks, pick one random image name within each group, keep only that image's rows, and then randomly sample 2 values from them, like this:

import random

gb = df.groupby(['Class','type'])
ls = []

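for index, frame in gb:
    # Pick one image name at random within this (Class, type) group.
    image = random.choice(frame['image name'].unique())
    # Keep only that image's rows and sample 2 of its values.
    ls.append(frame[frame['image name'] == image].sample(n=2))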

pd.concat(ls)

Output

   Class    Value2 image name   type
6      A  0.850445    image02   long
7      A  0.817744    image02   long
4      A  0.462691    image01  short
0      A  0.444939    image01  short
19     B  0.100702    image02   long
15     B  0.117642    image02   long
10     B  0.934829    image02  short
14     B  0.721535    image02  short
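
Since the comments mention picking "n" images per condition in the real dataset, the loop above could be generalized roughly as follows. This is only a sketch: the helper name sample_images and its parameters n_images, n_values and seed are hypothetical, and it caps both counts at what each group actually contains instead of raising an error.

```python
import numpy as np
import pandas as pd

def sample_images(df, n_images=1, n_values=2, seed=None):
    """Hypothetical helper: pick n_images per (Class, type) group,
    then up to n_values rows for each picked image."""
    rng = np.random.RandomState(seed)
    pieces = []
    for _, frame in df.groupby(['Class', 'type']):
        names = frame['image name'].unique()
        # Choose distinct image names, capped at how many the group has.
        chosen = rng.choice(names, size=min(n_images, len(names)), replace=False)
        for name in chosen:
            rows = frame[frame['image name'] == name]
            # Sample values for this image, capped at the rows available.
            pieces.append(rows.sample(n=min(n_values, len(rows)), random_state=rng))
    return pd.concat(pieces)

# Same fake data layout as in the question (seed chosen arbitrarily here).
np.random.seed(0)
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short'] * 5 + ['long'] * 5) * 2,
    'image name': (['image01'] * 2 + ['image02'] * 2) * 5,
    'Value2': np.random.random(20)})

result = sample_images(df, n_images=1, n_values=2, seed=42)
print(result)
```

With n_images=1 every (Class, type) group ends up with a single image name, which is the behaviour asked for; raising n_images keeps that many distinct images per group.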
DJK
  • Your second example is perfect, but I get a different result when I run it. For example, for each group (type/Class), I obtain two "Value2" entries from 2 different images. I want the same "image name" within each group. I don't know if that makes sense – Julien Apr 03 '18 at 01:46