I have a data set like this

1A1HI_R071_PH_INSPECT_VIS_1_2_201231_025816.JPG 1A  1A1HI
1A1JK_R071_PH_INSPECT_VIS_1_2_210115_121554.JPG 1A  1A1JK
1P3G6_R071_PH_INSPECT_VIS_2_2_201231_034741.JPG 1P  1P3G6
1P3GC_R071_PH_INSPECT_VIS_3_2_201107_140047.JPG 1P  1P3GC
M10L0_R071_PH_INSPECT_VIS_6_2_201121_071741.JPG M1  M10L0
M10L4_R071_PH_INSPECT_VIS_8_2_201201_142646.JPG M1  M10L4
S5148_R071_PH_INSPECT_VIS_1_2_201127_042210.JPG S5  S5148
S516U_R071_PH_INSPECT_VIS_5_2_201222_074443.JPG S5  S516U
V929S_R071_PH_INSPECT_VIS_8_2_201120_144633.JPG V9  V929S
V92B0_R071_PH_INSPECT_VIS_4_2_201121_095537.JPG V9  V92B0
V92B0_R071_PH_INSPECT_VIS_4_2_201121_095539.JPG V9  V92B0
V92EM_R071_PH_INSPECT_VIS_2_2_210105_133406.JPG V9  V92EM
W405K_R071_PH_INSPECT_VIS_11_2_201021_230940.JPG    W4  
W405O_R071_PH_INSPECT_VIS_2_2_201206_095433.JPG W4  W405O
W40EW_R071_PH_INSPECT_VIS_3_3_201219_120634.JPG W4  W40EW
W40EW_R071_PH_INSPECT_VIS_5_3_201220_072010.JPG W4  W40EW
W40EW_R071_PH_INSPECT_VIS_5_3_201220_072019.JPG W4  W40EW
X103K_R071_PH_INSPECT_VIS_2_3_210112_185054.JPG X1  X103K
1A1HI_R071_PH_INSPECT_VIS_1_4_201231_025833.JPG 1A  1A1HI
1A1RE_R071_PH_INSPECT_VIS_1_4_201227_153637.JPG 1A  1A1RE
1P3G6_R071_PH_INSPECT_VIS_2_4_201231_034806.JPG 1P  1P3G6
1P3GC_R071_PH_INSPECT_VIS_3_4_201107_140102.JPG 1P  1P3GC
1P3HO_R071_PH_INSPECT_VIS_6_4_201214_113511.JPG 1P  1P3HO
1P3HQ_R071_PH_INSPECT_VIS_5_4_201207_191653.JPG 1P  1P3HQ
5X6X6_R071_PH_INSPECT_VIS_3_4_201211_142453.JPG 5X  5X6X6
A70NG_R071_PH_INSPECT_VIS_5_4_201025_182537.JPG A7  A70NG
M10L0_R071_PH_INSPECT_VIS_6_4_201121_071750.JPG M1  M10L0
M10L4_R071_PH_INSPECT_VIS_8_4_201201_142701.JPG M1  M10L4
V929S_R071_PH_INSPECT_VIS_8_4_201120_144651.JPG V9  V929S
V92EM_R071_PH_INSPECT_VIS_2_4_210105_133438.JPG V9  V92EM
W405O_R071_PH_INSPECT_VIS_2_4_201206_095500.JPG W4  W405O
W4078_R071_PH_INSPECT_VIS_5_4_201215_153919.JPG W4  W4078
W40BK_R071_PH_INSPECT_VIS_2_4_210113_175802.JPG W4  W40BK
W40EW_R071_PH_INSPECT_VIS_5_4_201220_072024.JPG W4  W40EW
1A1HI_R071_PH_INSPECT_VIS_1_5_201231_025836.JPG 1A  1A1HI
1A1JK_R071_PH_INSPECT_VIS_1_5_210115_121617.JPG 1A  1A1JK
1A1RE_R071_PH_INSPECT_VIS_1_5_201227_153639.JPG 1A  1A1RE
1P3G6_R071_PH_INSPECT_VIS_2_5_201231_034809.JPG 1P  1P3G6
1P3GC_R071_PH_INSPECT_VIS_3_5_201107_140105.JPG 1P  1P3GC

The first column is image name, the second column is the product name

The number of images per product in the dataset is shown in a screenshot (image not available).

How can I select images (1st column) based on a percentage of each product's occurrences in the 2nd column?

For example, I need to select 170 random rows (images) whose 2nd column is 1P, 156 random rows whose 2nd column is 1A, and so on, so that I get 20% of the images in each product category to build a training set.

Chris
Arjun

2 Answers

2

You can use DataFrameGroupBy.sample (available since pandas 1.1.0) to sample rows within each category.

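import pandas

frac = 0.2  # fraction of rows to draw from each category (20%)

# Sample dataframe
df = pandas.DataFrame({
        'image_id': [1,2,3,4,5,6,7],
        'product_category': ['A', 'A', 'A', 'A', 'A', 'B', 'B']
})

# Draw the fraction independently within each product category
df.groupby('product_category').sample(frac=frac)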

Note, however, that a category may return no rows if frac times its row count rounds to zero. In the sample above, category B has 2 rows and 0.2 × 2 = 0.4, so it contributes none.
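If every category must be represented, one option is to compute the per-group count yourself and clamp it to at least one row. The following is a minimal sketch, not part of pandas: the helper name `sample_at_least_one` and its `seed` parameter are made up for illustration, and it assumes that rounding `frac * group size` is the behaviour you want.

```python
import pandas as pd

def sample_at_least_one(df, group_col, frac, seed=0):
    """Sample roughly `frac` of each group, but never fewer than one row.

    Hypothetical helper for illustration; `seed` only makes the draw repeatable.
    """
    parts = []
    for _, group in df.groupby(group_col):
        # Round the target count, then clamp it to the range [1, len(group)]
        n = min(len(group), max(1, round(frac * len(group))))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)

df = pd.DataFrame({
    'image_id': [1, 2, 3, 4, 5, 6, 7],
    'product_category': ['A', 'A', 'A', 'A', 'A', 'B', 'B']
})

sampled = sample_at_least_one(df, 'product_category', frac=0.2)
print(sampled)
```

With this data, category A (5 rows) and category B (2 rows) each contribute one row. The trade-off is that small categories end up over-represented relative to a strict 20%.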

AnkurSaxena
  • I have a similar issue, and tried your solution. I get the error, 'Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method'. I've seen this: https://stackoverflow.com/questions/53406056/attributeerror-cannot-access-callable-attribute-groupby-of-dataframegroupby, but it didn't help (my data has not been grouped before). Any idea how to solve this? – johnjohn Jan 31 '21 at 10:11
1

You can sample each group with a lambda as follows:

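import pandas as pd

# Assumes example.csv has a header row and exactly two columns: image name, then product
df = pd.read_csv("example.csv")
df.columns = ["image", "product"]

# Draw 20% of the rows from each product group
grouped = df.groupby("product")
result = grouped.apply(lambda x: x.sample(frac=0.2))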
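Since the goal in the question is a training set, it may also help to keep the rows that were not sampled. Below is a rough sketch using a handful of file names from the question. It assumes the product code is simply the first two characters of the image name (which matches the rows shown, but is an assumption about the full dataset), and it uses groupby(...).sample, so it needs pandas 1.1.0 or later. The `random_state` value is arbitrary.

```python
import pandas as pd

# A few image names copied from the question (5 per product)
images = [
    "1A1HI_R071_PH_INSPECT_VIS_1_2_201231_025816.JPG",
    "1A1JK_R071_PH_INSPECT_VIS_1_2_210115_121554.JPG",
    "1A1HI_R071_PH_INSPECT_VIS_1_4_201231_025833.JPG",
    "1A1RE_R071_PH_INSPECT_VIS_1_4_201227_153637.JPG",
    "1A1HI_R071_PH_INSPECT_VIS_1_5_201231_025836.JPG",
    "1P3G6_R071_PH_INSPECT_VIS_2_2_201231_034741.JPG",
    "1P3GC_R071_PH_INSPECT_VIS_3_2_201107_140047.JPG",
    "1P3G6_R071_PH_INSPECT_VIS_2_4_201231_034806.JPG",
    "1P3GC_R071_PH_INSPECT_VIS_3_4_201107_140102.JPG",
    "1P3HO_R071_PH_INSPECT_VIS_6_4_201214_113511.JPG",
]

df = pd.DataFrame({"image": images})
# Assumption: the product code is the first two characters of the file name
df["product"] = df["image"].str[:2]

# 20% of each product goes into the training set
train = df.groupby("product").sample(frac=0.2, random_state=42)

# sample() keeps the original row labels, so the remainder is a simple drop
rest = df.drop(train.index)

print(train["image"].tolist())
```

Because the sampled rows keep their original index, `df.drop(train.index)` gives the complement directly. This would not work as-is on the output of the `apply` version above, which by default adds the group key as an extra index level.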
pakpe