
I want to randomly pick, for example, 10 of the groups in my dataframe, but I'm stuck on the error below. How can I apply a groupby before the random selection? I tried the following approaches:

random_selection = tot_groups.groupby('query_col').apply(lambda x: x.sample(3))
random_selection = tot_groups.groupby('query_col').sample(n=10)

Error: ValueError: Cannot take a larger sample than population when 'replace=False'

Thanks!

UPDATE:

Current dataset

ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0
KOX89835.1,SFN69046.1,79.07,86,18,0,1,86,12,97,1.36e-49,143.0
KOX89835.1,SFE98714.1,77.907,86,19,0,1,86,19,104,2.1400000000000002e-49,143.0
KOX89835.1,WP_086938959.1,76.471,85,20,0,1,85,4,88,1.25e-48,140.0
KOX89835.1,WP_231794161.1,76.471,85,20,0,1,85,5,89,1.75e-48,140.0
KOX89835.1,WP_231794169.1,75.294,85,21,0,1,85,5,89,2.41e-48,140.0
WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
XP_037955766.1,WP_229689219.1,93.583,374,24,0,3,376,5,378,0.0,745.0
XP_037955766.1,WP_229799179.1,93.583,374,24,0,3,376,1,374,0.0,744.0
XP_037955766.1,WP_017454560.1,92.308,377,28,1,1,376,1,377,0.0,738.0
XP_037955766.1,WP_108127780.1,92.838,377,26,1,1,376,1,377,0.0,736.0

Desired output: randomly select n groups from the dataframe, grouped by query_col. For example, with n=2:

WP_001287378.1,QBP98897.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_005164157.1,86.765,136,17,1,1,135,1,136,1.68e-85,241.0
WP_001287378.1,WP_085071573.1,86.667,135,18,0,1,135,1,135,1.73e-85,241.0
WP_001287378.1,WP_014608965.1,86.765,136,17,1,1,135,1,136,2.49e-85,240.0
WP_001287378.1,WP_004932170.1,86.667,135,18,0,1,135,1,135,6.88e-78,221.0
WP_001287378.1,GGD19357.1,91.912,136,10,1,1,136,1,135,1.01e-77,221.0
WP_001287378.1,OMQ27200.1,85.926,135,19,0,1,135,1,135,1.79e-77,221.0
ABG23209.1,UBH04469.1,89.655,145,15,0,1,145,19,163,3.63e-100,275.0
ABG23209.1,UBH04470.1,89.655,145,15,0,1,145,20,164,4.68e-100,275.0
ABG23209.1,UBH04471.1,89.655,145,15,0,1,145,19,163,4.83e-100,275.0
ABG23209.1,UBH04472.1,89.655,145,15,0,1,145,24,168,5.58e-100,275.0
Chiara

1 Answer


groupby's sample returns n rows from each group, not n groups. If any group contains fewer than n rows, you get this error.
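To make the cause of the error concrete, here is a minimal sketch on a hypothetical toy dataframe (the `score` column and the group labels "a" and "b" are made up for illustration). It reproduces the ValueError and shows that capping the per-group sample size at the group length avoids it. Note that this still samples rows within every group, which is not the same as picking whole groups.

```python
import pandas as pd

# Hypothetical toy data: group "a" has 4 rows, group "b" has only 2.
df = pd.DataFrame({"query_col": ["a"] * 4 + ["b"] * 2, "score": range(6)})

# Asking for 3 rows per group fails because group "b" has only 2 rows.
try:
    df.groupby("query_col").sample(n=3)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Capping the sample size at the group length avoids the error,
# but it still returns rows from every group, not a subset of groups.
capped = pd.concat(
    g.sample(n=min(len(g), 3), random_state=0)
    for _, g in df.groupby("query_col")
)
print(capped)
```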

To select whole groups at random instead, count the groups, sample n distinct group numbers from the range [0, number of groups) without replacement, and keep the rows whose group number is among the sampled values.

import random
import pandas as pd

random.seed(0)

tot_groups = pd.read_csv("data.csv",header=None).rename(columns={0:"query_col"})
grouped = tot_groups.groupby("query_col")  # suppose you want to use this

group_selectors = random.sample(range(grouped.ngroups), k=2)
ret_df = tot_groups[grouped.ngroup().isin(group_selectors)]

print(ret_df)

However, there is no need to create a groupby object at all. You can collect the distinct query_col values, sample n of them, and keep the rows whose query_col is among the sampled values:

import random
import pandas as pd

random.seed(0)

tot_groups = pd.read_csv("data.csv",header=None).rename(columns={0:"query_col"})
unique_queries = tot_groups["query_col"].unique().tolist()
selected_queries = random.sample(unique_queries,k=2)

ret_df = tot_groups[tot_groups["query_col"].isin(selected_queries)]

print(ret_df)
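The same idea can also be written with pandas alone, using Series.sample with a random_state instead of the random module. The following is only a sketch: the helper name `sample_groups`, its parameters, and the toy dataframe with a `hit` column are hypothetical and stand in for the real data read from the CSV file.

```python
import pandas as pd


def sample_groups(df, column, n, seed=None):
    """Return all rows belonging to n randomly chosen distinct values of `column`."""
    # Draw n distinct group keys without replacement.
    chosen = df[column].drop_duplicates().sample(n=n, random_state=seed)
    # Keep every row whose key was drawn.
    return df[df[column].isin(chosen)]


# Hypothetical toy data with the same group sizes as the question (4, 5, 7, 4).
toy = pd.DataFrame({
    "query_col": (["ABG23209.1"] * 4 + ["KOX89835.1"] * 5
                  + ["WP_001287378.1"] * 7 + ["XP_037955766.1"] * 4),
    "hit": range(20),
})

result = sample_groups(toy, "query_col", n=2, seed=0)
print(result)
```

The selected rows keep their original order in the dataframe, and asking for more groups than exist raises the same ValueError as before.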
DanielTuzes