
I have the following list that records the frequency count of various objects:

counter_obj= [('oranges', 66), ('apple', 13), ('banana', 13), ('pear', 12), ('strawberry', 10), ('watermelon', 10), ('avocado', 8) ... ('blueberry',1),('pineapple',1)]

I'm trying to select eight elements by randomly choosing two objects from each rank quartile.

I tried the following for the first quartile (25%):

from collections import Counter
dct = dict(counter_obj)  # map each object to its frequency count
Counter(dct).most_common(len(dct) // 4)  # top 25% of objects by frequency count

How can I do the same for the remaining quartiles (50% and 75%), given that many of my values are 1 (they appear only once)?
[Figure: bar plot of my original data]

Troy

1 Answer


I would use pandas for this problem:

import pandas as pd

# A list of (fruit, count) tuples rather than a set, so the row order is well defined
# (recent pandas versions reject a set here because sets are unordered)
dct = [('oranges', 66), ('apple', 13), ('banana', 13), ('pear', 12), ('strawberry', 10), ('watermelon', 10), ('avocado', 8), ('blueberry', 1), ('pineapple', 1)]

df = pd.DataFrame(dct, columns=['Fruit', 'Count'])  # convert to DataFrame

select = []

for quant in [.25, .5, .75, 1]:
    curr_q = df['Count'].quantile(quant)  # the count value at this quantile
    # keep all rows whose count is at or below that value (a cumulative filter), then sample two of them
    curr_choice = df[df['Count'] <= curr_q].sample(2)
    select.append(curr_choice)

select = pd.concat(select).reset_index(drop=True)  # concatenate the sampled rows into one DataFrame and reset the indices
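
Note that the filter `df['Count'] <= curr_q` is cumulative: each pass can draw from every row at or below the current quantile value, so the same row may be picked more than once and low counts are favoured. If the goal is two objects from each disjoint rank quartile, one possible variant is sketched below. It is an illustration under assumptions, not part of the original answer: the names `data`, `Quartile`, and `sample_per_quartile` are made up for the example, and ties are broken by row position through `rank(method='first')`, which is one choice among several.

```python
import pandas as pd

# Same toy data as above, as a list of (fruit, count) tuples.
data = [('oranges', 66), ('apple', 13), ('banana', 13), ('pear', 12),
        ('strawberry', 10), ('watermelon', 10), ('avocado', 8),
        ('blueberry', 1), ('pineapple', 1)]
df = pd.DataFrame(data, columns=['Fruit', 'Count'])

# rank(method='first') gives every row a distinct rank, so tied counts
# are split by row position instead of collapsing into one bin.
ranks = df['Count'].rank(method='first')

# Cut the ranks into four equal-sized bins, labelled 0 (lowest counts) to 3.
df['Quartile'] = pd.qcut(ranks, 4, labels=False)


def sample_per_quartile(frame, n=2, seed=None):
    """Hypothetical helper: sample up to n rows from each quartile bin."""
    parts = [
        group.sample(min(n, len(group)), random_state=seed)
        for _, group in frame.groupby('Quartile')
    ]
    return pd.concat(parts).reset_index(drop=True)


select = sample_per_quartile(df, n=2, seed=0)
```

Because the bins are built from ranks rather than raw counts, a long run of objects with count 1 is spread over several bins instead of making the 25% and 50% quantile values coincide. Which of the tied objects lands in which bin is arbitrary, though.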
emilaz
  • Thank you @emilaz, this answers the question and the example works fine. Did you take into account that I have many objects with frequency = 1? In my real data the first and second quantiles are the same, the third has the value 2, and only the fourth is at 13. Any idea how to work around this? – Troy Apr 14 '20 at 09:30
  • If you want to work with sampling from quantiles, then the fact you mentioned should not matter at all. If you want some kind of even sampling by another rule, you should reconsider working with quantiles. But without knowing what exactly you want to do, I can't help you with that. – emilaz Apr 14 '20 at 09:34
  • I wrote this comment because I keep getting only objects with count = 1. I actually have a list of URLs from Google search results with their frequency counts (how many times each URL appears in the results), and I'm trying to draw a representative sub-sample from this data, so I thought about using quantiles instead of, for example, the most common items. What do you think? – Troy Apr 14 '20 at 09:46
  • Depends on what you define as representative. I would probably create a simple list where each URL appears as often as its frequency count and then simply sample URLs from that. – emilaz Apr 14 '20 at 09:50
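
The suggestion in the last comment, sampling from a list in which each item appears as often as its frequency count, could look roughly like the sketch below. It uses only the standard library. The fruit data stands in for the (URL, count) pairs, and the names `counts`, `population`, and `rng` are assumptions made for the example.

```python
import random

# Toy stand-in for (URL, frequency count) pairs.
counts = [('oranges', 66), ('apple', 13), ('banana', 13), ('pear', 12),
          ('strawberry', 10), ('watermelon', 10), ('avocado', 8),
          ('blueberry', 1), ('pineapple', 1)]

# Expand the counts: each name appears as many times as its frequency.
population = [name for name, count in counts for _ in range(count)]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# Draw 8 entries from the expanded list without replacement. A frequent
# name can still appear several times because it has several copies.
sample = rng.sample(population, 8)

# A similar draw, with replacement, that avoids building the expanded list.
names, weights = zip(*counts)
weighted = rng.choices(names, weights=weights, k=8)
```

Either draw picks frequent items more often, in proportion to their counts, so items that appear only once will rarely be selected. If each selected item should be distinct, deduplicate the result and draw again until enough unique items are collected.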