I have a data frame of 1 million records with 5 columns.

unique_index,name,company_name,city_id,state_id

The company_name column has 100k unique values, and their frequencies follow a power law: the top 5000 company_names cover 70% of the records.

[Figure: Power law]

I want to draw an equal number of samples from two groups: the records belonging to the top 5000 companies, and the records belonging to all remaining companies.

I tried pd.qcut(df['company_name'], [0.25, 1]), which gave me the error below: TypeError: unorderable types: str() <= float(). Can qcut not be applied to strings?

James Z
pnv

1 Answer

You could try grabbing the top companies with value_counts(), which sorts by frequency in descending order, and then creating a new boolean column that is True if a row's company is among the top companies and False otherwise. I think it would look something like this:

top5000 = df['company_name'].value_counts().index[0:5000].tolist()
df['InTop'] = df['company_name'].isin(top5000)

This would allow you to sample separately from the group where df['InTop'] == True and the group where df['InTop'] == False.
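As a rough sketch of the sampling step, here is one way it might look on toy data. The toy frame, top_n, and n_per_group are made up for illustration; in the question top_n would be 5000, and n_per_group must not exceed the size of the smaller group. Note that qcut bins numeric values by quantile, which is why it cannot be applied to the company names directly.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame (hypothetical values):
# 100 companies whose frequencies decay roughly like 1/rank.
rng = np.random.default_rng(0)
companies = [f"company_{i}" for i in range(100)]
weights = 1.0 / np.arange(1, 101)
weights /= weights.sum()
df = pd.DataFrame(
    {"company_name": rng.choice(companies, size=10_000, p=weights)}
)

top_n = 10         # 5000 in the question
n_per_group = 500  # hypothetical number of rows to draw from each group

# value_counts() sorts by frequency, so the first top_n labels are the
# most common companies.
top = df["company_name"].value_counts().index[:top_n]
df["InTop"] = df["company_name"].isin(top)

# Draw the same number of rows from each group and stack the results.
sample = pd.concat(
    [
        df[df["InTop"]].sample(n=n_per_group, random_state=0),
        df[~df["InTop"]].sample(n=n_per_group, random_state=0),
    ]
)
print(sample["InTop"].value_counts())
```

Sampling rows this way still favours the larger companies inside each group, since they own more rows. If you want every company to be equally likely instead, you would sample from the unique company names in each group first.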

nanojohn