I have a data frame of 1 million records with 5 columns:
unique_index, name, company_name, city_id, state_id
The company_name column has 100k unique values and follows a power law: the top 5000 company_names cover 70% of the records.
I want to take an equal number of samples from two groups: the records belonging to the top 5000 companies, and the records belonging to all the remaining companies.
I tried pd.qcut(df['company_name'], [0.25, 1]). This gave me the error below:
TypeError: unorderable types: str() <= float()
Can qcut not be applied to strings?
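To make the goal concrete, here is a minimal sketch of the kind of split described above. It does not use qcut, since qcut bins numeric values by quantile; instead it ranks companies by frequency with value_counts and samples each group separately. The function name sample_top_vs_rest, its parameters (top_n, n_per_group, seed), and the toy data are all hypothetical and only for illustration. On the real frame top_n would be 5000.

```python
import numpy as np
import pandas as pd


def sample_top_vs_rest(df, column="company_name", top_n=5000,
                       n_per_group=1000, seed=0):
    """Hypothetical helper: draw the same number of rows from the
    top_n most frequent values of `column` and from all other values."""
    # value_counts() sorts by frequency, most common first
    counts = df[column].value_counts()
    top_names = counts.index[:top_n]

    # Boolean mask: True for rows whose company is among the top_n
    is_top = df[column].isin(top_names)
    top_part = df[is_top]
    rest_part = df[~is_top]

    # Never ask for more rows than the smaller group contains
    n = min(n_per_group, len(top_part), len(rest_part))
    return pd.concat([
        top_part.sample(n=n, random_state=seed),
        rest_part.sample(n=n, random_state=seed),
    ])


# Toy data (assumed, not the real frame): 50 companies with
# roughly power-law frequencies over 2000 rows.
rng = np.random.default_rng(0)
names = ["co_%d" % i for i in range(50)]
weights = 1.0 / np.arange(1, 51)
weights /= weights.sum()
toy = pd.DataFrame({
    "unique_index": np.arange(2000),
    "company_name": rng.choice(names, size=2000, p=weights),
})

# Treat the 5 most frequent companies as the "top" group here
sampled = sample_top_vs_rest(toy, top_n=5, n_per_group=100)
```

Sampling the two parts separately and concatenating keeps the group sizes exactly equal, which a single random sample over the whole frame would not guarantee.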