I have a pandas data frame with approximately 5 million rows with 2 columns "top_level_domain" and "category". I would like to like to create a new dataframe with distinct top_level_domain 's and a category column that is comma separated for the unique categories. This dataframe already has categories that are comma separated by nature. Other domains like google will have repeating category, but I only want one.
Dataframe:
df1
top_level_domain category
1 google.com Search Engines
2 service-now.com Business, Software/Hardware
3 google-analytics.com Internet Services
4 live.com None Assigned
5 google.com Content Server
6 google.com Search Engines
7 inspectlet.com Internet Services
8 doubleclick.net Online Shopping, Web Ads
9 google.com Search Engines
10 doubleclick.net Ads
Desired output:
df2
top_level_domain category
1 google.com Search Engines, Content Server
2 service-now.com Business, Software/Hardware
3 google-analytics.com Internet Services
4 live.com None Assigned
7 inspectlet.com Internet Services
8 doubleclick.net Online Shopping, Web Ads, Ads
What is the best way to accomplish this?
I have tried all of the examples from Pandas groupby multiple columns, list of multiple columns
And others like the one below but I am still getting duplicates in the category column.
distinct_category = distinct_category.groupby('top_level_domain')['category'].agg(lambda x: ', '.join(set(x))).reset_index()
But I am getting duplicates in the columns
1 zoho.com Online Shopping, Interactive Web Applications, Interactive Web Applications, Interactive Web Applications, Motor Vehicles
1 zohopublic.com Internet Services, Motor Vehicles, Internet Services, Online Shopping, Internet Services