3

I have a pandas data frame with approximately 5 million rows with 2 columns "top_level_domain" and "category". I would like to like to create a new dataframe with distinct top_level_domain 's and a category column that is comma separated for the unique categories. This dataframe already has categories that are comma separated by nature. Other domains like google will have repeating category, but I only want one.

Dataframe:

df1
    top_level_domain      category
1   google.com            Search Engines
2   service-now.com       Business, Software/Hardware
3   google-analytics.com  Internet Services
4   live.com              None Assigned
5   google.com            Content Server
6   google.com            Search Engines
7   inspectlet.com        Internet Services
8   doubleclick.net       Online Shopping, Web Ads
9   google.com            Search Engines
10  doubleclick.net       Ads

Desired output:

df2
    top_level_domain      category
1   google.com            Search Engines, Content Server
2   service-now.com       Business, Software/Hardware
3   google-analytics.com  Internet Services
4   live.com              None Assigned
7   inspectlet.com        Internet Services
8   doubleclick.net       Online Shopping, Web Ads, Ads

What is the best way to accomplish this?

I have tried all of the examples from Pandas groupby multiple columns, list of multiple columns

And others like the one below but I am still getting duplicates in the category column.

distinct_category = distinct_category.groupby('top_level_domain')['category'].agg(lambda x: ', '.join(set(x))).reset_index()

But I am getting duplicates in the columns

1   zoho.com    Online Shopping, Interactive Web Applications, Interactive Web Applications, Interactive Web Applications, Motor Vehicles
1   zohopublic.com  Internet Services, Motor Vehicles, Internet Services, Online Shopping, Internet Services
sectechguy
  • 2,037
  • 4
  • 28
  • 61

3 Answers3

4

First split column by comma, then groupby by column and use generator for flatten nested lists with set and join:

df = (distinct_category['category'].str.split(', ')
                    .groupby(distinct_category['top_level_domain'])
                    .agg(lambda x: ', '.join(set(y for z in x for y in z)))
                    .reset_index())
print (df)

       top_level_domain                        category
0       doubleclick.net   Ads, Online Shopping, Web Ads
1  google-analytics.com               Internet Services
2            google.com  Content Server, Search Engines
3        inspectlet.com               Internet Services
4              live.com                   None Assigned
5       service-now.com     Business, Software/Hardware

Another solution is assign back splitted values:

df = (distinct_category.assign(category = distinct_category['category'].str.split(', '))
                       .groupby('top_level_domain')['category']
                       .agg(lambda x: ', '.join(set(y for z in x for y in z)))
                       .reset_index())
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

First expand your dataframe, so each row contains one category only:

split = df['category'].str.split(', ')
lens = split.str.len()

df = pd.DataFrame({'top_level_domain': np.repeat(df['top_level_domain'].values, lens),
                   'category': np.concatenate(split)})

Then remove duplicates and use agg with str.join:

res = df.drop_duplicates()\
        .groupby('top_level_domain')['category'].agg(','.join)
jpp
  • 159,742
  • 34
  • 281
  • 339
0

The following code works for me:

df =df.groupby('top_level_domain')['category'].agg([('category', ', '.join)]).reset_index()
Loochie
  • 2,414
  • 13
  • 20