2

I new to pandas. I have a dataset like this one:

df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],
                'Event':['music.example.com', 'poetry.example.com', 'theatre.domain.com', 'comedy.domain.com'],
                'Cost':[10000, 5000, 15000, 2000]})

And would like to add a column for "base domain" so I can do aggregate functions on base domain instead of subdomain. In this example there new column would have values

'baseDomain':['example.com', 'example.com', 'domain.com', 'domain.com'],

It should not just blindly split on the "." so should probably use something like tld although the domains are not as URLs

========== Update

Used adhg and Henry Ecker solution and did it like this:

def get_base_domain(event):
    ext = tldextract.extract(event)
    return ext.domain + '.' + ext.suffix

df['baseDomain']  = df.apply(lambda x: get_base_domain(x['Event']), axis=1)
Yepher
  • 1,465
  • 12
  • 25
  • 1
    [tldextract](https://pypi.org/project/tldextract/) works with incomplete URLs as well. `df['baseDomain'] = df['Event'].apply(lambda url: tldextract.extract(url).registered_domain)` based on [this answer](https://stackoverflow.com/a/51347816/15497888) – Henry Ecker Mar 01 '22 at 23:30

1 Answers1

2

you can do this:

def get_base_domain(event):
    return event[event.index('.')+1:]

df['baseDomain']  = df.apply(lambda x: get_base_domain(x['Event']), axis=1)

desired result:

         Date   Event               Cost     baseDomain
0   10/2/2011   music.example.com   10000   example.com
1   11/2/2011   poetry.example.com  5000    example.com
2   12/2/2011   theatre.domain.com  15000   domain.com
3   13/2/2011   comedy.domain.com   2000    domain.com

adjust get_base_domain if you have unclean Event domain data

adhg
  • 10,437
  • 12
  • 58
  • 94
  • Thanks that is very helpful I am sure I can work out the solution from there. One issue I found is when I run that where one of the domains in the data dataset is `quora.com` it return just `com` instead of `quora.com`. I should have said the "Event" can be any valid domain or any level of subdomain. – Yepher Mar 01 '22 at 23:33
  • 1
    @Yepher this is exactly why I added the function so you can get other exceptions. So just improve it such that if theres only 1 dot - return 'invalid' or whatever – adhg Mar 01 '22 at 23:38
  • @Yepher do this in the func: if event.count('.')==1: return event – adhg Mar 01 '22 at 23:41
  • I did this ext = tldextract.extract('http://' + domain) return ext.domain + '.' + ext.suffix – Yepher Mar 02 '22 at 00:23
  • @Yepher even better ;-) – adhg Mar 02 '22 at 00:44