Urlparse applied to a column for extracting length and TLD info

Question

I'm trying to extract the length and suffix (TLD) from a list of websites in a pandas DataFrame.

Website.             Label
18egh.com            1
fish.co.uk           0
www.description.com  1
http://world.com     1

My desired output is:

Website              Label  Length  Tld
18egh.com            1      5       com
fish.co.uk           0      4       co.uk
www.description.com  1      11      com
http://world.com     1      5       com

I first tried computing the length, as follows:

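from urllib.parse import urlparse

def get_domain(df):
    my_list = []
    for x in df['Website'].tolist():
        domain = urlparse(x).netloc
        my_list.append(domain)
    # Assign the columns once, after the loop has collected every domain
    df['Domain'] = my_list
    df['Length'] = df['Domain'].str.len()
    return df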

But when I check, the list is empty. I assumed urlparse would be enough to extract the domain and TLD; if I'm wrong, I'd appreciate a pointer in the right direction.

MDR · Accepted Answer · 2021-08-29T17:06:29.947

Update:

To extract the subdomain, domain, and suffix, try tldextract. It relies on the Public Suffix List, so multi-part suffixes such as co.uk are handled correctly.

Example:

import pandas as pd
import tldextract  # pip install tldextract, or: conda install -c conda-forge tldextract

df = pd.DataFrame({'Website.': {0: '18egh.com',
  1: 'fish.co.uk',
  2: 'www.description.com',
  3: 'http://world.com',
  4: 'http://forums.news.cnn.com/'},
 'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}})

# Parse each URL once, then pull the parts out by attribute name
extracted = df['Website.'].map(tldextract.extract)
df['subdomain'] = extracted.map(lambda r: r.subdomain)
df['domain'] = extracted.map(lambda r: r.domain)
df['suffix'] = extracted.map(lambda r: r.suffix)

print(df)

                      Website.  Label    subdomain       domain  suffix
0                    18egh.com      1                     18egh     com
1                   fish.co.uk      0                      fish   co.uk
2          www.description.com      1          www  description     com
3             http://world.com      1                     world     com
4  http://forums.news.cnn.com/      0  forums.news          cnn     com
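
As for why the urlparse attempt in the question came back empty: urlparse only fills in netloc when the string has a scheme or starts with `//`. A bare host such as 18egh.com is treated as a path instead. Here is a minimal sketch of that behavior. The helper name `netloc_of` is made up for illustration, and it assumes that any input without `://` is a bare host:

```python
from urllib.parse import urlparse

def netloc_of(url: str) -> str:
    """Return the host part of url, prefixing '//' when no scheme is present."""
    # Assumption: a string without '://' is a bare host like 'fish.co.uk'.
    if "://" not in url:
        url = "//" + url
    return urlparse(url).netloc

# Without a scheme, the host ends up in .path and .netloc stays empty
print(repr(urlparse("18egh.com").netloc))  # ''
print(urlparse("18egh.com").path)          # 18egh.com

# With the '//' prefix (or a real scheme), netloc is populated
print(netloc_of("18egh.com"))         # 18egh.com
print(netloc_of("http://world.com"))  # world.com
```

Note that this only recovers the host. Splitting it into domain and suffix (com vs. co.uk) still needs something like tldextract.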

Original answer below

Try:

import pandas as pd

df = pd.DataFrame({'Website.': {0: '18egh.com',
  1: 'fish.co.uk',
  2: 'www.description.com',
  3: 'http://world.com'},
 'Label': {0: 1, 1: 0, 2: 1, 3: 1}})

# Optional scheme, then optional "www.", then capture up to the first dot
pattern = r'(?:https?://)?(?:www\.)?(.*?)\.'

df['Domain'] = df['Website.'].str.extract(pattern)
df['Domain_Len'] = df['Domain'].str.len()

print(df)

    Website.             Label  Domain          Domain_Len
0   18egh.com            1      18egh           5
1   fish.co.uk           0      fish            4
2   www.description.com  1      description     11
3   http://world.com     1      world           5

Alternatively:

# Optional scheme, then optional "www.", then split at the first dot
pattern = r'(?:https?://)?(?:www\.)?(.*?)\.(.*?)$'

df[['Domain', 'TLD']] = df['Website.'].str.extract(pattern, expand=True)
df['TLD_Len'] = df['TLD'].str.len()
df['Domain_Len'] = df['Domain'].str.len()

print(df)

    Website.             Label  Domain       TLD    TLD_Len  Domain_Len
0   18egh.com            1      18egh        com    3        5
1   fish.co.uk           0      fish         co.uk  5        4
2   www.description.com  1      description  com    3        11
3   http://world.com     1      world        com    3        5
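
One caveat with the regex route: it splits at the first dot, so it only works when there is no subdomain other than www. A quick sketch with plain `re`, using an illustrative pattern of the same shape (the helper name `split_first_dot` is hypothetical):

```python
import re

# Illustrative pattern: optional scheme, optional "www.", split at the first dot
PATTERN = re.compile(r'(?:https?://)?(?:www\.)?(.*?)\.(.*?)$')

def split_first_dot(url: str):
    """Return (domain, rest) as the regex sees them, or None if there is no dot."""
    match = PATTERN.match(url)
    return match.groups() if match else None

print(split_first_dot('fish.co.uk'))                   # ('fish', 'co.uk')
print(split_first_dot('http://forums.news.cnn.com/'))  # ('forums', 'news.cnn.com/')
```

The second call shows the subdomain being mistaken for the domain, which is the case tldextract handles.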
