
I have this df:

dput(df)
structure(list(URLs = c("http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html", 
"http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/", 
"http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/", 
"http://bmbt.ro/portal/a3/_Votorantim_/VotorantimCartoes2016/0-7f2f5cb67f1-22928b.html", 
"http://voeazul.nl/azul/")), .Names = "URLs", row.names = c(NA, 
-5L), class = "data.frame")

It contains several URLs, and I am trying to count the number of characters in each host name, whether that is an actual name (http://hostname.com/...) or an IP address (http://000.000.000.000/...). If it is an actual name, I only want the number of characters between www. and the top-level domain (e.g. .com). If it is an IP address, I want all of its digits and the dots in between.

Expected Outcome for the above sample data:

exp_outcome
1           8
2          13
3          15
4           4
5           7

I tried to do something with strsplit but could not get anywhere.

Sotos

3 Answers


Another, maybe more direct way with a different regex:

nchar(sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs))
#[1]  8 13 15  4  7

explanation:

  • ^http://: matches "http://" at the beginning of the string
  • (www\\.)?: matches "www." zero or one time (so it is optional)
  • (([a-z]+)|([0-9.]+)): the group that is captured and kept (\\2): either one or more lowercase letters, or one or more digits and dots
  • (\\.[a-z]+)?: matches "." followed by one or more lowercase letters, zero or one time (so again optional)
  • /+.+$: matches one or more "/" followed by one or more of any character, up to the end of the string
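
To see which part of each URL ends up in the second capture group, the pattern can also be run through regexec() and regmatches(). This is only a sketch on three of the sample URLs from the question; the variable names (urls, pattern, groups, hosts) are made up for the illustration.

```r
# Three of the sample URLs from the question.
urls <- c(
  "http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html",
  "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
  "http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/"
)
pattern <- "^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$"

# regexec() reports the full match followed by one entry per capture group,
# so capture group 2 sits at position 3 of each result.
groups <- regmatches(urls, regexec(pattern, urls))
hosts <- vapply(groups, function(g) g[3], character(1))

hosts         # the captured host names
nchar(hosts)  # their lengths
```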

NB:

sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs)
# [1] "bursesvp"        "46.165.216.78"   "chalcedonyhotel" "bmbt"            "voeazul"  
Cath
  • This is great, and I prefer it to my own solution — but I’ve got to ask: why write `[/]{2}` instead of `/{2}` or even `//`, and why write `w{3}` instead of `www`? It’s longer and less readable. Also, instead of `(www)*` you should use `(www)?` because we want “zero or one”, not “zero or more” (and the same later). And one last thing: domain names can contain more than just letters, they can also contain digits and dashes, and many other things. So a character class won’t cut it, you probably have to accept anything except `.` and `/` here. – Konrad Rudolph Jan 13 '16 at 16:07
  • @KonradRudolph thanks, [/] it's to avoid having to escape / and [//] won't work. w{3} it's just because I didn't want to repeat www (personal choice...). Thanks for the remark on "?", it would indeed be more appropriate. I'm not used to ? apart from lookarounds. – Cath Jan 13 '16 at 16:12
  • Actually, [Wikipedia says](https://en.wikipedia.org/wiki/Domain_name) that only characters and numbers are allowed in domain names, and dashes when surrounded by the former (so this simplifies a fix for my concern noted above). And there’s no need to escape `/`. Only `\\` needs to be escaped. – Konrad Rudolph Jan 13 '16 at 16:13
  • @KonradRudolph good point, wrong idea of mine. I've edited accordingly, it makes the regex more readable, thanks – Cath Jan 13 '16 at 16:14
  • @KonradRudolph re *that only characters and numbers are allowed in domain names,* <- that's actually not true since 2009 when ICANN allowed non-ASCII chars [ref here](https://www.icann.org/news/announcement-2009-10-30-en). So I would use `[^/.]+` to match the domain in a general manner. – Tensibai Jan 13 '16 at 16:21
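
The `[^/.]+` suggestion from the comments could be combined with an explicit IP alternative roughly as follows. This is a sketch, not the answer's regex: the third URL is hypothetical, and perl = TRUE is used here so that the alternation is tried left to right and the IP branch wins before `[^/.]+` can stop at the first dot.

```r
# The first two URLs come from the question; the third is a made-up
# example of a host containing a dash and a digit.
urls <- c(
  "http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
  "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
  "http://my-site2.org/index.html"
)

# Group 1: optional "www."
# Group 2: either a dotted IPv4-looking host, or anything up to the
#          next "." or "/" (tried in that order under perl = TRUE).
pattern <- "^http://(www\\.)?((?:[0-9]{1,3}\\.){3}[0-9]{1,3}|[^/.]+).*$"

hosts <- sub(pattern, "\\2", urls, perl = TRUE)
hosts
nchar(hosts)
```

Note that for a host with subdomains other than www, this keeps the first label, not the registered domain.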

Here’s how to do it (assuming your data.frame is called x):

# Keep everything between "http://" and the next "/"
domains = sub('^(http://)([^/]+)(.*)$', '\\2', x$URLs)
# This will fail for IP addresses …
hostname = sub('^(www\\.)?([^.]+)(\\..+)?$', '\\2', domains)
# … which we treat separately here:
is_ip = grepl('^(\\d{1,3}\\.){3}\\d{1,3}$', domains)
hostname[is_ip] = domains[is_ip]

# Store the result as a new column of x
x$domain_length = nchar(hostname)

On a side note, I converted your original data.frame to character strings — it simply makes no sense to use a factor for URLs.
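
For reuse, the two steps could be wrapped in a small helper that also coerces factors to character first. This is a minimal sketch; host_name is a hypothetical name, not something from the answer, and the input is deliberately built as a factor to show the coercion.

```r
# Hypothetical helper: two-step host extraction as described above.
host_name <- function(urls) {
  urls <- as.character(urls)  # nchar() would fail on a factor
  # Step 1: everything between "http://" and the next "/"
  domains <- sub("^http://([^/]+).*$", "\\1", urls)
  # Step 2: drop an optional "www." and everything from the next dot on
  hostname <- sub("^(www\\.)?([^.]+)(\\..+)?$", "\\2", domains)
  # Step 3: IP addresses are kept whole
  is_ip <- grepl("^([0-9]{1,3}\\.){3}[0-9]{1,3}$", domains)
  hostname[is_ip] <- domains[is_ip]
  hostname
}

# Two of the sample URLs from the question, stored as a factor.
urls <- factor(c(
  "http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
  "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/"
))
hosts <- host_name(urls)
nchar(hosts)
```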

Konrad Rudolph

After 5 months of dealing with URLs in general, I found the following packages, which make life a bit easier (the regexes provided in the other answers do work great, by the way):

library(urltools)
library(iptools)

df$Hostname <- domain(df$URLs)
#However, TLDs and 'www' need to go, so I used suffix_extract()$domain from `urltools`
df$Hostname <- ifelse(is.na(suffix_extract(df$Hostname)$domain), df$Hostname, 
                                          suffix_extract(df$Hostname)$domain)

#which gives:
#   URLs                                                   Hostname
#1  http://bursesvp.ro//portal/user/_/...                  bursesvp
#2  http://46.165.216.78/.CartoesVotorantim/Usuarios/...   46.165.216.78
#3  http://www.chalcedonyhotel.com/images/promoc/          chalcedonyhotel
#4  http://bmbt.ro/portal/a3/_Votorantim_/...              bmbt
#5  http://voeazul.nl/azul/                                voeazul

#then simply,

nchar(df$Hostname)
#[1]  8 13 15  4  7
Sotos