
I have a dataset containing various URLs, for example:

https://www.thetrainline.com/buytickets/combinedmatrix.aspx?Command=TimeTable
https://wwf-fb.zyngawithfriends.com/wwf-fb.a84485c126e67ea2787c.html
http://www.thetrainline.com/destinations/trains-to-london

I want to do a semantic analysis of the URLs (the keywords in the URL after the /).

Please help me out.

Thanks

Jaap
Bitanshu Das

2 Answers


This is substantially faster and more comprehensive than anything you'll get doing it manually.

library(urltools)

URLs <- c("https://www.thetrainline.com/buytickets/combinedmatrix.aspx?Command=TimeTable",
          "https://wwf-fb.zyngawithfriends.com/wwf-fb.a84485c126e67ea2787c.html",
          "https:/test.com/thing.php?a=1&b=2",
          "http://www.thetrainline.com/destinations/trains-to-london")

url_parse(URLs)

##   scheme                      domain port                             path         parameter fragment
## 1  https        www.thetrainline.com        buytickets/combinedmatrix.aspx command=timetable         
## 2  https wwf-fb.zyngawithfriends.com      wwf-fb.a84485c126e67ea2787c.html                           
## 3                              https                    test.com/thing.php           a=1&b=2         
## 4   http        www.thetrainline.com         destinations/trains-to-london   
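Building on the parsed output, the `path` column is what the question is after: splitting it on separators yields candidate keywords for the semantic analysis. A minimal sketch (column names as produced by `urltools::url_parse`; the separator set is an assumption you may want to tune):

```r
library(urltools)

URLs <- c("https://www.thetrainline.com/buytickets/combinedmatrix.aspx?Command=TimeTable",
          "http://www.thetrainline.com/destinations/trains-to-london")

parsed <- url_parse(URLs)

# split each path on "/", ".", "_" and "-" to get candidate keywords
keywords <- strsplit(parsed$path, "[/._-]+")
keywords
## e.g. the second URL yields "destinations" "trains" "to" "london"
```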
hrbrmstr
  • What if we have irregular malicious URLs? i.e. (http://.https://www.something-bad.com/something/whatever.com/https://www.somethingelse.org/)Will it still recognise the different parts? (domain, path, etc.) – Sotos Feb 05 '16 at 15:04
  • Also, your 3rd URL is not parsed correctly...is it?...Oh, you forgot a `/` – Sotos Feb 05 '16 at 15:10
  • If you try and parse this: `url2 <- c('http://https://gallery46.co.il/wp-content/themes/twentytwelve/js/cj/1.html?http://www.freefilefillableforms.com + ')` it will fail. – Sotos Feb 05 '16 at 15:23
  • @Sotos because it's an invalid url. Pull requests are welcome. – hrbrmstr Feb 05 '16 at 15:54
  • I did come across actual online phishing sites that do use the "https://" in their domain - I have 251 such cases in my dataset- so as to confuse users. It is a great package, but could be updated to include invalid-like URLs. – Sotos Feb 05 '16 at 15:57
  • the behavior you'd like for invalid URLs should be…? the package works with https URLs. Try `httr::parse_url`, too. – hrbrmstr Feb 05 '16 at 16:05
  • Will do. Monday when I'm back at work I will try it with my "invalids" and let you know. Thanks. – Sotos Feb 05 '16 at 17:52
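As hrbrmstr suggests in the comments, `httr::parse_url` is another option. Unlike `urltools::url_parse`, it takes a single URL and returns a named list rather than a data frame (a quick sketch):

```r
library(httr)

p <- parse_url("https://www.thetrainline.com/buytickets/combinedmatrix.aspx?Command=TimeTable")

p$scheme    # URL scheme
p$hostname  # domain
p$path      # everything after the domain
p$query     # query parameters as a named list
```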
URLs1 <- c('http://www.thetrainline.com/destinations/trains-to-london',
           'https://wwf-fb.zyngawithfriends.com/wwf-fb.a84485c126e67ea2787c.html',
           'https://www.thetrainline.com/buytickets/combinedmatrix.aspx?Command=TimeTable')

gsub('^(?:[^/]*/){3}', '/', URLs1)
## [1] "/destinations/trains-to-london"
## [2] "/wwf-fb.a84485c126e67ea2787c.html"
## [3] "/buytickets/combinedmatrix.aspx?Command=TimeTable"
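The regex matches everything up to and including the third `/` (scheme, the empty segment between `//`, and the domain), leaving only the path. A base-R sketch following the same idea that also drops any query string and splits the remainder into keywords (the separator set is an illustrative choice):

```r
URLs1 <- c("http://www.thetrainline.com/destinations/trains-to-london",
           "https://www.thetrainline.com/buytickets/combinedmatrix.aspx?Command=TimeTable")

# strip everything up to and including the third "/" (scheme + "//" + domain)
paths <- gsub("^([^/]*/){3}", "/", URLs1)

# drop any query string, then split on "/", "." and "-"
paths <- sub("\\?.*$", "", paths)
keywords <- strsplit(sub("^/", "", paths), "[/.-]+")
keywords
## e.g. the first URL yields "destinations" "trains" "to" "london"
```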
Sotos