Remove everything after "?" from URL in data frame using R

Question

I want to remove everything after the first ? character in a URL. 3 of the 6 rows in my sample data contain the ? character; the other 3 are OK as is.

structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/", 
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http", 
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/", 
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http", 
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL", 
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")

I tried:

df1$URL<-sub("?:.*$","",df1$URL)

and this seems to have no effect.

I also tried:

df1$URL<-sapply(str_split(df1$URL,"?"),"[",1)

and this generated an error message.

Third attempt:

df1$URL<-sapply(strsplit(df1$URL,"?"),"[",1)

removed everything from my URL field except a forward slash.

score 2 · Answer 1 · answered Nov 03 '15 at 17:01

You can, and probably should, use URL-specific tools to handle URLs. The urltools package has something ready-made for this:

library(urltools)

dat <- structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/", 
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http", 
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/", 
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http", 
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL", 
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")


url_parse(dat$URL)$path

thanks, not sure I understand why the package solution is preferred. Isn't a general solution better, at least from a learning standpoint? Also, I don't recall my exact search terms, but I don't recall urltools surfacing in the searches I did. — user3614783, Nov 03 '15 at 17:14
There's `httr::parse_url` and `urltools::url_parse` (plus a couple of others). You're working with URLs. Data-domain-specific tools have (generally) taken into account edge cases that you might not consider with a naive solution. Plus, you may be able to use the other components it separates the URLs into. — hrbrmstr, Nov 03 '15 at 17:19

score 1 · Accepted Answer · answered Nov 03 '15 at 16:44

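You need to escape the ? because it is a regex metacharacter (a quantifier meaning "zero or one of the preceding item"), so an unescaped ? does not match a literal question mark.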

df1$URL <- sub("\\?.*","",df1$URL)
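A minimal sketch of that fix on a shortened vector. The `urls` vector and the variable names are made up for illustration, in the same shape as the question's URL column. It also shows a character class as an equivalent way to write a literal question mark:

```r
# Hypothetical sample in the same shape as the question's URL column
urls <- c("/2015/09/25/animated-dune/",
          "/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553")

# Escape the metacharacter: "\\?" in R source is the regex \?
cleaned_escaped <- sub("\\?.*", "", urls)

# Alternative: inside a character class, ? is taken literally
cleaned_class <- sub("[?].*", "", urls)

print(cleaned_escaped)
```

URLs without a ? are returned unchanged, because `sub` leaves a string alone when the pattern does not match.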

Avinash Raj

Thanks, this works perfectly. Am I always better off using the double backslash just in case the character I'm splitting on is a special character? – user3614783 Nov 03 '15 at 16:55
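Yes, whenever you want to match the character literally. In regex, characters such as `[`, `]`, `{`, `}`, `(`, `)`, `+`, `*`, `?`, `.`, `^`, `$`, `|` and `\` are treated as special, so you must escape them. – Avinash Raj Nov 03 '15 at 16:57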
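As an alternative to escaping by hand, base R's pattern functions (`sub`, `gsub`, `strsplit`, `grepl`, ...) accept `fixed = TRUE`, which treats the pattern as a plain string rather than a regex. A brief sketch, with made-up input vectors and variable names:

```r
# Hypothetical sample: one URL without a query string, one with
urls <- c("/a/b/", "/c/d/?x=1&y=2")

# With fixed = TRUE the "?" is a literal character, so no escaping is needed
first_part <- sapply(strsplit(urls, "?", fixed = TRUE), `[`, 1)

# "." illustrates why an unescaped metacharacter can misbehave silently
dot_regex   <- sub(".", "", "a.b")                # "." matches any character, so "a" is removed
dot_escaped <- sub("\\.", "", "a.b")              # escaped: removes the literal dot
dot_fixed   <- sub(".", "", "a.b", fixed = TRUE)  # fixed: removes the literal dot

print(first_part)
```

Escaping is the better fit when the rest of the pattern needs regex features (as in `"\\?.*"`), while `fixed = TRUE` suits cases where the whole pattern is a literal string.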
