I want to remove everything after the first ? character in a URL. 3 of the 6 rows in my sample data contain the ? character; the other 3 are OK as is.

df1 <- structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/", 
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http", 
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/", 
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http", 
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL", 
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")

I tried:

df1$URL<-sub("?:.*$","",df1$URL)

and this seems to have no effect.

I also tried:

df1$URL<-sapply(str_split(df1$URL,"?"),"[",1)

and this generated an error message.

Third attempt:

df1$URL<-sapply(strsplit(df1$URL,"?"),"[",1)

removed everything from my URL field except a forward slash.

Rich Scriven
user3614783

2 Answers

You can, and probably should, use URL-specific tools to handle URLs. The urltools package has something ready-made for this:

library(urltools)

dat <- structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/", 
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http", 
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/", 
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http", 
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL", 
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")


url_parse(dat$URL)$path
hrbrmstr
  • thanks, not sure I understand why the package solution is preferred. Isn't a general solution better, at least from a learning standpoint? Also, I don't recall my exact search terms, but I don't recall urltools surfacing in the searches I did. – user3614783 Nov 03 '15 at 17:14
  • There's `httr::parse_url` and `urltools::url_parse` (plus a couple of others). You're working with URLs. Data-domain-specific tools have (generally) taken into account edge cases that you might not consider with a naive solution. Plus, you may be able to use the other components it separates the URLs into. – hrbrmstr Nov 03 '15 at 17:19
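One edge case of the kind mentioned in the comment above can be sketched in base R. The two URLs are made up for illustration and are not from the question's data; the point is only that a pattern anchored on `?` leaves a fragment (`#...`) in place when there is no query string. Whether fragments should be dropped at all depends on your data, so treat this as an assumption rather than a recommendation.

```r
# Hypothetical URLs: one with a query string, one with only a fragment
urls <- c("/post/?et_cid=1&linkid=http", "/post/#comments")

# Stripping from the first "?" leaves the fragment on the second URL
query_only <- sub("\\?.*", "", urls)
print(query_only)

# Stripping from the first "?" or "#" handles both cases
query_or_fragment <- sub("[?#].*", "", urls)
print(query_or_fragment)
```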

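You need to escape `?` because it is a regex metacharacter (a quantifier meaning "zero or one of the preceding item"), so an unescaped `?` does not match a literal question mark.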

df1$URL <- sub("\\?.*","",df1$URL)
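
A minimal sketch of the same call on a small vector, in case it helps to see it in isolation. The vector `urls` is a shortened, made-up stand-in for the `URL` column in the question, not the asker's actual data.

```r
# Hypothetical sample: shortened versions of the URLs in the question
urls <- c("/2015/09/25/animated-dune/",
          "/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&linkid=http")

# "\\?" matches a literal question mark; ".*" consumes the rest of the string.
# Strings without a "?" do not match and are returned unchanged.
cleaned <- sub("\\?.*", "", urls)
print(cleaned)
```

Because `.*` is greedy and `sub` replaces only the first match, everything from the first `?` to the end of the string is removed, even if later `?` characters appear.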
Avinash Raj
  • Thanks, this works perfectly. Am I always better off using the double backslash just in case the character I'm splitting on is a special character? – user3614783 Nov 03 '15 at 16:55
  • Yes. In regex, `.`, `?`, `+`, `*`, `^`, `$`, `|`, `\`, `(`, `)`, `[`, `]`, `{` and `}` are treated as special characters. You must escape them to match them literally. – Avinash Raj Nov 03 '15 at 16:57
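
If you would rather not track which characters need escaping, base R offers two alternatives that are sketched below on a made-up URL (the string `x` is an assumption for illustration): `fixed = TRUE` makes the pattern a literal string, and a character class such as `[?]` removes the special meaning of `?`.

```r
# Hypothetical URL, shortened from the question's sample data
x <- "/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&linkid=http"

# Option 1: fixed = TRUE treats the pattern as a literal string, not a regex
via_fixed <- strsplit(x, "?", fixed = TRUE)[[1]][1]

# Option 2: inside a character class, "?" is matched literally
via_class <- sub("[?].*", "", x)

print(via_fixed)
print(via_class)
```

Note that `fixed = TRUE` only works when the whole pattern is literal, so it suits the `strsplit` approach but not a pattern like `\\?.*`, which still needs `.*` to be interpreted as a regex.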