Parsing URL in R to extract specific data

Question

I have a data file which has a URL column in it. It looks something like this "https://www.google.com/ | query_string=utm_source=abc&utm_medium=yts&utm_campaign=123campaign&utm_term=camp%123&utm_content=brand&gclid=abcdefg|user_agent=xyz"

I want these data in seperate columns with their respective values as shown below

utm_source utm_medium utm_campaign utm_term utm_content user_agent
abc          yts      123campaign  camp%123   brand         xyz

Using dput for URL results in

c("https://www.google.com/ | query_string=null | ip_address=123.113.64.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36","https://www.google.com/ | query_string=gclid=Lxi6sNo-A17RohDAcQgvD_fw4 | ip_address=167.11.116.237 | user_agent=Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36","http://m.facebook.com/ | query_string=utm_source=fb&utm_medium=ctw&utm_campaign=abcPant_rem&utm_content=PantShirt | ip_address=106.193.181.252 | user_agent=Mozilla/5.0 (Linux; Android 10; SM-G975F Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/81.0.4044.138 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/218.0.0.32.158;]")

are you sure the link you provided works? – Waldi Sep 28 '20 at 12:03 — Waldi, Sep 28 '20 at 12:03
No thats just a sample..but that's how the structure is – anonymus Sep 29 '20 at 04:18 — anonymus, Sep 29 '20 at 04:18

Allan Cameron · Answer 1 · 2020-10-04T14:47:25.020

0

Only one of the entries in URL contains multiple fields in the query string, and the first one doesn't contain any. You can't really make a data frame from the example in the question, but you can make a list of named vectors containing the fields in the query string like this:

queries <- sapply(strsplit(sapply(strsplit(URL, "query_string="), 
                           `[`, 2), " \\|"), `[`, 1)

lapply(strsplit(queries, "\\&|="), function(x) 
  setNames(x[seq(length(x)/2) * 2], x[seq(length(x)/2) * 2 - 1]))
#> [[1]]
#> null 
#>   NA 
#> 
#> [[2]]
#>                       gclid 
#> "Lxi6sNo-A17RohDAcQgvD_fw4" 
#> 
#> [[3]]
#>    utm_source    utm_medium  utm_campaign   utm_content 
#>          "fb"         "ctw" "abcPant_rem"   "PantShirt"

edited Oct 04 '20 at 14:47

answered Sep 28 '20 at 12:14

Allan Cameron

147,086
7
49
87

I used the first command which is fields=strsplit...but getting the below error In charToRaw(URL) : argument should be a character vector of length 1 all but the first element will be ignored – anonymus Sep 28 '20 at 12:30
@anonymus that doesn't make any sense. Is `url` a character vector? – Allan Cameron Sep 28 '20 at 13:24
Its a column consisting of multiple URLs. I am still getting an error.So if my data is stored in Mydata and the URL column inside Mydata is MyURL then I have replaced your command as fields <- strsplit(URLdecode(URL =Mydata$MyURL), "&|=")[[1]][-1] – anonymus Sep 29 '20 at 21:45
Any answer @Allan – anonymus Sep 30 '20 at 12:41
@anonymous it seems as though URLdecode doesn't like vectorized input. Do you get a correct result with `Mydata$MyURL[1]`? – Allan Cameron Sep 30 '20 at 12:49
USed what you have mentioned but in the second seq code getting this error Error in names(object) <- nm : 'names' attribute [1] must be the same length as the vector [0] – anonymus Oct 01 '20 at 05:21
not getting the solution using this..I am still stuck here – anonymus Oct 03 '20 at 16:20
can you please help with this? – anonymus Oct 04 '20 at 09:38
@anonymus Could you maybe edit your question to include the result of `dput(mydata)`? I'm sure I could get it to work if you did that. – Allan Cameron Oct 04 '20 at 11:15
@anonymus see my update. With the data you have, the fields in the query strings are all different. You can't make a useful data frame out of them. – Allan Cameron Oct 04 '20 at 14:48
@anonymus You must misunderstand me. I'm not saying that I can't find a solution, I'm saying that all your URLs have different fields, so it wouldn't make sense to put them in a data frame. In the first row for example, when the query string is `null`, what would you put in the columns `utm_medium`, `utm_source` etc? You _could_ put them in a dataframe by having all the fields from each url as column headers, but for the URL object in your question, this would be pointless. There _is_ no remaining problem - your data is what it is. – Allan Cameron Oct 04 '20 at 19:24
Understood @AllanCameron – anonymus Oct 04 '20 at 19:34

score 0 · Answer 2 · answered Sep 28 '20 at 12:32

Here's a regex solution using the URL provided.

url <- "https://www.google.com/ | query_string=utm_source=abc&utm_medium=yts&utm_campaign=123campaign&utm_term=camp%123&utm_content=brand&gclid=abcdefg|user_agent=xyz"

str_match_all extracts patterns.

\\w+: Match one or more word characters
(...): Capture group
(?:...)?: Match group zero or one times, but don't capture the group. This is used to handle the query_string= part of the URL.

stringr::str_match_all(url, "(?:\\w+=)?(\\w+)=(\\w+)")
#> [[1]]
#>      [,1]                          [,2]           [,3]         
#> [1,] "query_string=utm_source=abc" "utm_source"   "abc"        
#> [2,] "utm_medium=yts"              "utm_medium"   "yts"        
#> [3,] "utm_campaign=123campaign"    "utm_campaign" "123campaign"
#> [4,] "utm_term=camp"               "utm_term"     "camp"       
#> [5,] "utm_content=brand"           "utm_content"  "brand"      
#> [6,] "gclid=abcdefg"               "gclid"        "abcdefg"    
#> [7,] "user_agent=xyz"              "user_agent"   "xyz"

str_match_all returns a list of matrices where the first column is the complete match followed by each of the captured groups. Keep only the captured groups.

stringr::str_match_all(url, "(?:\\w+=)?(\\w+)=(\\w+)")[[1]][,2:3]
#>      [,1]           [,2]         
#> [1,] "utm_source"   "abc"        
#> [2,] "utm_medium"   "yts"        
#> [3,] "utm_campaign" "123campaign"
#> [4,] "utm_term"     "camp"       
#> [5,] "utm_content"  "brand"      
#> [6,] "gclid"        "abcdefg"    
#> [7,] "user_agent"   "xyz"

I tried this but getting an error like as shown below: Error in stri_match_all_regex(string, pattern, omit_no_match = TRUE, opts_regex = opts(pattern)) : argument `str` should be a character vector (or an object coercible to) — anonymus, Sep 29 '20 at 21:47
The `url` variable needs to be a character vector. Look at the result of `class(url)`. — Paul, Sep 29 '20 at 22:18
class URL is a character. I am having a column consisting of similar URL's. For one URL, I am getting this solution but when I consider for all, I am unable to split — anonymus, Oct 04 '20 at 10:11

Parsing URL in R to extract specific data

2 Answers2