Examples of 'Referer URL' values are shown below:

https://www.google.com/ | query_string=utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiOcHGGw6JEiJaf5zMhRxFk-AOtiXMOd_1szoBoCUEMQAvD_BwE | ip_address=123.21.62.57 | user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
https://www.google.com/ | query_string=gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE | ip_address=187.11.116.117 | user_agent=Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36

Other URLs, which have no parameters, look like this:
https://m.facebook.com/
instagram.com
https://l.facebook.com
/https://www.google.com/
http://m.facebook.com


I am using the code below to separate the URL parameters above and create a new column for each parameter:

Mydata$ref_url<-trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,1])

Mydata$query_string<-gsub("query_string=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,2]))

Mydata$ip_address<-gsub("ip_address=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,3]))

Mydata$user_agent<-gsub("user_agent=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,4]))

Each of these statements gives the error below:

    Error: Assigned data `trimws(...)` must be compatible with existing data.
    x Existing data has 2645 rows.
    x Assigned data has 1096 rows.
    i Only vectors of size 1 are recycled.
    Run `rlang::last_error()` to see where the error occurred.
    In addition: Warning message:
    In matrix(unlist(strsplit(as.character(Mydata$"Referer URL"), "|",  :
      data length [4382] is not a sub-multiple or multiple of the number of rows [1096]

How can I rectify this issue?

Anonymus
  • Have a look at [How to Ask](https://stackoverflow.com/help/how-to-ask) - the best way to get help here is to post example data, and expected output, along with an explanation of your problem. The description you've given is a little hard to follow without a concrete example. When possible, post a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Abdessabour Mtk Oct 06 '20 at 18:55
  • @Anonymus You haven't given enough information to help you, but there are a couple of things that I can see. Firstly, `strsplit` produces a list of substrings, but it is likely that there are a different number of substrings in each list element (i.e. a different number of `|` characters in each Referer URL). Your code implies that you are expecting exactly 3 `|` characters in each one. I would check that assumption. Secondly, you are trying to write a 4-column matrix into a single dataframe column. That doesn't seem right. – Allan Cameron Oct 06 '20 at 18:59
  • @AllanCameron added more details, please check – Anonymus Oct 06 '20 at 19:08
  • @Anonymus your code would work if your data frame had the single url in your example. But like I said in my last comment and last answer, you are assuming that it should work for all your urls. This is very unlikely, because if any of them have more or less than 3 `|` symbols, or if they have different fields (as they did in your last question) then your code will break. Without seeing the other URLs there is no way for anyone to help you. Text parsing is very dependent on the **exact** text you are trying to parse. There is no quick fix here. We need to see your data. – Allan Cameron Oct 06 '20 at 19:18
  • @AllanCameron adding top 10 rows in the dataset – Anonymus Oct 06 '20 at 19:20
  • @AllanCameron added, please check & update – Anonymus Oct 06 '20 at 19:25
  • @Anonymus your code works fine on that sample too. So the first 10 rows work. Why don't you test the first 20? Then the first 40? Then the first 80. At some point, the code will break and you'll be able to home in on the line that is causing it to break – Allan Cameron Oct 06 '20 at 19:34
  • @AllanCameron the thing is there are URLs which do not contain the query strings, ip address and other parameters..it only includes domain..like www.google.com in the URL..So part of URL is like the above one and part of it contains domain..then how should I proceed? – Anonymus Oct 06 '20 at 19:42
  • @Anonymus exclude them from your data frame? – Allan Cameron Oct 06 '20 at 19:45
  • No I need all the data, wherever the values are not present need to show NA and wherever it is present..I need to parse them like in the formula..without any loss of data – Anonymus Oct 06 '20 at 19:59
  • Is it possible to separate everything without loss of data? @AllanCameron – Anonymus Oct 06 '20 at 20:37
  • @Anonymous if you include a more realistic sample that includes URLs that break your current code I can have a look – Allan Cameron Oct 06 '20 at 21:26
  • @AllanCameron I have added a few URLs which do not contain the parameters. For such cases I want the rows to show NA or blank values, and if query_string is present in the URL then separate it. – Anonymus Oct 06 '20 at 21:39
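
The diagnosis in the comments (rows that do not split into exactly 4 pieces) can be checked directly. The following is a minimal sketch, not part of the original thread: `urls` is a hypothetical stand-in for `Mydata$'Referer URL'`, and the first sample string is abbreviated for readability.

```r
# Hypothetical stand-in for Mydata$'Referer URL' (first row abbreviated)
urls <- c(
  "https://www.google.com/ | query_string=null | ip_address=123.21.62.57 | user_agent=Mozilla/5.0",
  "https://m.facebook.com/",
  "instagram.com"
)

# strsplit() returns one character vector per row; lengths() counts the pieces
n_pieces <- lengths(strsplit(urls, "|", fixed = TRUE))

# How many rows have each piece count
table(n_pieces)

# Indices of the rows that do not split into exactly 4 pieces
bad_rows <- which(n_pieces != 4)
bad_rows
```

If every row had 4 pieces, `sum(n_pieces)` would equal 4 times the number of rows. In the error above, 4382 pieces for 2645 rows shows that many rows contain fewer than 4 pieces, which is why `matrix(..., ncol = 4)` ends up with 1096 misaligned rows.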

1 Answer

Using tidyverse, and provided you can guarantee that all the parameters always appear in the same order, the following code gives the desired output:

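library(tidyverse)
# `ref` is assumed to be a data frame whose column V1 holds the raw 'Referer URL' strings
# Split on " | "; rows with fewer than 4 pieces are padded with NA on the right
ref %>% separate(V1, paste0("V", 2:5), sep = " \\| ", fill = "right") -> separated
# Name the columns after the keys found in the first row
names(separated) <- c("url", gsub("=.+", "", separated[1, 2:4]))
# Strip the leading "key=" prefix from every value
separated %>% mutate_all(~ sub(".+?=", "", .))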
#>                            url                                                                                                                                          query_string     ip_address                                                                                                                    user_agent
#> 1      https://www.google.com/ utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiOcHGGw6JEiJaf5zMhRxFk-AOtiXMOd_1szoBoCUEMQAvD_BwE   123.21.62.57                                            Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
#> 2 https://www.Type2online.com/                                                                                                                                                  null 113.193.43.211           Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
#> 3      https://www.google.com/                                                     gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE 187.11.116.117 Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
#> 4      https://m.facebook.com/                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 5                instagram.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 6       https://l.facebook.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 7     /https://www.google.com/                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 8        http://m.facebook.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
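
For comparison, the same result can probably be reached in base R by padding each split row to 4 pieces before building the matrix. This is only a sketch under the same assumption (fields always in the same order); the small `Mydata` below is a hypothetical stand-in with an abbreviated first row, and any pieces beyond the fourth would be silently dropped.

```r
# Hypothetical stand-in for the real data (first row abbreviated)
urls <- c(
  "https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0",
  "https://m.facebook.com/",
  "instagram.com"
)
Mydata <- data.frame(`Referer URL` = urls, check.names = FALSE,
                     stringsAsFactors = FALSE)

# One character vector of pieces per row
pieces <- strsplit(as.character(Mydata$`Referer URL`), "|", fixed = TRUE)

# x[1:4] yields NA for missing positions, so every row has exactly 4 entries
padded <- t(vapply(pieces, function(x) trimws(x[1:4]), character(4)))

# Assign the columns, removing the "key=" prefix where it is present
Mydata$ref_url      <- padded[, 1]
Mydata$query_string <- sub("^query_string=", "", padded[, 2])
Mydata$ip_address   <- sub("^ip_address=", "", padded[, 3])
Mydata$user_agent   <- sub("^user_agent=", "", padded[, 4])
```

Because the split is done once and every row is padded, the assigned vectors always have the same length as the data frame, which avoids the "must be compatible with existing data" error.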

Abdessabour Mtk