0

I have a string in the below format:

a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
       StartDate=2015-05-20, EndDate=2015-05-20, performance=best")

My aim is to get the final result in a dataframe as below:

first_name   cust_id   start_date    end_date    performance           cust_notes
 James(Mr)     98503   2015-05-20  2015-05-20           best   ZZW_LG,WGE,zonaire

I ran the following code:

a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
       StartDate=2015-05-20, EndDate=2015-05-20, performance=best")

split_by_comma <- strsplit(a,",")

split_by_equal <- lapply(split_by_comma,strsplit,"=")

Since the custid had got additional commas and brackets, I am not getting desired result.

Please note that brackets in first name are genuine and needed as it is.

maddy kemen
  • 67
  • 2
  • 9
  • according to your example, using `split_by_comma <- strsplit(a,", ")` (with a space after the comma) should do the trick. Edit: while trying your code I got hit by a carriage return and had to switch to `split_by_comma <- strsplit(a,c(", ","\n")` with two split options, comma and space or newline. This would work only if there no comma followed by space in your datas. (not really an answer so posting as a comment) – Tensibai Aug 19 '15 at 12:14

3 Answers3

1

You need to split by this.

,(?![^()]*\\))

You need lookahead.This will not split by , within ().See demo.

https://regex101.com/r/uF4oY4/82

To get desired result use

split_by_comma <- strsplit(a,",(?![^()]*\\))",perl=TRUE)

split_by_equal <- lapply(split_by_comma,strsplit,"=")
vks
  • 67,027
  • 10
  • 91
  • 124
  • I get below error when i execute that Error in strsplit(d, ",(?![^()]*\\))") : invalid regular expression ',(?![^()]*\))', reason 'Invalid regexp' – maddy kemen Aug 19 '15 at 09:32
  • 2
    Because `perl=T` is required with R regex lookarounds. – Wiktor Stribiżew Aug 19 '15 at 09:33
  • @maddykemen can you try with `perl=T` option – vks Aug 19 '15 at 09:34
  • here the custid splits all the way and we do have text part also.Thats where I was facing problem initially – maddy kemen Aug 19 '15 at 09:37
  • @maddykemen you can use `gsub` later to remove `(?<=\\d)\\(.*?\\)` and replace by `space` before split by `=` – vks Aug 19 '15 at 09:39
  • @maddykemen after split by `,(...)` do a gsub over all elements of list to remove `()` from them – vks Aug 19 '15 at 09:42
  • @maddykemen 1)do `strsplit(a,",(?![^()]*\\))",perl=TRUE)` 2)Loop over and do `gsub("(?<=\\d)\\(.*?\\)","",some_string)` 3)do `lapply(split_by_comma,strsplit,"=")` – vks Aug 19 '15 at 09:52
  • Is there anyway to get in dataframe format first_name cust_id start_date end_date performance cust_notes James(Mr) 98503 2015-05-20 2015-05-20 best ZZW_LG,WGE,zonaire – maddy kemen Aug 19 '15 at 09:57
0

If your string format holds true, this might be a quick solution:

library(httr)

a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20, 
        EndDate=2015-05-20, performance=best")

dat <- data.frame(parse_url(sprintf("?%s", gsub(",[[:space:]]+", "&", a)))$query, 
           stringsAsFactors=FALSE)

library(tidyr)
library(dplyr)

mutate(separate(dat, cust_id, into=c("cust_id", "cust_notes"), sep="\\("), 
       cust_notes=gsub("\\)", "", cust_notes))

##   first_name cust_id         cust_notes  StartDate    EndDate performance
## 1  James(Mr)   98503 ZZW_LG,WGE,zonaire 2015-05-20 2015-05-20        best

Extrapolation:

  • gsub(",[[:space:]]+", "&", a) makes the parameters look like a components of a URL query string.
  • sprintf(…) make it look like an actual query string
  • parse_url (from httr) will separate the key/value pairs out and stick them in a list (named query) in the returned list
  • data.frame will, well…
  • separate will split the cust_id column for you at the ( into two columns
  • mutate will remove the ) in the new cust_notes column

Here's the whole thing as a "pipe":

library(httr)
library(tidyr)
library(dplyr)
library(magrittr)

a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20, 
        EndDate=2015-05-20, performance=best")

a %>% 
  gsub(",[[:space:]]+", "&", .) %>% 
  sprintf("?%s", .) %>% 
  parse_url() %>% 
  extract2("query") %>% 
  data.frame(stringsAsFactors=FALSE) %>% 
  separate(cust_id, into=c("cust_id", "cust_notes"), sep="\\(") %>% 
  mutate(cust_notes=gsub("\\)", "", cust_notes))

which matches the extrapolation and is (IMO) easier to follow.

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
0

Late reply, but posted it since its very simple to understand and implement without using any additional packages

rawdf = read.csv("<your file path>", header = F, sep = ",", stringsAsFactors = F)
# Get the first row of the dataframe and transpose it into a column of a df
colnames = data.frame(t(rawdf[1,]))

# Split the values of the single column df created above into its key value
# pairs which are separated by '=' and save in a vector
colnames = unlist(strsplit(as.character(colnames$X1), "="))

# Pick up all the odd indexed values from the above vector (all odd places
# are colnames and even places the values associated with them)
colnames = colnames[seq(1,length(colnames),2)]

# Assign the extracted column names from the vector above to your original data frame
colnames(rawdf) = colnames

# Use the regex to extract the value in each field of the original df by
# replacing the 'Key=' pattern present in each field with an empty string 
for(i in 1:dim(rawdf)[2]) rawdf[,i] = gsub(paste(colnames[i],"=",sep=""), "", rawdf[,i])