1

I`m trying to transform a column into multiple columns.

This is my example:

df <- data.frame(Id = c(1,2,3), 
                 Col2 = c("['aaa' 'aaa aaaa' 'aaa.bb']","['aaa' 'aaa aaa' 'aaa bbb ccc' 'aaa'\n]","[]"))    
df


Id   Col2

1   ['aaa' 'aaa aaaa' 'aaa.bb']         
2   ['aaa' 'aaa aaa' 'aaa bbb ccc' 'aaa'\n]         
3   []

In my real case i can have 20 strings in each observation.

This would be my expected result:

df2 <- data.frame(Id =c(1,2,3),
             Col1 = c("aaa","aaa",NA),
             Col2 = c("aaa.aaaa","aaa.aaa",NA),
             Col3 = c("aaa.bb","aaa bbb ccc",NA),
             Col4 = c(NA,"aaa",NA))
df2


Id  Col1    Col2     Col3        Col4

1   aaa   aaa.aaaa  aaa.bb        NA
2   aaa   aaa.aaa   aaa bbb ccc  aaa
3   NA      NA       NA           NA

How can i separate by ""?

I tried to use separate function, but i can't seem to find the right sintax for "sep". "separate" seemed the best alternative for me, because I was using dplyr to maintain some previous columns.

Also, where can i find information about the using of sep. I haved saw some examples here, but I can't understand the rationale for using the characters

I also tried str_split, but i'm having trouble turning it back into a data frame.

Thanks in advance

RitaM
  • 143
  • 1
  • 6

2 Answers2

2

Add an explicit separator so that it is easy to get data in different columns. Here, I have added comma as a separator after every word group in single quotes ('....').

After that, perform some data cleaning using trimws and gsub, get data in long format using separate_rows splitting on comma, add an id column with data.table::rowid and get data in wide format in different columns using pivot_wider.

library(dplyr)
library(tidyr)

df %>%
  mutate(Col2 = gsub("('.*?')", "\\1,", Col2), 
         Col2 = trimws(Col2, whitespace = '\\[|\\]'), 
         Col2 = trimws(Col2, 'right', '[\n,]'), 
         Col2 = gsub("'", '', Col2)) %>%
  separate_rows(Col2, sep = ',\\s*') %>%
  mutate(row = data.table::rowid(Id)) %>%
  pivot_wider(names_from = row, values_from = Col2, names_prefix = 'col')

#     Id col1  col2     col3        col4 
#  <dbl> <chr> <chr>    <chr>       <chr>
#1     1 "aaa" aaa aaaa aaa.bb      NA   
#2     2 "aaa" aaa aaa  aaa bbb ccc aaa  
#3     3 ""    NA       NA          NA   
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • I Ronak, thanks for the reply. I just edit my question. The problem is that i can have more strings in each observation (forgot that detail...sory). I don't think it will be very simple to reproduce the solution for cases where I have 20 strings. I would have to manually create the 20 columns. And i still do not understand the code... :( Where can i find some good examples and explanations? whitespace = '\\[|\\]') ?? sep = ',\\s*' ?? – RitaM Dec 10 '21 at 03:20
  • @RitaM - this doesn't manually create each string. The first bit with the multiple `Col2 = ...` strings is just overwriting the same `Col2` column multiple times to clean it up. This will work with 4 columns or 200 columns. – thelatemail Dec 10 '21 at 04:00
  • I am very sorry for the mistake. :( Your answer is absolutely right!! can you help me and indicate some links where I can understand the rules applied in the code? I mean whitespace = '\\[|\\]' p.e or sep = ',\\s*'. I am not finding clear information on the subject. Thank you very much and once again my apologies – RitaM Dec 10 '21 at 14:33
0

You've essentially got quoted strings 'text text' wrapped in []. So, strip out the []'s and read it like you would a normal text file with quoted strings:

cbind(
  df["Id"],
  read.table(text=gsub("[][]|\n", "", df$Col2), sep=" ", quote="'",
             header=FALSE, fill=TRUE, blank.lines.skip=FALSE, na.strings="")
)
#  Id   V1       V2          V3   V4
#1  1  aaa aaa aaaa      aaa.bb <NA>
#2  2  aaa  aaa aaa aaa bbb ccc  aaa
#3  3 <NA>     <NA>        <NA> <NA>   
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • I have no idea what the reason is, but for some reason when I do read.table it only creates 7 columns and that's why in some cases (detected for example in a string with 8 quotes) it creates a new line just with this quote. As a result it returns the error "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 2532, 3178" – RitaM Dec 10 '21 at 14:24
  • @RitaM - I'm not sure why this would be the case. Possibly because there are multiple line-breaks with spaces between them or something else not shown here. I can only work with the example data shown - I'd be happy to have another attempt if you can show an example of a couple of rows where this logic breaks. – thelatemail Dec 12 '21 at 22:21