
I have a large data frame with thousands of rows imported from a .csv file. Although the text inside is comma-separated, R reads the import as a single column (my guess is that this is due to its complicated, irregular structure). I want to search every row for any string that starts with the @ symbol (such as @marine or @tested) and put all of them into another column. Unfortunately, the rows have different lengths.

Here's what I have (a two-row example):

"254,""CC4qoAPgs0p"",""_ohc=l5OM-bXL0z4AX_eH6id&oh=246b0f63e5f90a14f28e0f9e40989372&oe=5F402F36"",""8"",""26793924834"",""How do you relax at night after a long day working? #doterra #doterraessentialoils @aromatherapy #essentialoils @terra #healthandwellness @terra @doterraoils2 #vegan #healthy #stressfree, 254a

"255,""DC4qDVhJRYH"",""_ohc=52ns_Li8iNQAX9wNlw6&oh=5c6b7f2193799aa6755b67ea6acec857&oe=5F41C4CA"",""12"",""37345461877""," "<U+0001F4F2> https://wa.me/60169573359  Anis Nadzirah Shaklee Independent Distributor Kuala @Berang  @shaklee %shaklee%lover, 255a

I would like to have something like this:

number       tags
254         @aromatherapy
            @terra
            @terra
            @doterraoils2

255         @Berang
            @shaklee

I tried to do this with the data.table package:

library(data.table)
section <-  df[rownames(a) %like% "@", ]

but got rather strange results: out of 10K rows, it returned only 27. Can somebody help me with this? Thank you in advance.

kshtwork

1 Answer


The quotes get in the way and need to be removed. You then need a regex to extract the terms that start with @. This should get you on your way: I use readLines to read the data and stringr::str_replace_all to strip all the quotes. The first lapply extracts the words starting with @, and the second lapply extracts the number. The results then need to be combined into a data frame. This can probably be simplified further.

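library(stringr)

# Read the raw lines and strip every double quote
data <- readLines("data_with_quotes.csv")
data <- str_replace_all(string = data, pattern = "\"", replacement  = "")

# One character vector of @-terms per line (lines may hold different numbers of tags)
l <- lapply(data, FUN = function(x) str_extract_all(x, "(?<=^|\\s)@[^\\s]+")[[1]])
# The leading digits of each line are the row number
h <- lapply(data, FUN = function(x) str_extract(x, "^\\d+"))

# Long format: repeat each number once per tag found on its line
df <- data.frame(number = rep(unlist(h), lengths(l)), tags = unlist(l))

With this output:

> df
  number          tags
1    254 @aromatherapy
2    254        @terra
3    254        @terra
4    254 @doterraoils2
5    255       @Berang
6    255      @shaklee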
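
The same extraction can also be sketched without stringr, using base R's gregexpr and regmatches. This is only an illustration under a few assumptions: the two sample lines are shortened versions of the rows in the question, the variable names (raw, tags, number, result) are made up for the example, and the lookbehind (?<!\\S) is swapped in as an equivalent of "start of line or whitespace". One row is produced per tag, so lines with different numbers of tags are not a problem.

```r
# Shortened, illustrative versions of the two rows from the question
raw <- c(
  "\"254,\"\"CC4qoAPgs0p\"\",\"\"8\"\",How do you relax? #doterra @aromatherapy #essentialoils @terra #healthandwellness @terra @doterraoils2 #vegan, 254a",
  "\"255,\"\"DC4qDVhJRYH\"\",\"\"12\"\",Independent Distributor Kuala @Berang  @shaklee %shaklee%lover, 255a"
)

# Strip every double quote
raw <- gsub("\"", "", raw, fixed = TRUE)

# @-terms that are not preceded by a non-whitespace character,
# i.e. they sit at the start of the line or after whitespace
tags <- regmatches(raw, gregexpr("(?<!\\S)@\\S+", raw, perl = TRUE))

# Everything before the first comma is taken as the row number
number <- sub(",.*", "", raw)

# Long format: repeat each number once per tag found on its line
result <- data.frame(
  number = rep(number, lengths(tags)),
  tags = unlist(tags),
  stringsAsFactors = FALSE
)
print(result)
```

Lines without any @-term simply contribute no rows. For a real file, raw would come from readLines instead of the inline vector.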
Paul van Oppen