I have just started working in R. I have a CSV file with a 'description' column, whose text is generated from a form with the questions 'Name', 'City' and 'Interests'.

Example of 'description': "Name: XYZ, City: New York, Interests: I play the guitar, and spend my weekends playing badminton and tennis."

I need to parse this text into three columns, 'Name', 'City' and 'Description', holding 'XYZ', 'New York' and 'I play the guitar...' respectively.

Is this possible with R, and how do I proceed?

Sotos

1 Answer

In base R, which is simple enough for your case (really the same logic as the tidyverse version further down, just more verbose):

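# Read the sample line as a one-row CSV (the Interests text is quoted because it contains a comma)
raw <- read.csv(textConnection("Name: XYZ, City: New York, Interests: \"I play the guitar, and spend my weekends playing badminton and tennis.\""), col.names = c("Name", "City", "Interests"), header = FALSE)
# Strip the "Name:", "City:" and "Interests:" labels from every cell
raw <- gsub(pattern = "(Name|City|Interests):", replacement = "", x = as.matrix(raw))
# Trim surrounding whitespace and convert back to a data frame
final <- data.frame(trimws(raw))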

However, check how the text strings are formatted in your source file first. In your example, the unquoted comma in "I play the guitar,..." splits the CSV row into four columns instead of three.
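
If the file is already loaded and each row holds the whole text in one 'description' column, as the question describes, the three parts can also be captured with a single regular expression, so the comma inside the interests text is no longer a problem. This is only a minimal sketch: the data frame `df`, its second row and the exact "Name: ..., City: ..., Interests: ..." layout are assumptions based on the example in the question, not your real data.

```r
# Hypothetical sample standing in for the 'description' column of the real CSV
df <- data.frame(
  description = c(
    "Name: XYZ, City: New York, Interests: I play the guitar, and spend my weekends playing badminton and tennis.",
    "Name: ABC, City: Boston, Interests: Reading, hiking"
  ),
  stringsAsFactors = FALSE
)

# Anchor on the three labels; the lazy groups stop at the next label,
# so commas inside the Interests text stay in the last group
pattern <- "^Name:\\s*(.*?),\\s*City:\\s*(.*?),\\s*Interests:\\s*(.*)$"
matches <- regmatches(df$description, regexec(pattern, df$description, perl = TRUE))

# Each match holds the full string plus three groups; rows that do not match become NA
parsed <- do.call(rbind, lapply(
  matches,
  function(m) if (length(m) == 4) m[-1] else rep(NA_character_, 3)
))

final <- data.frame(
  Name = parsed[, 1],
  City = parsed[, 2],
  Interests = parsed[, 3],
  stringsAsFactors = FALSE
)
```

With real data you would replace `df` with the result of `read.csv()` on your file and check for `NA` rows, which would indicate descriptions that do not follow the assumed layout.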

Or with the tidyverse:

raw <- read.csv(textConnection("Name: XYZ, City: New York, Interests: \"I play the guitar, and spend my weekends playing badminton and tennis.\""), col.names = c("Name", "City", "Interests"), header = FALSE)
library(tidyr)
library(dplyr)  # library() attaches one package per call; select() comes from dplyr
# Split each column at the colon (plus any spaces after it), then drop the label halves
final <- raw %>%
   separate(Name, c("Descriptor1", "Name"), sep = ":\\s*") %>%
   separate(City, c("Descriptor2", "City"), sep = ":\\s*") %>%
   separate(Interests, c("Descriptor3", "Interests"), sep = ":\\s*") %>%
   select(-contains("Descriptor"))
Radim