Text mining - separating into columns based on keywords

Question

I have just started working in R. My requirement is as follows: I've got a csv file with a 'description' column (text generated from forms with questions 'Name', 'City' and 'Interests').

Example of 'description': "Name: XYZ, City: New York, Interests: I play the guitar, and spend my weekends playing badminton and tennis."

I need to parse the text into 'XYZ', 'New York' and 'I play the guitar....' as 3 columns: 'Name', 'City' and 'Interests'.

Is this possible with R, and how do I proceed?

`read.dcf(textConnection(gsub("(Interests|City)", "\n\\1", x)))` where "x" is your string? — A5C1D2H2I1M1N2O1R2T1, Feb 01 '18 at 12:34
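The one-liner in the comment above can be expanded into a small sketch. It relies on the fact that `read.dcf` parses "tag: value" lines, so each keyword only has to be moved onto its own line. This assumes a single description string `x` and that ", City:" and ", Interests:" never appear inside the values; the names `dcf_text` and `fields` are illustrative, and the pattern is slightly changed from the comment so that the comma before each keyword is dropped.

```r
x <- "Name: XYZ, City: New York, Interests: I play the guitar, and spend my weekends playing badminton and tennis."

# Put each keyword on its own line and drop the comma that precedes it,
# so the text looks like a DCF record ("tag: value" per line).
dcf_text <- gsub(",\\s*(City|Interests):", "\n\\1:", x)

# read.dcf returns a character matrix with one column per tag.
con <- textConnection(dcf_text)
fields <- as.data.frame(read.dcf(con), stringsAsFactors = FALSE)
close(con)

fields$Name
fields$City
fields$Interests
```

Commas inside the Interests text are left alone, because only the commas directly in front of a keyword are replaced.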

Radim · Answer 1 · 2018-02-01T14:11:58.197

In base R, which is simple enough for your case (same logic as the comment above, really, just more verbose):

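# Read the sample line; the quoted Interests text keeps its comma inside one field
raw <- read.csv(textConnection("Name: XYZ, City: New York, Interests: \"I play the guitar, and spend my weekends playing badminton and tennis.\""), col.names = c("Name", "City", "Interests"), header = FALSE)
# Strip the "Name:", "City:" and "Interests:" labels from every cell
raw <- gsub(pattern = "(Name\\:|City\\:|Interests\\:)", replacement = "", x = as.matrix(raw))
# Trim leftover whitespace and turn the matrix back into a data frame
final <- data.frame(trimws(raw))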

However, check how the text strings are formatted in your source file first. In your example, the unquoted comma in "I play the guitar,..." would split the line into four columns, which is why the Interests text is wrapped in quotes in the code above.
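One way to sidestep the comma problem is to skip CSV splitting altogether and capture the three fields by their keywords with a regular expression. A minimal sketch, assuming every description follows the "Name: ..., City: ..., Interests: ..." order and that the labels never occur inside the values. The vector `desc` (including its second, made-up record) and the name `parsed` are hypothetical; in practice `desc` would be the 'description' column of your file.

```r
# Hypothetical input standing in for the 'description' column.
desc <- c(
  "Name: XYZ, City: New York, Interests: I play the guitar, and spend my weekends playing badminton and tennis.",
  "Name: ABC, City: Boston, Interests: Reading, hiking"
)

# One capture group per keyword. The lazy groups stop at the next keyword,
# so commas inside the Interests text do not matter.
pattern <- "^Name:\\s*(.*?),\\s*City:\\s*(.*?),\\s*Interests:\\s*(.*)$"

# strcapture() (utils, R >= 3.4.0) fills one column of `proto` per capture group.
parsed <- strcapture(
  pattern, desc,
  proto = data.frame(Name = character(), City = character(),
                     Interests = character(), stringsAsFactors = FALSE),
  perl = TRUE
)

parsed
```

Rows that do not match the pattern come back as `NA`, which makes malformed descriptions easy to spot.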

Or tidyverse:

raw <- read.csv(textConnection("Name: XYZ, City: New York, Interests: \"I play the guitar, and spend my weekends playing badminton and tennis.\""), col.names = c("Name", "City", "Interests"), header = FALSE)
# Load each package separately; library(tidyr, dplyr) would attach only tidyr
library(tidyr)
library(dplyr)
# Split each column at the colon into a label and a value, then drop the labels
final <- raw %>%
   separate(Name, c("Descriptor1", "Name"), sep = "\\:") %>%
   separate(City, c("Descriptor2", "City"), sep = "\\:") %>%
   separate(Interests, c("Descriptor3", "Interests"), sep = "\\:") %>%
   select(-contains("Descriptor"))
