Search for and extract words into new column

Question

I have a dataset with 63,000 rows in R. One of the columns contained a list of words in the format '"Fireplace", "Garage", "One story with balcony", "Off street parking",' etc. They are property characteristics listed from sale websites.

I want to extract words from this column and create a new column that has a '0' or a '1' if the word is present or not (creating a dummy variable for regression). Once that has been completed, I want to be able to merge some of those columns together (ie take 'parking' 'Parking' 'garage' 'Garage' columns and merge them into one that includes all parking and garages). I'm assuming that R is sensitive to upper and lowercase characters but even if not, I need to be able to merge 'parking' and 'garage' together, for example.

This is for a hedonic pricing method so I need as many property characteristic variables as possible.

I don't know how to create new dummy variables or merge them into one column once I have, so am struggling. Would appreciate any help.

Please provide a small sample of data to illustrate the problem, `dput(your_data[1:5, "name_of_relevant_column", drop = FALSE])` would be perfect for the first 5 rows. Also share a short list of sample relevant keywords to turn into dummies, including some "merge" examples. — Gregor Thomas, Jul 12 '22 at 16:57
@Julien When a new user says they have 63,000 rows of data, I'd strongly recommend **not** asking for `dput(your_df)`. Show how to subset the data and specifically ask for a small sample. — Gregor Thomas, Jul 12 '22 at 16:59
c("[\"Stunning seaside location\", \"24-hour emergency call system and secure video entry\", \"Mature landscaped gardens with large terraces and seating areas\", \"Walk out balconies to selected apartments\", \"Beautifully decorated homeownersâ\200\231 lounge\", \"Parking spaces and car ports are available via an annual permit\", \"Wheelchair access\", \"Lifts to all floors\", \"Fire detection\", \"Intruder alarm\"]", "[\"Village Location, 4 Bedrooms, Garden(s)\"]", "[\"Balcony\", \"On street/residents parking\", \"Central heating\", \"Double glazing\", \"Fireplace\", \"Rural/secluded\"] — Lucy Kingwell-Meader, Jul 12 '22 at 17:24
Keywords to turn into dummies: 'Parking' 'Garage' 'Garden' 'Freehold' 'Fireplace' 'Balcony'. Wanting to merge 'Parking' and 'Garage' into one. Thank you! — Lucy Kingwell-Meader, Jul 12 '22 at 17:26
Basically, in terms of merging, I'm aware that some characteristics will be down as plurals, capitalised etc etc so want to make sure I'm getting all of those into one dummy variable. — Lucy Kingwell-Meader, Jul 12 '22 at 17:28

Chris Ruehlemann · Answer 1 · 2022-07-12T20:08:55.940

Is this what you are looking for?

library(tidyverse)
data.frame(txt) %>%
  # tidy up `txt`:
  mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
  # split into rows
  separate_rows(txt, sep = ",") %>%
  # extract keywords matched:
  mutate(keywords = str_extract(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony"))
# A tibble: 19 × 2
   txt                                                                keywords 
   <chr>                                                              <chr>    
 1 "Stunning seaside location"                                        NA       
 2 " 24hour emergency call system and secure video entry"             NA       
 3 " Mature landscaped gardens with large terraces and seating areas" garden   
 4 " Walk out balconies to selected apartments"                       NA       
 5 " Beautifully decorated homeowners8099 lounge"                     NA       
 6 " Parking spaces and car ports are available via an annual permit" Parking  
 7 " Wheelchair access"                                               NA       
 8 " Lifts to all floors"                                             NA       
 9 " Fire detection"                                                  NA       
10 " Intruder alarm"                                                  NA       
11 " Village Location"                                                NA       
12 " 4 Bedrooms"                                                      NA       
13 " Gardens"                                                         Garden   
14 " Balcony"                                                         Balcony  
15 " On streetresidents parking"                                      parking  
16 " Central heating"                                                 NA       
17 " Double glazing"                                                  NA       
18 " Fireplace"                                                       Fireplace
19 " Ruralsecluded"                                                   NA

Data:

txt <- '"[\"Stunning seaside location\", \"24-hour emergency call system and secure video entry\", \"Mature landscaped gardens with large terraces and seating areas\", \"Walk out balconies to selected apartments\", \"Beautifully decorated homeownersâ\200\231 lounge\", \"Parking spaces and car ports are available via an annual permit\", \"Wheelchair access\", \"Lifts to all floors\", \"Fire detection\", \"Intruder alarm\"]", "[\"Village Location, 4 Bedrooms, Garden(s)\"]", "[\"Balcony\", \"On street/residents parking\", \"Central heating\", \"Double glazing\", \"Fireplace\", \"Rural/secluded\"]"'

If there may be more than 1 keyword per substring, then use str_extract_all in this way:

data.frame(txt) %>%
  mutate(txt = gsub("(?![, ])\\W", "", txt, perl = TRUE)) %>%
  separate_rows(txt, sep = ",") %>%
  mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
  unnest(where(is.list), keep_empty = TRUE)

EDIT:

If the OP is looking to obtain a variable for each keyword, then this works:

data.frame(txt) %>%
  mutate(txt = gsub("(?![, /])\\W", "", txt, perl = TRUE)) %>%
  separate_rows(txt, sep = ", ") %>%
  mutate(keywords = str_extract_all(txt, "(?i)Parking|Garage|Garden|Freehold|Fireplace|Balcony")) %>%
  # unnest listed items:
  unnest(where(is.list), keep_empty = TRUE) %>%
  # capitalize initial letter:
  mutate(keywords = sub("^(.)", "\\U\\1", keywords, perl = TRUE)) %>%
  # cast each keaword into its own column:
  pivot_wider(names_from = keywords, values_from = keywords, 
              values_fn = function(x) 1, values_fill = 0) 
# A tibble: 19 × 6
   txt                                                              `NA` Garden Parking Balcony Fireplace
   <chr>                                                           <dbl>  <dbl>   <dbl>   <dbl>     <dbl>
 1 Stunning seaside location                                           1      0       0       0         0
 2 24hour emergency call system and secure video entry                 1      0       0       0         0
 3 Mature landscaped gardens with large terraces and seating areas     0      1       0       0         0
 4 Walk out balconies to selected apartments                           1      0       0       0         0
 5 Beautifully decorated homeowners8099 lounge                         1      0       0       0         0
 6 Parking spaces and car ports are available via an annual permit     0      0       1       0         0
 7 Wheelchair access                                                   1      0       0       0         0
 8 Lifts to all floors                                                 1      0       0       0         0
 9 Fire detection                                                      1      0       0       0         0
10 Intruder alarm                                                      1      0       0       0         0
11 Village Location                                                    1      0       0       0         0
12 4 Bedrooms                                                          1      0       0       0         0
13 Gardens                                                             0      1       0       0         0
14 Balcony                                                             0      0       0       1         0
15 On street/residents parking                                         0      0       1       0         0
16 Central heating                                                     1      0       0       0         0
17 Double glazing                                                      1      0       0       0         0
18 Fireplace                                                           0      0       0       0         1
19 Rural/secluded                                                      1      0       0       0         0

does that create dummy variables or just separate out the words in the string? — Lucy Kingwell-Meader, Jul 12 '22 at 18:47
as you can see from the ouput, it extracts all the matched keywords. Why - do you want dummy vars for each keyword? — Chris Ruehlemann, Jul 12 '22 at 19:50
Yeah, sorry if I didn't make that clear - I'm looking for dummy variables for each keyword, exactly. Do you know how to do that? — Lucy Kingwell-Meader, Jul 12 '22 at 20:04
Chris, thank you so much for your edit! I tried it and came into some problems because it can only seem to deal with 1,000 rows at a time and missed out 62,000 rows. Any chance of another method that might do the same thing but do it for all of the data? It was after running data.frame(my_data) that it came into problems. When trying to run the rest of it to see what happens, it said that there was an unexpected '=' in the mutate code line — Lucy Kingwell-Meader, Jul 13 '22 at 13:26
Please post some of your data using `dput(head(YOURDATA))` in reproducible format inside the Question. Otherwise I won't be able to help you — Chris Ruehlemann, Jul 13 '22 at 16:26

Search for and extract words into new column

1 Answers1

Linked