I have a table as follows:

library(data.table)

yel <- data.table(
  id = c(1, 2, 3),
  names = c("\"parking space\", \"dining\", \"3bh\"",
            "\"parking\" , \"outdoor\"",
            "\"Hello!\",\"dining room\",\"3bh\"")
)
yel

   id                            names
1:  1 "parking space", "dining", "3bh"
2:  2            "parking" , "outdoor"
3:  3     "Hello!","dining room","3bh"

I want to dummify the names variable and merge similar entries, so that "parking space" is treated as "parking" and "dining room" as "dining".

The resulting dummy variables should be: parking, dining, 3bh, outdoor, hello. Is there a method that does this?

oguz ismail
Manish Ranjan
  • The least well-defined bit seems to be *"join the same words like parking space with parking and also dining room with dining"* - with `parkingspace` and `diningroom` as the results. Can you articulate rules more exactly? Can we generalize that if there is a 2 word phrase, any entry that matches the first word should also get the second word, and then the space should be removed? Are there ever cases where the second word is different? What would happen if there was both `"parking space"` and `"parking lot"`? – Gregor Thomas Feb 20 '17 at 23:53
  • @Gregor sorry for not being clear. Let me rephrase it: join similar words like "parking space" and "parking lot" into "parking". Would this help? – Manish Ranjan Feb 21 '17 at 00:07
  • If the data are this simple, then you can just strip away everything after the first word. Maybe something like `library(splitstackshape); dcast(cSplit(yel, "names", ",", "long")[, names := gsub('\\"| .*', "", names)], id ~ names, fun.aggregate = length)`? – A5C1D2H2I1M1N2O1R2T1 Feb 21 '17 at 01:59
  • @A5C1D2H2I1M1N2O1R2T1 most of the data is that simple, except for a few anomalies such as "roof top", "roof deck top", and "roofdeck top". – Manish Ranjan Feb 21 '17 at 04:09
  • @ManishRanjan, then maybe you need to look at `agrep` or something along those lines. You should start with a list of words that you want to use as the dummies, and probably do some preliminary data cleaning to make the task easier. – A5C1D2H2I1M1N2O1R2T1 Feb 21 '17 at 04:47
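
The suggestion in the last comment, starting from a fixed list of dummy words, could look roughly like the base R sketch below. The `keywords` vector is an assumption (the five dummies named in the question), and plain case-insensitive substring matching with `grepl` stands in for whatever matching rule the real data needs.

```r
# Sample data from the question, rebuilt as a plain data.frame
yel <- data.frame(
  id = c(1, 2, 3),
  names = c("\"parking space\", \"dining\", \"3bh\"",
            "\"parking\" , \"outdoor\"",
            "\"Hello!\",\"dining room\",\"3bh\""),
  stringsAsFactors = FALSE
)

# Assumed keyword list: one entry per desired dummy column
keywords <- c("parking", "dining", "3bh", "outdoor", "hello")

# For each keyword, flag the rows whose names string contains it
# (case-insensitive substring match, so "parking space" counts as "parking")
dummies <- sapply(keywords, function(k) {
  as.integer(grepl(k, yel$names, ignore.case = TRUE))
})

# Bind the id column to the 0/1 matrix; check.names = FALSE keeps "3bh" as-is
result <- data.frame(id = yel$id, dummies, check.names = FALSE)
result
```

Because the match is on substrings, adding "roof" to `keywords` would cover "roof top", "roof deck top", and "roofdeck top" alike. If misspellings also need to be caught, `agrepl` could be swapped in for `grepl`, at the cost of possible false matches.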

1 Answer

How about this? The regex may still need some tweaking, since it does not look general enough yet. Using tidyr and dplyr:

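library(tidyr); library(dplyr)
# One row per comma-separated entry, then strip quotes, "!", spaces, " space", and "room"
df1 <- separate_rows(yel, names, sep = ",")
df1 %>% mutate(newnames = gsub('\\"| space|\\!| |room', "", names))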

  id           names newnames
1  1 "parking space"  parking
2  1        "dining"   dining
3  1           "3bh"      3bh
4  2      "parking"   parking
5  2       "outdoor"  outdoor
6  3        "Hello!"    Hello
7  3   "dining room"   dining
8  3           "3bh"      3bh
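
The output above stops at the cleaned names, so one more step is needed to get actual dummy columns. A minimal base R sketch of that step, assuming the cleaned long table shown above (rebuilt by hand here so the snippet runs on its own) and assuming lower-casing is wanted so that "Hello" becomes the requested "hello":

```r
# Long table with the cleaned names from the output above
df1 <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3),
  newnames = c("parking", "dining", "3bh", "parking", "outdoor",
               "Hello", "dining", "3bh"),
  stringsAsFactors = FALSE
)

# Lower-case so "Hello" ends up in a "hello" column
df1$newnames <- tolower(df1$newnames)

# Cross-tabulate id against cleaned name, giving a plain count matrix
tab <- unclass(table(df1$id, df1$newnames))

# Cap counts at 1 in case the same name appears twice for one id
tab[tab > 1] <- 1L

# One row per id, one 0/1 column per cleaned name
dummies <- data.frame(id = as.numeric(rownames(tab)), tab,
                      check.names = FALSE, row.names = NULL)
dummies
```

Each row is one id and each remaining column is a cleaned name, holding 1 where that id had the name and 0 otherwise.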
thisisrg