6

I have a very messy data frame, with one column with values that are understandable to humans but not to computers, a bit like the one below.

df<-data.frame("id"=c(1:10), 
           "colour"=c("re d", ", red", "re-d","green", "gre, en", ", gre-en",  "blu e", "green", ", blue", "bl ue"))

I can filter the df with str_detect

df %>% filter(str_detect(tolower(colour), pattern = "gr")) 

But I want to rename all the filtered results to the same value so I can wrangle it.
Any suggestions?
I tried to separate with pattern but was unsuccessful.

EDIT: Not all . and spaces are unnecessary in the df that I am working with. Lets say that the correct way of writing green in the made up df is "gr. een".

EDIT2:
Wanted result with faked spelling of colours just to get an idea:

id     colour
1      r. ed
2      r. ed
3      r. ed
4      gr. een
6      gr. een
7      gr. een
8      blu. e
9      gr. een           
10     blu. e
Mactilda
  • 393
  • 6
  • 18
  • 3
    Why not (pre-) process the `colour` column and updated it to something that makes sense before you filter? Try this `df$colour = gsub("[^[:alnum:] ]", "", df$colour); df$colour = gsub(" ", "", df$colour)`, which will remove any non alphanumeric values and spaces. – AntoniosK Dec 21 '18 at 11:19
  • Or `df %>% mutate(colour2 = gsub("[^a-zA-Z]", "", colour))` – talat Dec 21 '18 at 11:21
  • ...also note that (a) for this example you don't need `tolower`, (b) the pattern you use won't pick a value like `g-reen`, or `g reen`. That's why I'd recommend pre-processing your column before you filter. – AntoniosK Dec 21 '18 at 11:23
  • @AntoniosK In the df that I am using not all non alphanumeric values and spaces are unnecessary. Unfortunately. – Mactilda Dec 21 '18 at 11:26
  • Then I guess you can pre-process the data as much as you can. Concerning your question above, what value do you want to assign to your filtered results? `gr. een`? – AntoniosK Dec 21 '18 at 11:35
  • @AntoniosK Yes. – Mactilda Dec 21 '18 at 11:54
  • I think `mgsub` is what you're looking for. I've edited my answer below. – AntoniosK Dec 21 '18 at 12:45

3 Answers3

4

You can use mgsub to replace multiple words with multiple patterns:

df<-data.frame("id"=c(1:10), 
               "colour"=c("re d", ", red", "re-d","green", "gre, en", 
                          ", gre-en",  "blu e", "green", ", blue", "bl ue"))

library(textclean)

df$colour = mgsub(df$colour, 
                  pattern =  c(".*gr.*", ".*re.*", ".*bl.*"), 
                  replacement =  c("gr. een", "r. ed", "blu. e"), fixed = F)

df

#     id  colour
# 1   1   r. ed
# 2   2   r. ed
# 3   3   r. ed
# 4   4 gr. een
# 5   5 gr. een
# 6   6 gr. een
# 7   7  blu. e
# 8   8 gr. een
# 9   9  blu. e
# 10 10  blu. e
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
  • Thank you. I don't seem to get the `FilterAndUpdate`code to work (I changed d to df - as I thought that was the problem but it still didn't work). As I think you understand I need to save the changes so that I can go back to original df and do it with several combinations and then later plot it out. – Mactilda Dec 21 '18 at 12:11
  • No, that wasn't clear as you mentioned filtering. If you need to update your column based on some patterns you don't need to filter anything. You just need to update your column in your original dataset using a function that uses a vector with multiple values as pattern and a vector with multiple replacements. – AntoniosK Dec 21 '18 at 12:15
  • In the example you posted it looks like you want to update ONLY your filtered results and not your original dataset. You have to post your expected output to avoid any confusion. – AntoniosK Dec 21 '18 at 12:24
  • Thank you for helping me, I have edited the question with a table of what I want. If there is any other changes that can be made to clarify, please don't hesitate to tell me. – Mactilda Dec 21 '18 at 12:35
  • Yey! Thank you. – Mactilda Dec 21 '18 at 13:10
3

Here are two solution for pre-processing your data, one is given in the comments already:

library(dplyr)
df %>% 
  mutate(colour2 = gsub("[^A-z]", "", colour))%>%
  filter(str_detect(tolower(colour2), pattern = "green")) 

Making the inverse thinking, you can use stringr to extract the letters

library(stringr)

df %>% 
  mutate(colour2 = sapply(str_extract_all(df$colour,"[A-z]"),function(vec){paste0(vec,collapse = "")}))%>%
  filter(str_detect(tolower(colour2), pattern = "green")) 

Your selection will be more robust, and the column already renamed.

  id   colour colour2
1  4    green   green
2  5  gre, en   green
3  6 , gre-en   green
4  8    green   green
denis
  • 5,580
  • 1
  • 13
  • 40
  • Thank you. I need to keep some spaces and . Let's pretend that green is supposed to be written Gr. een – Mactilda Dec 21 '18 at 12:13
2

If you just want to rename all of the filtered results, how about:

df<-data.frame("id"=c(1:10), 
               "colour"=c("re d", ", red", "re-d","green", "gre, en", ", gre-en",  "blu e", "green", ", blue", "bl ue"))

library(stringr)                                                         
df[str_detect(tolower(df[,"colour"]), pattern = "gr"), "colour"] <- "green"
ira
  • 2,542
  • 2
  • 22
  • 36