
I have a column in a data frame that is not well defined. For example, I have:

 > mydata$column1<-c("abcAppledef","gsApple123hilhj","stBananaewfs","sfesBanana123sfeft",
"stwefPearsfet","stwfePearabcseft","wefCarwefeEef","wefwaCarWFEe","wefaCarEFWefe")

I would like to redefine the column by replacing each string that matches a wildcard pattern with the keyword it contains, so that the outcome looks something like:

> mydata$column1<-c("Apple","Apple","Banana","Banana","Pear","Pear","Car","Car","Car")

I am using

> mydata$column1<-gsub('.*Apple.*','Apple',mydata$column1)

> mydata$column1<-gsub('.*Banana.*','Banana',mydata$column1)

> mydata$column1<-gsub('.*Pear.*','Pear',mydata$column1)

> mydata$column1<-gsub('.*Car.*','Car',mydata$column1)

But I have many different kinds of patterns, and I would need to apply this on multiple tables as well. Is there a more efficient way to do this? Maybe a lookup table?

Thanks.

IvyZ
  • Do you have the complete list of target strings at the start? E.g., `targets = c("Apple", "Banana", "Pear")`? If so, I would just overwrite each string that contains a target with that target: `for (fruit in targets) {x[grep(fruit, x)] <- fruit}`. – Gregor Thomas Jun 30 '17 at 16:02

1 Answer


Using the approach from "Remove numbers from alphanumeric characters":

x <- gsub('[0-9]+', '', x)  # strip all digits
x <- gsub('st', '', x)      # strip the literal substring "st"

I am afraid you would need to clean it with a series of gsub calls to get 100% clean data.

Ajay Ohri
  • Thanks. But 'st' was just an example; the strings don't really follow any pattern. I'll modify my question. – IvyZ Jun 30 '17 at 16:03
  • You could invert the question: instead of looking for patterns to remove, you could search each string for one of a few known values (the fruits). – Ajay Ohri Jun 30 '17 at 16:06
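
The "invert the question" idea from the comments can be sketched roughly as below. This is only a minimal illustration, not a tested solution for the real data: the helper name `relabel_by_target` and the standalone vectors `targets` and `column1` are made up for the example, and it assumes the full list of target strings is known up front and that matching should be literal and case-sensitive.

```r
# Hypothetical helper (not from the original post): overwrite every string
# that contains a target with the target itself.
relabel_by_target <- function(x, targets) {
  for (target in targets) {
    # fixed = TRUE matches the target as a literal substring, not a regex
    x[grepl(target, x, fixed = TRUE)] <- target
  }
  x
}

# Assumed lookup vector of known targets
targets <- c("Apple", "Banana", "Pear", "Car")

# Sample values copied from the question
column1 <- c("abcAppledef", "gsApple123hilhj", "stBananaewfs", "sfesBanana123sfeft",
             "stwefPearsfet", "stwfePearabcseft", "wefCarwefeEef", "wefwaCarWFEe",
             "wefaCarEFWefe")

cleaned <- relabel_by_target(column1, targets)
print(cleaned)
```

With a data frame this would be applied as `mydata$column1 <- relabel_by_target(mydata$column1, targets)`, and the same `targets` vector could be reused for each table. Strings that contain none of the targets are left unchanged, and if a string contained more than one target, the one that comes last in `targets` would win, so the order of the lookup vector matters in that case.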