
I have a table containing misspelled words. I need to correct each misspelling by replacing it with the most similar word, choosing the one with the highest frequency.

For example, after I run

aggregate(CustomerID ~ Province, ventas2, length)

I get

        Province CustomerID
1
2           AMBA         29
3         BAIRES          1
4    BENOS AIRES          1

12  BUENAS AIRES          1

17 BUENOS  AIRES          4
18  buenos aires          7
19  Buenos Aires          3
20  BUENOS AIRES      11337
35       CORDOBA       2297
36       cordoba          1
38     CORDOBESA          1
39    CORRIENTES        424

So I need to replace buenos aires, Buenos Aires, BAIRES, BENOS AIRES, etc. with BUENOS AIRES, but AMBA shouldn't be replaced. Likewise, CORDOBESA and cordoba should be replaced by CORDOBA, but CORRIENTES should not.

How can I do this in R?

Thanks!

GabyLP
  • You can measure the string distances using, for example, the `adist` function (or other methods in the stringdist package), group the strings that are most similar (using a threshold on the distance), and then replace all the strings in each group with the group's most frequent string. The problem is that this approach is heavily influenced by the distance calculation and the threshold chosen. For example, note that for probably all string-distance methods, AMBA is more similar to BAIRES than to BUENOS AIRES (see the sketch after this thread)... – digEmAll Sep 09 '14 at 20:37
  • Could you please post the code to do so? – GabyLP Sep 09 '14 at 20:39
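
For what it's worth, here's an editorial illustration of digEmAll's point about distance thresholds, using base R's adist (the same function used in the answer below):

adist("AMBA", c("BAIRES", "BUENOS AIRES"))
#      [,1] [,2]
# [1,]    5   10
# AMBA is only 5 edits from BAIRES but 10 from BUENOS AIRES, so a
# threshold loose enough to merge the misspellings could also merge AMBA.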

2 Answers


Here's a possible solution.

Disclaimer:
This code seems to work fine with your current example. I can't guarantee that the current parameters (e.g. cut height, cluster agglomeration method, distance method, etc.) will remain valid for your real (complete) data.

# recreating your data
data <- 
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS  AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=F)


# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))

# string distance computation
# N.B. here you can play with single components of distance costs 
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')

# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters) 
# N.B. I decided to cut at distance height < 5
#      (read it as: "I consider equal 2 strings needing
#       less than 5 modifications to pass from one to the other")
#      Obviously you can use another value.
rect.hclust(hc,h=4.9)

# get the clusters ids
clusters <- cutree(hc,h=4.9) 
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)

# merge with frequencies
merged <- merge(data,clusters,all.x=T,by='City') 

# add CityCorrected column to the merged data.frame
ret <- by(merged, 
          merged$ClusterId,
          FUN=function(grp){
                idx <- which.max(grp$Occurr)  # exact column name; $ partial matching is fragile
                grp$CityCorrected <- grp[idx,'City']
                return(grp)
              })

fixed <- do.call(rbind,ret)

Result :

> fixed
              City Occurr ClusterId CityCorrected
1             AMBA     29         1          AMBA
2.2         BAIRES      1         2  BUENOS AIRES
2.3    BENOS AIRES      1         2  BUENOS AIRES
2.4   BUENAS AIRES      1         2  BUENOS AIRES
2.5  BUENOS  AIRES      4         2  BUENOS AIRES
2.6   buenos aires      7         2  BUENOS AIRES
2.7   Buenos Aires      3         2  BUENOS AIRES
2.8   BUENOS AIRES  11337         2  BUENOS AIRES
3.9        cordoba      1         3       CORDOBA
3.10       CORDOBA   2297         3       CORDOBA
3.11     CORDOBESA      1         3       CORDOBA
4       CORRIENTES    424         4    CORRIENTES

Cluster Plot:

[cluster dendrogram of the city names, with rectangles marking the four clusters]
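
To push the corrections back into the original data, a minimal sketch (an editorial addition, assuming the ventas2 data frame and its Province column from the question):

# look up each original spelling and substitute its corrected form
ventas2$Province <- fixed$CityCorrected[match(ventas2$Province, fixed$City)]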

digEmAll
  • It's what I need but I get an error in the last part (in # add CityCorrected column to the merged data.frame): **Error in `$<-.data.frame`(`*tmp*`, "CityCorrected", value = integer(0)) : replacement has 0 rows, data has 4** – GabyLP Sep 10 '14 at 00:02
  • @GabyP: maybe for some City you have NAs in the frequency column. The code inside the `by` function does not handle that case (see the sketch after this thread) – digEmAll Sep 10 '14 at 06:55
  • I don't have NAs in Occurr because it is a frequency. – GabyLP Sep 10 '14 at 12:28
  • For some reason, grp (which is the subset of merged having the same cluster id) gives some problems. So please add a `tryCatch` inside the `function(grp){ ... }` and print `grp` in case of error to see what the problem is... – digEmAll Sep 10 '14 at 15:21
  • I didn't use the last part. Instead I wrote: `merged <- merged[with(merged, order(ClusterId)), ]; merged1 <- data.table(merged); merged2 <- merged1[, which.max(frecuencia), by = ClusterId]; tot <- merge(x=merged, y=merged2, by="ClusterId", all.x=TRUE); f <- function(tot1){ n <- tot1$Nacionality; tot1$NacCorregida <- n[tot1$V1]; tot1 }; library(plyr); totales <- ddply(tot, .(ClusterId), f)` – GabyLP Sep 10 '14 at 15:26
  • @GabyP: good, so you've solved the problem? Well, the important thing is the clustering part after all; the replacement code is much easier :) – digEmAll Sep 10 '14 at 16:13
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/60993/discussion-between-gaby-p-and-digemall). – GabyLP Sep 10 '14 at 16:56
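
An editorial note on the error discussed above: the `by` step selects the frequency column by name, so if the real column is named differently (e.g. frecuencia, as in GabyLP's code), grp$Occurr is NULL, which.max(NULL) returns integer(0), and the assignment fails with exactly the "replacement has 0 rows" message. A defensive sketch of that step:

# variant of the by() step that fails loudly instead of cryptically
ret <- by(merged,
          merged$ClusterId,
          FUN=function(grp){
                freq <- grp$Occurr  # exact column name; adjust to your data
                if (is.null(freq) || all(is.na(freq)))
                  stop('frequency column missing or all NA in a cluster')
                grp$CityCorrected <- grp$City[which.max(freq)]
                grp
              })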

Here's my small replication of your aggregate result. You'll need to change all the data frame calls to fit the structure of your data.

df
#output
#       word freq
#1         a    1
#2         b    2
#3         c    3

#find the max frequency
mostFrequent<-max(df[,2])  #doesn't handle ties

#find the word we will be replacing with
replaceString<-df[df[,2]==mostFrequent,1]
#[1] "c"

#find all the other words to be replaced
tobereplaced<-df[df[,2]!=mostFrequent,1]
#[1] "a" "b"

Now say you have the following data frame, which contains your entire dataset; I'll just replicate a single column of words.

totalData
 #    [,1]
 #[1,] "a" 
 #[2,] "c" 
 #[3,] "b" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "a" 
 #[7,] "d" 
 #[8,] "b" 
 #[9,] "c" 

We can replace all the words in tobereplaced with replaceString in a single call:

totalData[totalData%in%tobereplaced]<-replaceString
 #    [,1]
 #[1,] "c" 
 #[2,] "c" 
 #[3,] "c" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "c" 
 #[7,] "d" 
 #[8,] "c" 
 #[9,] "c"

As you can see, all a's and b's have been replaced with c, while the other words remain unchanged.

DMT
  • I don't need to replace all the lines with the same most frequent value. I will have many different groups. For example in this case AMBA shouldn't be replaced. – GabyLP Sep 09 '14 at 20:10
  • @GabyP What separates AMBA from being included? Obviously it's probably not a misspelling of Buenos Aires, but I can only work off of what you provided above. Do you also need a method for grouping together words that are "similar"? – DMT Sep 09 '14 at 20:30
  • Yes, exactly, and then replace them with the word inside the group that has the highest frequency – GabyLP Sep 09 '14 at 20:32
  • @GabyP Could you add a more complete picture of your aggregate result to your question then? Right now I'd just group together all words by their first letter, but I'm sure you have more words than that – DMT Sep 09 '14 at 20:34
  • Ok I just added more data from the results – GabyLP Sep 09 '14 at 20:37
  • @GabyP I'd go with digEmAll's comment on your question. From what you provided there doesn't seem to be a clear way to group (does CORRIENTES map to CORDOBA? Are there other groups that are closely related?). Using that adist function, you could form groups, then use my above code to replace misspellings (a combined sketch follows below). Depending on the words you have, the grouping may not be perfect. You could form your own rule to group, as you are the one with the best knowledge of the data – DMT Sep 09 '14 at 20:45
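
Putting the two answers together, a brief editorial sketch (it assumes the merged data frame with City, Occurr and ClusterId columns built in digEmAll's answer):

# for each cluster, find the row holding the most frequent spelling,
# then use that spelling for every member of the cluster
idx <- ave(seq_len(nrow(merged)), merged$ClusterId,
           FUN=function(i) i[which.max(merged$Occurr[i])])
merged$CityCorrected <- merged$City[idx]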