
I have a database with ~5,000 locality names, most of which are repetitions with typos, permutations, abbreviations, etc. I would like to group them by similarity, to speed up further processing. Ideally I would convert each variation into a "platonic form", and put two columns side by side, with the original and platonic forms. I've read about multiple sequence alignment, but this seems to be used mostly in bioinformatics, for sequences of DNA/RNA/peptides, and I'm not sure it will work well with names of places. Does anyone know of a library that helps me do this in R? Or which of the many algorithm variations might be easiest to adapt?

EDIT: How do I do that in R? Up to now I'm using the adist() function, which gave me a matrix of distances between each pair of strings (although it doesn't treat translocations the way I think it should; see the comment below). The next step, which I'm working on right now, is to turn this matrix into a grouping/clustering of similar-enough values. Thanks in advance!
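One way to approach that clustering step (a sketch, not the code actually in progress; the sample names are invented, two of them borrowed from the comments below) is to feed the adist() matrix into hclust() and cut the tree at a maximum number of edits:

nomes <- c("Amazonas, Brasil", "Brasil, Amazonas", "Manaus", "Manaos")
d <- adist(nomes)                      # pairwise edit-distance matrix
hc <- hclust(as.dist(d), method = "average")
grupos <- cutree(hc, h = 3)            # cut at a height of 3 edits; tune to your data
split(nomes, grupos)                   # inspect the resulting groups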

EDIT: To solve the translocations problem, I wrote a small function that splits each string into words, strips any punctuation left, keeps the words with more than 2 characters, sorts them, and pastes them back together into a single string.

sep <- function(linha) {
    resp <- strsplit(linha, " |/|-")            # split on spaces, slashes and hyphens
    resp <- unlist(resp)
    resp <- gsub(",|;|\\.", "", resp)           # strip leftover punctuation
    resp <- sort(resp[which(nchar(resp) > 2)])  # keep words > 2 chars, alphabetically sorted
    paste0(resp, collapse = " ")                # rebuild a single normalized string
}

Then I apply this to every row of my table:

locs[,9] <- apply(locs, 1, function(x) sep(x[1])) # column 1 = original data; column 9 = normalized data

and finally apply adist() to the normalized column to create the distance matrix.
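To get the two-column original/"platonic" layout described at the top, one option is sketched below. It assumes locs is a data.frame with the original names in column 1 and the normalized ones in column 9, and it takes each group's most frequent original spelling as its "platonic form", which is only one possible choice:

d <- adist(locs[, 9])                  # distances between the normalized strings
hc <- hclust(as.dist(d), method = "average")
locs$grupo <- cutree(hc, h = 3)        # tune h to your data
# most frequent original spelling per group becomes its "platonic form"
platonic <- tapply(locs[, 1], locs$grupo, function(v) names(which.max(table(v))))
locs$platonic <- platonic[as.character(locs$grupo)]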

Rodrigo
    Possibly use a soundex algorithm to group names by sound. – mccainz Nov 13 '13 at 18:06
  • The names are in Portuguese, I don't think it will work, since soundex is designed for English names, right? – Rodrigo Nov 13 '13 at 19:00
  • Spanish/Portuguese generally work ok with soundex. Mileage may vary of course. Double metaphone may be more to your liking. (added link to a Brazilian Portuguese metaphone implementation) http://sourceforge.net/projects/metaphoneptbr/ – mccainz Nov 15 '13 at 13:37
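Following up on those comments: a minimal sketch of phonetic grouping with the stringdist package (the package choice is an assumption; the thread only names soundex and metaphone in general, and soundex codes are English-oriented, so results on Portuguese names may be rough):

library(stringdist)
nomes <- c("Manaus", "Manaos", "Belem")          # invented sample
phonetic(nomes)                                  # soundex codes: "M520" "M520" "B450"
stringdist(nomes[1], nomes, method = "soundex")  # 0 means the codes match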

1 Answer


There's a built-in function called adist() that computes a measure of the distance between two strings.

It's like using agrep(), except that it returns the distance itself instead of just whether the words match under some approximate-matching criterion.
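To make the difference concrete (the sample strings here are invented):

x <- c("Manaus", "Manaos", "Belem")
agrep("Manaus", x, max.distance = 1)  # indices of approximate matches: 1 2
adist("Manaus", x)                    # the distances themselves: 0 1 6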

For the special case of words that can be interchanged around a comma (e.g. "hello,world" should be close to "world,hello"), here's a quick hack. You can modify the function pretty easily if you have other special cases.

adist_special <- function(word1, word2){
    # take the smaller of the plain edit distance and the edit distance
    # after swapping the two comma-separated parts of word2
    min(adist(word1, word2),
        adist(word1, gsub("(.*),(.*)", "\\2,\\1", word2)))
}

adist("hello,world", "world,hello")

 # 8
adist_special("hello,world", "world,hello")

 # 0
kith
  • adist() did a fine job, @kith, thank you. But it doesn't treat translocations the way I expected. For instance, "Amazonas, Brasil" and "Brasil, Amazonas" should be considered very similar, but adist() gives a distance of 14! – Rodrigo Nov 13 '13 at 20:36
  • Thanks for the edit, but there are too many "special cases" for this to work (the words may be anywhere in the sentence). I'm gonna put my solution in an answer. – Rodrigo Nov 14 '13 at 13:57