I am new to R and want to compare two strings (addresses) where:

  1. Word order can differ, except for numbers (consecutive numbers must stay in the same order).

  2. Words are sometimes abbreviated, e.g. street could be st., North West could be North W.

  3. One string could contain an extra word or two (the rest of the words would be the same).

  4. One of the strings sometimes has a space inside a word, e.g. Pitampura -> Pitam pura.

    For example:

S1 = QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi

S2 = QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi

So far, I have removed the special characters, extra whitespace, and redundant words from the addresses.

Would a distance measure such as cosine or Levenshtein distance be a good choice? If so, how can I apply it in R without using any package?

I don't have the liberty to install any external packages.

Thanks in advance.

Aditya Kuls
  • Condition number 4 can be relaxed if it is not easy to implement. A solution in any other language (e.g. Python) is also welcome. – Aditya Kuls Apr 18 '18 at 04:39

1 Answer

Not a direct answer, but an idea: you could compute a score from the share of lowercase words in each string that also occur in the other, and then apply a threshold. In R this could be:

S1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
lcwords1 <- tolower(unlist(strsplit(S1, " ")))

S2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"
lcwords2 <- tolower(unlist(strsplit(S2, " ")))

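# Average of the two directional overlap ratios; assign first, then print,
# so that `score` holds the averaged value rather than the undivided sum
score <- (sum(lcwords1 %in% lcwords2) / length(lcwords1) +
          sum(lcwords2 %in% lcwords1) / length(lcwords2)) / 2
score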

This yields a score of

[1] 0.7070707

where 1 would mean that every word of each address also occurs in the other.
You'd very likely need to wrap this in a function that returns the score; see a similar post here.
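
One possible way to wrap the scoring in a function, as a minimal base-R sketch. The name `address_overlap` and the punctuation stripping are assumptions, not part of the snippet above; stripping punctuation lets tokens such as `"bagh,"` and `"bagh"` compare equal.

```r
# Hypothetical helper: symmetric word-overlap score between two addresses.
address_overlap <- function(a, b) {
  tokenize <- function(s) {
    s <- tolower(s)
    # Replace everything except letters, digits, "/" and spaces with a space
    s <- gsub("[^a-z0-9/ ]", " ", s)
    words <- unlist(strsplit(s, " +"))
    words[nzchar(words)]  # drop empty tokens left by leading spaces
  }
  w1 <- tokenize(a)
  w2 <- tokenize(b)
  # Mean of the share of w1 found in w2 and the share of w2 found in w1
  (sum(w1 %in% w2) / length(w1) + sum(w2 %in% w1) / length(w2)) / 2
}

S1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
S2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"
address_overlap(S1, S2)
address_overlap(S1, S1)  # identical addresses score 1
```

A pair of addresses could then be treated as a match when the score exceeds a threshold chosen by inspecting known matching and non-matching pairs.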

Jan
  • Thanks @Jan. The provided solution works well for Conditions 1 & 3, but I think Conditions 2 and 4 (and spelling mistakes) will have to be handled separately. I will try a dictionary for Condition 2 and regex for Condition 4, combined with another layer of distance formula on individual words to allow for spelling errors (I expect to struggle with the regex and this one). Thanks. – Aditya Kuls Apr 18 '18 at 07:10
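
The plan in the comment above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the dictionary `abbrev`, the helpers `normalize`, `merge_split`, `fuzzy_share` and `fuzzy_score`, and the tolerance `max_dist = 1` are all hypothetical choices. The only non-trivial library call is `adist()`, which ships with base R (package `utils`) and computes a generalized Levenshtein distance, so no external package is needed. A plain loop over adjacent words stands in for the regex idea for Condition 4.

```r
# Hypothetical abbreviation dictionary (Condition 2); extend as needed.
abbrev <- c(st = "street", rd = "road", w = "west", n = "north")

# Lowercase, strip punctuation, split into words, expand abbreviations.
normalize <- function(s) {
  s <- gsub("[^a-z0-9/ ]", " ", tolower(s))
  w <- unlist(strsplit(s, " +"))
  w <- w[nzchar(w)]
  hit <- w %in% names(abbrev)
  w[hit] <- abbrev[w[hit]]
  unname(w)
}

# Condition 4: join two adjacent words when their concatenation
# appears as a single word in the other address (`ref`).
merge_split <- function(w, ref) {
  out <- character(0)
  i <- 1
  while (i <= length(w)) {
    if (i < length(w) && paste0(w[i], w[i + 1]) %in% ref) {
      out <- c(out, paste0(w[i], w[i + 1]))
      i <- i + 2
    } else {
      out <- c(out, w[i])
      i <- i + 1
    }
  }
  out
}

# Share of words in w1 that have a close match in w2.
# Tokens containing digits must match exactly, so "22" never matches "23".
fuzzy_share <- function(w1, w2, max_dist = 1) {
  d <- adist(w1, w2)  # Levenshtein distances, length(w1) x length(w2) matrix
  tol <- ifelse(grepl("[0-9]", w1), 0, max_dist)
  mean(apply(d, 1, min) <= tol)
}

fuzzy_score <- function(a, b, max_dist = 1) {
  w1 <- normalize(a)
  w2 <- normalize(b)
  m1 <- merge_split(w1, w2)
  m2 <- merge_split(w2, w1)
  (fuzzy_share(m1, m2, max_dist) + fuzzy_share(m2, m1, max_dist)) / 2
}

S1 <- "QU 23/24 Shalimar Bagh, Pitampura, Street no. 22, delhi"
S2 <- "QU Flat 23/24 Pitam Pura, St. No. 22, Shalimar Bagh, Delhi"
fuzzy_score(S1, S2)
```

With these assumptions the only unmatched word in the example pair is "flat", so the score rises from about 0.71 to 0.95. The sketch does not check that consecutive numbers keep their order (the second half of Condition 1), and a tolerance of 1 can be too generous for very short words.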