I have a large dataset with roughly one million observations, each keyed with a defined observation type. About 900,000 of those observations have malformed observation types, with ~850 (incorrect) variations of the 50 acceptable observation types.

keys <- c("DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")

entries <- c("Day", "day", "SUNSET/DUSK", "DAYS", "dayy", "EVEN", "Evening", "early dusk", "late day", "nite", "red dawn", "Evening Sunset", "mid-night", "midnight", "midnite","DAY", "EVENING","SUNSET", "DUSK","NIGHT", "MIDNIGHT", "TWILIGHT", "DAWN","SUNRISE", "MORNING")

Using gsub is akin to digging a basement with a hand shovel and, in my own case, a broken-handled shovel, as I'm very new to R and to the intricacies of regular expressions. The simple fallback (for me) is to write one gsub statement for each of the accepted observation types, but that seems unnecessarily arduous, as it needs 50 statements.

I'd like to use levenshtein.distance or stringdist to replace each offending entry with the accepted string at the shortest distance. Running z <- for (i in length(y)) { z[i] = levenshtein.distance(y[i], x)} doesn't work, as it returns length(x) results for each y[i].

How do I return the result with the minimum distance? I've seen function(x) x[2], which returns the 2nd result in a series, but how do I get the lowest?

Steven Beaupré
Andrew M
  • You might want to look at the documentation of `adist()`. – RHertel Oct 22 '15 at 15:09
  • You need to specify what match you think is correct when comparing both "SUNSET" and "DUSK" to "SUNSET/DUSK". – IRTFM Oct 22 '15 at 15:17
  • "SUNSET/DUSK" should evaluate to "SUNSET" with a distance method. The nature of the dataset prevents me from determining if "DUSK" or "SUNSET" is more appropriate. – Andrew M Oct 22 '15 at 17:00

1 Answer

You could try:

library(stringdist)
m <- stringdistmatrix(entries, keys, method = "lv")
a <- keys[apply(m, 1, which.min)]
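To make the indexing step concrete, and to touch on the "how do I get the lowest?" part of the question, here is a minimal sketch on a hand-made matrix. The names toy_keys and m_toy and the distances in it are made up for illustration; they are not output from stringdistmatrix().

```r
# Toy distance matrix (made-up numbers): rows are entries, columns are keys.
toy_keys <- c("DAY", "DUSK", "DAWN")
m_toy <- matrix(c(2, 5, 1,
                  3, 3, 4),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("entry1", "entry2"), toy_keys))

idx  <- apply(m_toy, 1, which.min)  # column index of the smallest distance in each row
dmin <- apply(m_toy, 1, min)        # the smallest distance itself

toy_keys[idx]  # "DAWN" "DAY": entry2 ties between DAY and DUSK, which.min keeps the first
dmin           # 1 and 3
```

Because which.min() returns the first minimum, ties are resolved by the order of keys. Keeping dmin alongside the match is one way to flag entries whose best match is still far away.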

If you want to experiment with different algorithms, have a look at ?'stringdist-metrics'


Or, as mentioned by @RHertel in the comments:

b <- keys[apply(adist(entries, keys), 1, which.min)]
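With about a million observations but only a few hundred distinct spellings, it may be cheaper to compute distances once per unique value and map the result back. The following is a sketch under assumptions: obs is a hypothetical stand-in for the real column, and upper-casing before matching is my own choice, made because the keys are upper-case.

```r
keys <- c("DAY", "EVENING", "SUNSET", "DUSK", "NIGHT",
          "MIDNIGHT", "TWILIGHT", "DAWN", "SUNRISE", "MORNING")

# Hypothetical stand-in for the real observation-type column.
obs <- c("Day", "day", "DAY", "nite", "day", "NIGHT", "nite")

u <- unique(obs)                                    # distinct spellings only
lookup <- keys[apply(adist(toupper(u), keys), 1, which.min)]
cleaned <- lookup[match(obs, u)]                    # map back to every observation

cleaned
```

The distance matrix then has one row per distinct spelling rather than one per observation.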

From adist() documentation:

Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.

The two methods yield identical results:

> identical(a, b)
#[1] TRUE
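One caveat worth checking on your own data: edit distance is case-sensitive by default, so a lower-case entry shares no characters with upper-case keys and tends to land on the shortest key. The sketch below compares the plain call with one that upper-cases the entries first. The four entries are taken from the question; the names raw and folded are mine. adist() also documents an ignore.case argument that could be used instead of toupper().

```r
keys <- c("DAY", "EVENING", "SUNSET", "DUSK", "NIGHT",
          "MIDNIGHT", "TWILIGHT", "DAWN", "SUNRISE", "MORNING")
entries <- c("nite", "day", "midnite", "Evening")

# Case-sensitive: every lower-case letter counts as a substitution.
raw <- keys[apply(adist(entries, keys), 1, which.min)]

# Upper-case the entries first so only real spelling differences count.
folded <- keys[apply(adist(toupper(entries), keys), 1, which.min)]

data.frame(entries, raw, folded)
```

Here "nite" and "midnite" fall back to "DAY" in the case-sensitive run, but resolve to "NIGHT" and "MIDNIGHT" once case is normalized.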
Steven Beaupré
  • I cheered out loud and scared the dog! Thank you very much both of you! adist was exactly what I was looking for! HUGE SMILES. Thank you. – Andrew M Oct 22 '15 at 17:02
  • Beautiful and elegant solution! Thanks. – nd091680 Dec 19 '19 at 10:25
  • @Steven Beaupré: great answer! Can you please take a look at this question if you have time? https://stackoverflow.com/questions/70231544/re-writing-fuzzy-join-functions-from-r-to-sql Thank you! – stats_noob Dec 07 '21 at 11:12