Fuzzy match only if exact match doesn’t exist

Question

I’m trying to write a function to get album data from Spotify’s API for a data frame of albums and artists. Because there are some misspellings in the dataset, I need to use a fuzzy matching function (like agrepl).

However, some artists, like Absu, have albums that are, by agrepl's standards, the same. For example, Absu has an album named “Absu” and another named “Abzu”. I only want the data for 1 of them, but I’ll end up with data for both. I know that you can change max.distance in agrepl, but I need it set fairly low to account for greater misspellings.

Is there a pre-built function or an easy way to tell R:

if there is an exact match of album_name with mydata[["Album"]], filter on it and move on; else, try to find a close match to filter on?

Here’s something I’ve tried, but it doesn’t work:

get_album_data <- function(x) {

  get_artist_audio_features(mydata$Artist[x], return_closest_artist = TRUE) %>% 
    ifelse(album_name %in% mydata$Album[x],
           filter(mydata$Album[x] == album_name,
           filter(agrepl(mydata$Album[x], album_name, ignore.case = TRUE))))

}

This is what my code looks like without trying anything special:

library(dplyr)
library(spotifyr)
library(purrr)

# from Spotify's developer page
Sys.setenv(SPOTIFY_CLIENT_ID = "xxx")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "xxx")
access_token <- get_spotify_access_token()

Artist <- c("Spiritualized", "Fleet Foxes", "The Avalanches", "Absu")
Album <- c("Sweet Heart, Sweet Light", "Helplessness Blues", "Wildflower", "Abzu")

# data_frame() is deprecated; tibble() is re-exported by dplyr
mydata <- tibble(Artist, Album)

get_album_data <- function(x) {
  get_artist_audio_features(mydata[["Artist"]][x], return_closest_artist = TRUE) %>% 
    filter(agrepl(mydata[["Album"]][x], album_name, ignore.case = TRUE)) %>%
    mutate(Artist = mydata[["Artist"]][x])  # name the new column explicitly
}

Any ideas? Thanks

Make a list of all the album names, all the possible misspellings (most obvious), then create a ternary tree to find them. — , Feb 28 '18 at 21:31
Can you provide few rows of data and the expected output ? That'll address the problem on target I think. — YOLO, Feb 28 '18 at 21:32
@sln Thanks for the response. `mydata` here is just an example of my actual data. I have a dataframe of over 1000 albums, so I can't do that. — Evan O., Feb 28 '18 at 21:32
@ManishSaraswat It's kind of hard to show the expected output here, but, with the artist "Absu" in particular, he has albums titled "Abzu" and "Absu". In `mydata` I only want "Abzu". However, with my use of `agrepl` I get the data for both "Abzu" and "Absu" since they're -- according to my function -- the same. Is that enough, or do you want me to post actual output? — Evan O., Feb 28 '18 at 21:42
@sln Yes, but there are some misspellings in them, so I need `agrepl`. I just want my function to move on if it finds a perfect match, and, if not, find the next closest match, if that makes any sense — Evan O., Feb 28 '18 at 21:43
How perfect does it need to be. If close is good enough I would just do a `dplyr::inner_join`/ `anti_join` to separate out the exact matches and then run your function on the remainder. The `stringdist` package has more tunable options for matching than `agrep`. — Ian Wesley, Feb 28 '18 at 22:26
Just use a Levenshtein distance and return a match if the distance is small and there are no ties. — AdamO, Feb 28 '18 at 22:49
Now that I think about it, I guess I really just need to return only the closest match. The issue is, with `agrepl` all matches that are within a certain distance are returned. Any easy ways to do that? — Evan O., Feb 28 '18 at 23:16

score 0 · Answer 1 · answered Jan 15 '19 at 01:48

Maybe you can first filter out those albums that have an exact match.

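# Values with an exact match (a vector is indexed with one dimension, not [rows, ])
artist_with_exact_matches <- mydata$Artist[mydata$Artist %in% mydata$Album]
# Logical negation also works when nothing matched, unlike -which(...)
mydata_fuzzy_match <- mydata[!(mydata$Artist %in% artist_with_exact_matches), ]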

Then use fuzzy matching to find Artist and Album matches for the rest.
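
The same "exact first, fuzzy only as a fallback" idea can also be applied per album, inside the filter step. The following is only a minimal sketch in base R: match_album is a hypothetical helper name (it is not part of spotifyr or dplyr), and combining agrepl with adist to keep only the closest hit is one possible reading of the suggestions in the comments, not a tested solution against the Spotify API.

```r
# Hypothetical helper: return the positions in `candidates` that best match
# `target`. An exact match wins; otherwise only the closest fuzzy hit(s) stay.
match_album <- function(target, candidates, max.distance = 0.1) {
  # 1. Exact (case-insensitive) match: use it and skip fuzzy matching.
  exact <- which(tolower(candidates) == tolower(target))
  if (length(exact) > 0) return(exact)

  # 2. No exact match: fall back to agrepl(), as in the question.
  fuzzy <- which(agrepl(target, candidates, ignore.case = TRUE,
                        max.distance = max.distance))
  if (length(fuzzy) == 0) return(integer(0))

  # 3. Among the fuzzy hits, keep only those with the smallest edit distance.
  d <- as.vector(adist(target, candidates[fuzzy], ignore.case = TRUE))
  fuzzy[d == min(d)]
}

albums <- c("Absu", "Abzu", "Tara")
match_album("Abzu", albums)                     # exact match, "Absu" is ignored
match_album("Abzzu", albums, max.distance = 2)  # misspelled, closest hit only
```

Inside get_album_data, the filter line could then become something like slice(match_album(mydata[["Album"]][x], album_name)). Because every track row of the same album shares the same album_name, all rows of the chosen album are kept. If two different albums tie for the smallest distance, both are returned, so ties would still need a rule of their own.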
