Partial Case-Insensitive String Search in R

Question

I have a column, spotify$gens, that contains the genre descriptions for each album.

For example: head(spotify$gens) gives

gens = c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock", 
"Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues", "Piano Blues, Chicago Blues")

I want to use this vector of key genres that I have made:

keyGenres = c("Pop","Rock","Hip Hop","Latin",
  "Dance","Electronic","R&B","Country","Folk",
  "Acoustic","Classical","Metal","Jazz","New Age",
  "Blues","World","Traditional")

to match against spotify$gens and return the matching part of the string.

I have this code right now:

for (i in seq_along(spotify$gens)){
  for (genre in keyGenres){
    if( spotify$gens[i] %ilike% keyGenres[genre]){
       spotify$gens[i] <- keyGenres[genre]
    } else{
      spotify$gens[i] = spotify$gens[i]
    }}}

but it gives me this error: Error in if (spotify$gens[i] %ilike% keyGenres[genre]) { : missing value where TRUE/FALSE needed

As an example of the result I want, spotify$gens[1] = "Jazz Fusion" should become spotify$gens[1] = "Jazz".

Some albums have more than one genre, and I want to return only the first string that is matched.

Can anyone help me out? Thank you!!

Sounds like you have some missing values in your `gens` column. `if()` doesn't like missing values. Try `if( spotify$gens[i] %ilike% keyGenres[genre] & !is.na(spotify$gens[i]) )` — Gregor Thomas, Aug 18 '22 at 02:16
@GregorThomas Didn't seem to do it for me :( There doesn't seem to be any missing values in `gens` — Melody, Aug 18 '22 at 02:24
Ah, simpler problem. You wrote your loop using `genre` as an integer, so you need the `seq_along` in `for (genre in seq_along(keyGenres))`. — Gregor Thomas, Aug 18 '22 at 02:33

score 0 · Answer 1 · answered Aug 18 '22 at 02:38

The problem with your loop is that you're using genre as an integer index, but for (genre in keyGenres) iterates over the genre names themselves, so keyGenres[genre] is NA. You need seq_along, as in for (genre in seq_along(keyGenres)):

library(data.table)  # provides %ilike%

for (i in seq_along(gens)) {
  for (genre in seq_along(keyGenres)) {
    # case-insensitive partial match against the current key genre
    if (gens[i] %ilike% keyGenres[genre]) {
      gens[i] <- keyGenres[genre]
    }
  }
}
gens
# [1] "Jazz"  "Rock"  "Rock"  "Rock"  "Blues" "Blues"
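
To see why the original loop fails, here is a minimal sketch in base R. It uses grepl(..., ignore.case = TRUE) as a stand-in for %ilike% (an assumption, made so the example needs no packages), and the short keys vector is made up for illustration.

```r
keys <- c("Pop", "Rock", "Jazz")  # illustrative subset of keyGenres

# for (genre in keys) hands out the values "Pop", "Rock", "Jazz", not 1, 2, 3.
# Indexing an unnamed vector by a string finds no such name and yields NA.
bad_pattern <- keys["Pop"]
is.na(bad_pattern)

# An NA pattern makes the match NA, and if (NA) raises
# "missing value where TRUE/FALSE needed".
bad_match <- grepl(bad_pattern, "Jazz Fusion", ignore.case = TRUE)
is.na(bad_match)

# Either index by position (seq_along) or use the loop value directly.
x <- "Jazz Fusion"
for (genre in keys) {
  if (grepl(genre, x, ignore.case = TRUE)) x <- genre
}
x
```

Using the loop value directly, as in the last few lines, is an alternative to seq_along when the position itself is not needed.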

The loops can also be eliminated with str_replace_all, which is vectorized and accepts a named vector whose names are regex patterns and whose values are the replacements. This will be much more efficient:

library(stringr)
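# names are lower-case regex patterns, values are the replacement genres
pat_replace = setNames(keyGenres, paste0(".*", tolower(keyGenres), ".*"))
# lower-case the input so the match is case-insensitive
# (strings that match no key genre come back lower-cased)
result = str_replace_all(tolower(gens), pattern = pat_replace)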
result
# [1] "Jazz"  "Rock"  "Rock"  "Rock"  "Blues" "Blues"
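
If you would rather avoid the extra dependency, roughly the same result can be sketched in base R. The helper name first_key_match is made up for this example. It lower-cases both sides and matches literally (fixed = TRUE), so any regex metacharacters in a genre name would be treated as plain text, and strings with no match are returned unchanged.

```r
gens <- c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock",
          "Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues",
          "Piano Blues, Chicago Blues")
keyGenres <- c("Pop", "Rock", "Hip Hop", "Latin", "Dance", "Electronic", "R&B",
               "Country", "Folk", "Acoustic", "Classical", "Metal", "Jazz",
               "New Age", "Blues", "World", "Traditional")

# Hypothetical helper: return the first element of `keys` (in `keys` order)
# found anywhere in each string, ignoring case.
first_key_match <- function(x, keys) {
  vapply(x, function(s) {
    hit <- vapply(keys,
                  function(k) grepl(tolower(k), tolower(s), fixed = TRUE),
                  logical(1))
    if (any(hit)) keys[which(hit)[1]] else s
  }, character(1), USE.NAMES = FALSE)
}

result_base <- first_key_match(gens, keyGenres)
result_base
# [1] "Jazz"  "Rock"  "Rock"  "Rock"  "Blues" "Blues"
```

This still loops internally through vapply, so it is a convenience rather than a speed-up over the explicit for loops.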

Using this data:

gens = c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock", 
"Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues", "Piano Blues, Chicago Blues")
keyGenres = c("Pop","Rock","Hip Hop","Latin","Dance","Electronic","R&B","Country","Folk","Acoustic","Classical","Metal","Jazz","New Age","Blues","World","Traditional")
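
Note that both approaches above pick the first match in keyGenres order, which is why "Latin Rock, Progressive Rock" becomes "Rock". If "first" is instead meant as the genre that appears earliest in the string, a sketch along these lines might do. first_in_string is a hypothetical helper, not part of any package, and it assumes literal, case-insensitive matching is what you want.

```r
gens <- c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock",
          "Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues",
          "Piano Blues, Chicago Blues")
keyGenres <- c("Pop", "Rock", "Hip Hop", "Latin", "Dance", "Electronic", "R&B",
               "Country", "Folk", "Acoustic", "Classical", "Metal", "Jazz",
               "New Age", "Blues", "World", "Traditional")

# Hypothetical helper: return the key genre whose match starts earliest
# in each string; strings with no match are returned unchanged.
first_in_string <- function(x, keys) {
  vapply(x, function(s) {
    # start position of each key in s, or -1 when the key is absent
    pos <- vapply(keys,
                  function(k) as.integer(regexpr(tolower(k), tolower(s), fixed = TRUE)),
                  integer(1))
    pos[pos < 0] <- NA_integer_
    if (all(is.na(pos))) s else keys[which.min(pos)]
  }, character(1), USE.NAMES = FALSE)
}

result_pos <- first_in_string(gens, keyGenres)
result_pos
# [1] "Jazz"  "Latin" "Rock"  "Blues" "Blues" "Blues"
```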
