
My question builds on the topic of matching a string against multiple patterns. One solution discussed here is to use `sapply(keywords, grepl, strings, ignore.case=TRUE)`, which yields a two-dimensional logical matrix.

However, I run into significant speed issues when applying this approach to 5K+ keywords and 60K+ strings (I cancelled the process after 12 hours).

One idea is to use hash tables, i.e. environments in R. However, I don't understand how to "translate/convert" my strings into an environment while keeping the numerical index.

I have strings[1]... till strings[60000]

e <- new.env(hash=TRUE)

for (i in seq_along(strings)) {
    assign(x=i, value=strings[i], envir=e)  # fails: x must be a character string
}

As x in assign must be a character string, I can't use it like this, but I hope you get the idea. I want to be able to index the environment with the same numbers as in my strings[...] vector.

Thanks for your help!

yumba
  • The lookup dictionary is what you put into the environment, not your strings. So the keyword is what would be used for the lookup. The hash (environment) lookup works like a two-column matrix/data frame in which you look up a and b is returned. So the strings don't really go there. Also, I'm guessing that what's really slowing you down is the `grepl`. In any event, this is not a reproducible example. Please post data and the code you've tried thus far. Please don't merely reference a previous question; give data for each question. – Tyler Rinker Sep 06 '13 at 15:50

1 Answer


R environments are not used as much as Perl hashes are, I think simply because there are no widely understood idioms for doing so. In your case the key question is: do you really want the numerical index? If so, it should be the value. The key is your string; that's the whole point of the exercise.

e <- new.env(hash=TRUE)
strings <- as.character(chickwts$feed) # note! not unique
# key = string, value = its index; a repeated key overwrites the earlier index
for (i in seq_along(strings)) assign(strings[i], i, envir=e)
e$horsebean   # returns 10

In this example only the last index associated with each string is kept, but you can assign anything that might be useful to each key, such as a vector of indices.
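To keep every index rather than only the last one, something along these lines might work. It is a minimal sketch on a made-up `strings` vector (an assumption, not the asker's data) that uses `split()` to group positions by string and `list2env()` to load the result into a hashed environment:

```r
# Hypothetical toy data standing in for the real strings
strings <- c("apple", "banana", "apple", "cherry", "banana", "apple")

# Group the positions 1..n by string: a named list, one entry per unique string
idx <- split(seq_along(strings), strings)

# Load the named list into a hashed environment (key = string, value = indices)
e <- list2env(idx, envir = new.env(hash = TRUE))

e$apple    # all positions where "apple" occurs
e$cherry   # a single position
```

Each key now maps to an integer vector, so duplicated strings no longer overwrite one another.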

You can then look up your data in a number of ways. For example, you can search the keys by regular expression using ls() and retrieve the values using mget():

# find all keys containing 'bean'
ls(e, pattern='bean')
# retrieve bean data
mget(ls(e, pattern='bean'), envir=e)
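If the strings contain many duplicates, one possible way to combine this with the `grepl` approach from the question is to match the keywords against the unique keys only and then map the hits back to positions through the environment. The sketch below uses made-up `strings` and `keywords` (both assumptions), lower-cases the keys in place of `ignore.case=TRUE`, and assumes the keywords are literal text so that `fixed = TRUE` applies. Whether it helps depends on how many duplicates the real data has.

```r
# Hypothetical toy data; the real strings and keywords are not shown in the thread
strings  <- c("Horse bean", "soy bean", "horse bean", "linseed", "soy bean")
keywords <- c("bean", "seed")

# key = lower-cased string, value = every position where that string occurs
e <- list2env(split(seq_along(strings), tolower(strings)),
              envir = new.env(hash = TRUE))

keys <- ls(e)  # unique strings only

# One column per keyword, one row per unique key.
# fixed = TRUE treats each keyword as literal text (an assumption here).
hits <- sapply(keywords, grepl, keys, fixed = TRUE)
rownames(hits) <- keys

# Map the keys that matched "bean" back to positions in `strings`
bean_positions <- sort(unlist(mget(keys[hits[, "bean"]], envir = e),
                              use.names = FALSE))
bean_positions
```

The pattern matching runs over three unique keys instead of five strings here; the environment is only used for the final step of recovering the original indices.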
  • Thanks for your reply! But I don't see how this can help me perform a lookup. I would need to apply `grepl` to the values of the environment (= character strings); `sapply(keywords, grepl, e, ignore.case=TRUE)` does not work. – yumba Sep 06 '13 at 10:01
  • I have added a bit on looking things up in environments, but I don't know enough about your requirements to know whether it's helpful. – Leo Schalkwyk Sep 18 '13 at 13:29