For context, I asked a question earlier today about identifying companies in a big list of company names. The names appear in many variations, and I tried to match them using the "stringdist" function from the stringdist package. This is the question I asked.

Unfortunately, I have not been able to make any improvements to my code, which is why I'm starting to look away from stringdist and try something completely different.

I use RStudio, and I've noticed that the built-in search function in that program is much more effective:

[Image: RStudio search results for "AMMINEX"]

As you can see in the picture, simply searching for the company name in the top right corner gives me the output that I'm looking for, such as the longer name "AMMINEX EMISSIONS..." and "AMMINEX AS".

However, in my previous attempt with the stringdist function (see the link to my previous question) I would get results like "LAMINEX" which are not at all relevant, but would appear before the more useful matches:

[Image: stringdist results, with "LAMINEX" listed before the more useful matches]

So it seems like the algorithm that RStudio uses is much more effective in my case. However, I'm not sure whether it's possible to replicate this algorithm in code, instead of having to search for each company manually.

Assuming I have a data frame that looks like this:

Company_list <- data.frame(Companies=c('AMMINEX', 'Microsoft', 'Apple'))

What would be a way for me to search for all 3 companies at the same time and get the same type of results in a data frame, like RStudio does in the first image?

WoeIs

1 Answer

From your description of which results are good or bad, it sounds like you want exact matches of a substring rather than strings that are merely close on those distance measures. In that case you can imitate RStudio's search function with grepl:

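library(tidyverse)
# Demo data: 12 names such as "abc 17", four for each prefix
demo.df <- data.frame(name = paste(rep(c("abc", "jkl", "xyz"), each = 4), sample(1:100, 4 * 3)), limbs = 1:4 * 3)
# Keep only the rows whose name contains "abc" or "xyz"
demo.df %>% filter(grepl("abc|xyz", name))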

Here the pipe character in the grepl pattern string means 'or', which lets you search for multiple companies at the same time. So, to search for the names from the example data frame, the pattern string would be paste0(Company_list$Companies, collapse = "|"). Is this what you're after?
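To make that concrete for the data frame in the question, here is a minimal sketch in base R. The contents of Biglist are made up for illustration (only the column name Person_name_clean comes from the comments below), and ignore.case = TRUE is an assumption about how forgiving the search should be.

```r
# Search terms from the question
Company_list <- data.frame(Companies = c("AMMINEX", "Microsoft", "Apple"),
                           stringsAsFactors = FALSE)

# Hypothetical stand-in for the big list; these rows are invented for the example
Biglist <- data.frame(
  Person_name_clean = c("AMMINEX AS", "LAMINEX", "Microsoft Denmark",
                        "APPLE RETAIL", "Some Other Co"),
  stringsAsFactors = FALSE
)

# One pattern that matches any of the company names: "AMMINEX|Microsoft|Apple"
pattern <- paste0(Company_list$Companies, collapse = "|")

# All rows of the big list that contain at least one of the names
matches <- Biglist[grepl(pattern, Biglist$Person_name_clean, ignore.case = TRUE), ,
                   drop = FALSE]

# Same search done per company, so each hit is labelled with its search term
results <- do.call(rbind, lapply(Company_list$Companies, function(co) {
  hits <- grep(co, Biglist$Person_name_clean, ignore.case = TRUE, value = TRUE)
  data.frame(Search = rep(co, length(hits)), Match = hits,
             stringsAsFactors = FALSE)
}))

print(matches)
print(results)
```

Because this is plain substring matching, "LAMINEX" is not returned for "AMMINEX", but a short term such as "Apple" would also hit any longer name that happens to contain it. Company names containing regex metacharacters (for example "." or "(") would need escaping, or a per-name search with fixed = TRUE.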

steveLangsford
  • Thanks for the suggestion! And yes, that is what I would like to do. I did some digging around with grep too and came to this command: greptest <- data.frame(grep('AMMINEX', Biglist$Person_name_clean, value=TRUE)) Can you tell me what the difference would be between my grep function, and your grepl function? – WoeIs Mar 31 '18 at 15:22
  • The return values are different. grepl returns logical values, a vector of true/false with the meaning matches/doesn't match. grep usually returns a vector of match indexes, but with value=TRUE it returns the values at those indexes. These are all useful things in different situations. I used the vector of true/false values here because that's a convenient thing to pass to filter. – steveLangsford Apr 01 '18 at 01:59
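The three return types described in the comment above can be compared side by side on a small vector. The vector x is made up for illustration.

```r
# Invented example names
x <- c("AMMINEX AS", "LAMINEX", "AMMINEX EMISSIONS")

# grepl: one TRUE/FALSE per element, convenient for filter() or [ ] subsetting
is_match <- grepl("AMMINEX", x)
print(is_match)    # TRUE FALSE TRUE

# grep: positions of the matching elements
match_idx <- grep("AMMINEX", x)
print(match_idx)   # 1 3

# grep with value = TRUE: the matching elements themselves
match_val <- grep("AMMINEX", x, value = TRUE)
print(match_val)   # "AMMINEX AS" "AMMINEX EMISSIONS"
```

All three describe the same set of matches, so x[is_match], x[match_idx] and match_val give identical results.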