Find indices of elements in a dataframe based on character comparison (using e.g. grepl)

Question

I have a large dataframe containing bank characteristica from which i want so filter certain banks based on their names. Assume df is my large dataframe, containing inter alia the bank names:

df <- data.frame(bank = c("BNP PARIBAS", "DEUTSCHE BANK AG", "HSBC HOLDINGS PLC", 
                          "BANCO SANTANDER SA", "BOCE GROUP", "BARCLAYS BANK PLC",
                          "HSBC INVEST PLC"))
> df
                bank
1        BNP PARIBAS
2   DEUTSCHE BANK AG
3  HSBC HOLDINGS PLC
4 BANCO SANTANDER SA
5         BOCE GROUP
6  BARCLAYS BANK PLC
7    HSBC INVEST PLC

A second dataframe names contains the names of all banks that i want to filter from df:

names <- data.frame(name = c("HSBC", "BANK OF AMERICA", "DEUTSCHE BANK", "BANK OF CANADA",
                              "COMMERZBANK", "SANTANDER"))
> names
             name
1            HSBC
2 BANK OF AMERICA
3   DEUTSCHE BANK
4  BANK OF CANADA
5     COMMERZBANK
6       SANTANDER

For each element in names i want to know the index of the rows in df that match for example the first 5 letters of the name. I imagined something like this

library(dplyr)
names %>%
  mutate(index = grep(substr(name, 1, 5), df$bank))

which results in

Error in mutate_impl(.data, dots) : 
  wrong result size (2), expected 6 or 1
In addition: Warning message:
In grep(substr(c(5L, 1L, 4L, 2L, 3L, 6L), 1, 5), df$bank) :
  argument 'pattern' has length > 1 and only the first element will be used

I can use grep for single elements, giving me the desired results, for the first row of namee.g.:

> grep(substr(names$name[1], 1, 5), df$bank)
[1] 3 7

How can get a solution that works for the whole vector name? Notice that some banks in bank have multiple entries (for different companies belonging to the same group), which i want to include all.

A quick and dirty solution could be `names %>% rowwise() %>% mutate(index = list(grep(substr(name, 1, 5), df$bank)))`. This puts the output of `grep` into a list, which avoids issues with different output sizes. — coffeinjunky, May 05 '17 at 11:03

score 2 · Accepted Answer · answered May 05 '17 at 11:27

In my experience,this is hard problem to fully resolve as bank names are tricky with various jurisdictions, short forms, entity names, M&A activities etc.

I have had moderate success with stringdist function from stringdist package as first pass filter to narrow the focus on more involved cases.

It also helps with excluding common names prefixes/suffixes of banks such as BANK|BANCO|BANKIA|BANC etc. and retain only unique portion of the bank names

#The specification `1 - stringdist("abc",c("abcd","acd"),method="jw",p=0.1)` gives simmilarity instead of distance
#see ?stringdist for examples


#acceptance simmilarity threshold, only cases above threshold will be retained
threshold = 0.7


mappedNames = names %>% 
group_by(name) %>%
mutate(index = paste0(which((1 - stringdist(name,gsub("BANK|BANCO","",DF$bank),method='jw',p=0.1))> threshold),collapse=",")) %>%
as.data.frame()

mappedNames
#            name index
#1            HSBC   3,7
#2 BANK OF AMERICA      
#3   DEUTSCHE BANK     2
#4  BANK OF CANADA      
#5     COMMERZBANK      
#6       SANTANDER     4

score 1 · Answer 2 · answered May 05 '17 at 11:10

1

sapply(names$name,function(x){grep(x,df$bank)}) or lapply(names$name,function(x){grep(x,df$bank)}) might be what you want.

answered May 05 '17 at 11:10

Gaurav Taneja

1,084
1
8
19

score 1 · Answer 3 · answered May 05 '17 at 11:37

An approach using fuzzyjoin. You can extend the function to add conditions. Here there is matching with 5 character substr and stringr::str_detect.

df <- data.frame(bank = c("BNP PARIBAS", "DEUTSCHE BANK AG", "HSBC HOLDINGS PLC", 
                          "BANCO SANTANDER SA", "BOCE GROUP", "BARCLAYS BANK PLC",
                          "HSBC INVEST PLC"), stringsAsFactors = FALSE)

names <- data.frame(name = c("HSBC", "BANK OF AMERICA", "DEUTSCHE BANK", "BANK OF CANADA",
                            "COMMERZBANK", "SANTANDER"), stringsAsFactors = FALSE)

library(fuzzyjoin)
library(stringr)
fuzzy_left_join(df, names, by = c("bank" = "name"), 
                 match_fun = function(x, y) {
                   substr(x,1,5) == substr(y,1,5) | str_detect(x, y)})

Find indices of elements in a dataframe based on character comparison (using e.g. grepl)

3 Answers3