Clustering similar strings in a big dataset

Question

My data looks like this:

    comp_name                                            perm_id
    GM Global Technologies Operations LLC                16002
    GM Global Technologies Operations, Inc.              NA
    International Business Machines Corporation (IBM)    87001
    International Business Machines Corp (IBM)           NA

In short, several comp_name strings may refer to the same company, but one or more of them have a missing perm_id. I want to fill these NAs using the perm_id of the similar strings that do have one. Note that the data has more than 400k rows.

score 1 · Answer 1 · answered Apr 03 '20 at 10:55

I think it depends on the actual structure of the data. In your example, the names of each company start with the same text, so you could group the rows by a fixed-length prefix of the name. This doesn't work if the names differ at the beginning or in the middle. For your example, though, the code below works:

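    library(tidyverse)
    d <- data.frame(
      comp_name = c("GM Global Technologies Operations LLC",
                    "GM Global Technologies Operations, Inc.",
                    "International Business Machines Corporation (IBM)",
                    "International Business Machines Corp (IBM)"),
      value = c(1, NA, 2, NA))
    d %>% 
      # group rows by the first 20 characters of the company name
      mutate(comparison = substr(comp_name, 1, 20)) %>% 
      arrange(comp_name) %>% 
      group_by(comparison) %>% 
      # fill with the known value of the group; stay NA if the whole group is NA
      # (a bare max(value, na.rm = TRUE) would return -Inf with a warning there)
      mutate(new_value = if (all(is.na(value))) NA_real_ else max(value, na.rm = TRUE))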
    #> # A tibble: 4 x 4
    #> # Groups:   comparison [2]
    #>   comp_name                               value comparison        new_value
    #>   <fct>                                   <dbl> <chr>                 <dbl>
    #> 1 GM Global Technologies Operations LLC       1 GM Global Techno…         1
    #> 2 GM Global Technologies Operations, Inc.    NA GM Global Techno…         1
    #> 3 International Business Machines Corp (…    NA International Bu…         2
    #> 4 International Business Machines Corpor…     2 International Bu…         2

Created on 2020-04-03 by the reprex package (v0.3.0)
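If the names differ at the beginning or in the middle, a fixed prefix will not group them. One alternative is to cluster on an edit distance instead. What follows is only a minimal sketch using base R (`adist`, `hclust`, `cutree`, `ave`), not a tested solution for your data. The 0.3 cut-off, the helper `fill_id`, and the column name `perm_id_filled` are my own assumptions and would need tuning. Also note that `adist` builds a full pairwise matrix, which is not feasible for 400k rows at once, so you would have to split the data into smaller blocks first (for example by first letter or first word, again an assumption about your data).

```r
# Toy data mirroring the question (perm_id is NA where unknown)
d <- data.frame(
  comp_name = c("GM Global Technologies Operations LLC",
                "GM Global Technologies Operations, Inc.",
                "International Business Machines Corporation (IBM)",
                "International Business Machines Corp (IBM)"),
  perm_id = c(16002, NA, 87001, NA),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances, scaled by the length of the longer string
# so that long and short names are comparable (values between 0 and 1)
dist_raw <- adist(d$comp_name, ignore.case = TRUE)
max_len  <- outer(nchar(d$comp_name), nchar(d$comp_name), pmax)
dist_rel <- dist_raw / max_len

# Hierarchical clustering; the cut height 0.3 is an assumed threshold
hc <- hclust(as.dist(dist_rel), method = "average")
d$cluster <- cutree(hc, h = 0.3)

# Hypothetical helper: fill NAs with the first known id in the cluster,
# and leave the cluster untouched if no id is known at all
fill_id <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) == 0) return(x)
  ifelse(is.na(x), known[1], x)
}
d$perm_id_filled <- ave(d$perm_id, d$cluster, FUN = fill_id)

print(d[, c("comp_name", "cluster", "perm_id_filled")])
```

On the four sample rows the two GM names end up in one cluster and the two IBM names in another, so each NA is filled from its neighbour. If a cluster contains more than one distinct perm_id, this sketch simply keeps the known ids and uses the first one for the NAs, so you should inspect such clusters before trusting the result.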

Thank you very much! As you said, a part in the middle or at the beginning may be different in the dataset. — Enes, Apr 03 '20 at 11:04
Are the entries ordered, i.e. are the different company names next to each other in the dataset? — MKR, Apr 03 '20 at 14:26
