4

This is my first post/question so be kind. I have a dataframe like this:

        id                             product
1 00109290                    Wax Salt; Pepper
2 23243242                          Wood Stuff
3 23242433   Magic Unicorn Powder and My Tears
4 23778899                             gelatin
5 25887766                                tin;
6  7786655             fart noises, and things
7  3432422 --spearmint bacon& hydrangia leaves

I have a lookup table like this:

        ingredients
1               wax
2              salt
3              wood
4          my tears
5    unicorn powder
6           gelatin
7               tin
8  hydrangia leaves
9         spearmint
10            bacon

I want to merge them on whole strings so I get this:

     id                             product      ingredients
1  00109290                    Wax Salt; Pepper              wax
2  00109290                    Wax Salt; Pepper             salt
3  23243242                          Wood Stuff             wood
4  23242433   Magic Unicorn Powder and My Tears         my tears
5  23242433   Magic Unicorn Powder and My Tears   unicorn powder
6  23778899                             gelatin          gelatin
7  25887766                                tin;              tin
8   3432422 --spearmint bacon& hydrangia leaves hydrangia leaves
9   3432422 --spearmint bacon& hydrangia leaves        spearmint
10  3432422 --spearmint bacon& hydrangia leaves            bacon

Instead I get this (notice line 7 not wanted):

         id                             product      ingredients
1  00109290                    Wax Salt; Pepper              wax
2  00109290                    Wax Salt; Pepper             salt
3  23243242                          Wood Stuff             wood
4  23242433   Magic Unicorn Powder and My Tears         my tears
5  23242433   Magic Unicorn Powder and My Tears   unicorn powder
6  23778899                             gelatin          gelatin
7  23778899                             gelatin              tin
8  25887766                                tin;              tin
9   3432422 --spearmint bacon& hydrangia leaves hydrangia leaves
10  3432422 --spearmint bacon& hydrangia leaves        spearmint
11  3432422 --spearmint bacon& hydrangia leaves            bacon

I am so painfully close, but I am matching incorrectly to 'gelatin' with 'tin'. I want to match whole words, not parts of words. I've tried many different techniques, the closest that gets met there is this:

library(sqldf)
id <- c('00109290', '23243242', '23242433', 
        '23778899', '25887766', '7786655', 
        '3432422')
product <- c('Wax Salt; Pepper', 'Wood Stuff', 
             'Magic Unicorn Powder and My Tears', 
             'gelatin', 'tin;', 'fart noises, and things', 
             '--spearmint bacon& hydrangia leaves')

ingredients <- c('wax', 'salt', 'wood', 'my tears', 
                 'unicorn powder', 'gelatin', 'tin', 
                 'hydrangia leaves', 
                 'spearmint', 'bacon') 

products <- data.frame(id, product)
ingred <- data.frame(ingredients)    
new_df <- sqldf("SELECT * from products 
                 join ingred on product LIKE '%' || ingredients || '%'")

Truly appreciate any advice. Perhaps a completely different approach is needed? I also welcome advice on the quality of the question, it's my first so you'd better set me straight right away.

DieselBlue
  • 137
  • 6
  • 2
    What is a whole word for you? A chunks of characters between spaces? – Wiktor Stribiżew Jun 23 '17 at 20:22
  • Really great question. Because if it was characters between white spaces then I would not match 'salt' to 'salt;', and my dataset has that type of thing. I do not want to match within a word. So 'salt' should match 'salt;' but not 'salty' and 'tin' should match 'tin' or ';tin' but not 'gelatin'. With your one question I realize...this is most likely a regex thing and I don't think I can use sqldf. Thank you! – DieselBlue Jun 23 '17 at 20:29
  • 2
    I have checked the [docs](https://cran.r-project.org/web/packages/sqldf/sqldf.pdf) but cannot find any mentioning that it supports a regex. In MySQL, `REGEXP` operator is present, and to mark word boundaries, `[[:<:]]` and `[[:>:]]` are used. In PostgreSQL, `\y` is used. In regex, `\b` marks a word boundary. Unfortunately, I do not think this is feasible with `sqldf`. – Wiktor Stribiżew Jun 23 '17 at 20:34
  • @WiktorStribiżew I think by keeping existing solution with LIKE, and adding a step with regex to filter out *wrong* matches could work? – zx8754 Jun 23 '17 at 20:37
  • 2
    Can you make a hopefully short list of non-space things which are not part of words? Your sample input seem to only contain ";" on that list. I would add "," and "&" definitly. – Yunnosch Jun 23 '17 at 21:44
  • I have modified the post to include more variations with "--", "," and "&". Thanks for the recommendation Yunnosch, I'll keep this in mind for the future. – DieselBlue Jun 23 '17 at 22:15

2 Answers2

3

A solution using the fuzzyjoin package, and str_detect from stringr:

library(fuzzyjoin)
library(stringr)

f <- function(x, y) {
  # tests whether y is an ingredient of x
  str_detect(x, regex(paste0("\\b", y, "\\b"), ignore_case = TRUE))
}

fuzzy_join(products, 
           ingred, 
           by = c("product" = "ingredients"), 
           match_fun = f)
#         id                           product    ingredients
# 1   109290                  Wax Salt; Pepper            wax
# 2   109290                  Wax Salt; Pepper           salt
# 3 23243242                        Wood Stuff           wood
# 4 23242433 Magic Unicorn Powder and My Tears       my tears
# 5 23242433 Magic Unicorn Powder and My Tears unicorn powder
# 6 23778899                           gelatin        gelatin

Data

products <- read.table(text = "
        id                             product
1 00109290                  'Wax Salt; Pepper'
2 23243242                        'Wood Stuff'
3 23242433 'Magic Unicorn Powder and My Tears'
4 23778899                             gelatin                  
  ", stringsAsFactors = FALSE)

ingred <- read.table(text = "
       ingredients
1              wax
2             salt
3             wood
4       'my tears'
5 'unicorn powder'
6          gelatin
7              tin
  ", stringsAsFactors = FALSE)

Scarabee
  • 5,437
  • 5
  • 29
  • 55
1

Consider adding OR conditions for one space before or after the key word and then exact match and with a replacement of any special characters/punctuation.

new_df <- sqldf("SELECT * from products 
                 join ingred on Replace(product, ';', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, ';', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, ';', '') = ingredients
                ")

You can even UNION for different special characters. Below example replaces semicolon and exclamation point:

new_df <- sqldf("SELECT * from products 
                 join ingred on Replace(product, ';', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, ';', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, ';', '') = ingredients
                 UNION    
                 SELECT * from products 
                 join ingred on Replace(product, '!', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '!', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '!', '') = ingredients
                ")

And for many UNIONs, consider having R concatenate SQL statements:

sql <- paste(lapply(c("!", "#", "$", "%", "(", ")", ":", ";", ".", "?", ">", "<", "/", "\\\\", "|"), 
             function(i)
                paste0("SELECT * from products 
                        join ingred on Replace(product, '", i, "', '') LIKE '% ' || ingredients || '%'
                                    OR Replace(product, '", i, "', '') LIKE '%' || ingredients || ' %'
                                    OR Replace(product, '", i, "', '') = ingredients
                       ")
       ), collapse = "UNION ")

cat(paste(sql))

SELECT * from products 
                   join ingred on Replace(product, '!', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '!', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '!', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '#', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '#', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '#', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '$', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '$', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '$', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '%', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '%', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '%', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '(', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '(', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '(', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, ')', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, ')', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, ')', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, ':', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, ':', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, ':', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, ';', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, ';', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, ';', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '.', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '.', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '.', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '?', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '?', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '?', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '>', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '>', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '>', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '<', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '<', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '<', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '/', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '/', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '/', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '\\', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '\\', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '\\', '') = ingredients
                   UNION SELECT * from products 
                   join ingred on Replace(product, '|', '') LIKE '% ' || ingredients || '%'
                             OR Replace(product, '|', '') LIKE '%' || ingredients || ' %'
                             OR Replace(product, '|', '') = ingredients
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Oh my...why didn't I think of this? My first post/question and now I'm embarrassed. Thanks so much. – DieselBlue Jun 23 '17 at 21:39
  • 2
    I think "gelatin " is like "%tin %" (between first and second "OR"). And it would be problematic for "gelatin and salt". – Yunnosch Jun 23 '17 at 21:41
  • 2
    @DieselBlue I actually asked about it in my first comment. I would have posted this answer if you did not mention you need to match `tin` in `tin;`. – Wiktor Stribiżew Jun 23 '17 at 21:46
  • @Wiktor Stribiżew ... you are correct this won't match 'tin;' and dziękuję za pomoc. – DieselBlue Jun 23 '17 at 22:25
  • 1
    @DieselBlue ... see update where you can use `Replace` to remove special characters like `;`. – Parfait Jun 24 '17 at 14:35