Finding most similar phrases

Question

I have two data sets. The first holds a repair description, for example:

Electric Component keyboard replacement

The second data set holds all the repair descriptions for customers who had the previous repair phrase and later had another repair description. E.g.:

Electric Keyboard replace
Monitor Component Replacement
Mouse component
Wire Replacement
PIN part

So for this example I would like it to pick "Electric Keyboard replace" from the second set as the most similar phrase to "Electric Component keyboard replacement".

DATA NAME;
INFILE DATALINES DSD; 
length FIRST $ 1000;
INPUT FIRST $;
DATALINES;
Electric Component keyboard replacement
;

DATA COMPONENT;
INFILE DATALINES DSD; 
length FIRST_B $ 1000;
INPUT FIRST_B $;
DATALINES;
Electric Keyboard replace
Monitor Component Replacement
Mouse component
Wire Replacement
PIN part
;

PROC SQL;
CREATE TABLE Possible_Matches AS
SELECT *
FROM Name AS n, COMPONENT AS b
WHERE (n.FIRST  =* b.FIRST_B);
QUIT;

It worked using the sounds-like operator (=*), and I was excited. But when I changed "Electric Keyboard replace" to "keyboard component replace", it did not identify a match and gave me an empty data set. I tried COMPARE too, but could not get it to work. I tried this approach because I had seen examples of name and email ID correction or matching. Can similar phrases also be matched using these functions? Is there any other way to achieve this? Normally my matches will have rearranged words, extra words, or shortened words (like "replacement" to "replace").

COMPGED or SOUNDEX are two other options you can look at. https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2886-2018.pdf — Reeza, May 25 '18 at 15:10

score 1 · Answer 1 · answered Oct 03 '18 at 11:52

I managed to do something similar with names and addresses using COMPGED too! Create a data set that pairs the field you need to scan from your initial table with every type of repair you need to match, so each record to be scanned is multiplied by the number of repair types. You end up with something like this (sorry, can't make a proper table here):

Field 1 - Field 2
Electric Component keyboard replacement - Electric Keyboard replace
Electric Component keyboard replacement - Monitor Component Replacement
Electric Component keyboard replacement - Mouse component
Electric Component keyboard replacement - Wire Replacement
Electric Component keyboard replacement - PIN part

From there you run COMPGED on those two fields, and it gives you a numeric score for each pair: the lower the number, the closer the two strings.

compged(string1, string2);

Then rank the COMPGED results and query that table to fetch only the record with the lowest COMPGED value.
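The steps above could be sketched roughly like this. It assumes the NAME and COMPONENT data sets from the question; the table names scored and best_match and the column aliases phrase, candidate and ged are made up for illustration, and the upcase/strip normalization is my own addition rather than something the approach requires.

```sas
proc sql;
  /* Step 1: pair every phrase with every candidate (Cartesian join)
     and score each pair with COMPGED. Lower score = closer match.
     upcase() makes the comparison case-insensitive; strip() drops
     leading and trailing blanks from the $1000 fields. */
  create table scored as
  select n.FIRST   as phrase,
         b.FIRST_B as candidate,
         compged(strip(upcase(n.FIRST)), strip(upcase(b.FIRST_B))) as ged
  from NAME as n, COMPONENT as b;

  /* Step 2: for each phrase keep only the lowest-scoring candidate(s).
     PROC SQL remerges min(ged) back onto the detail rows, so ties
     are all kept. */
  create table best_match as
  select phrase, candidate, ged
  from scored
  group by phrase
  having ged = min(ged);
quit;
```

Because COMPGED is character-based, reordered words can still score poorly, so it is worth looking over the scored table before trusting best_match.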

Note that this won't guarantee that your two sentences are a real match, but it will keep only the ones that are most likely to match!

Here is the doc on COMPGED.
