0

I am trying to find the most suitable method (that I could use in VBA) in order to compare thousands of records from column A to data in column B.

The example of the data can be seen below:

Column1                                Column2
Modra Digest (DC)                      Oldstewart2
South West Local /Sunday Times (new)   Oldstewart
OldStewart political print             Saigon Last month Saigon Last month
Oldstewart2 Local print (Former)       Modra Digest Velehrad Digest (used via Bembek) 
Saigon Last month                      South West Local South West Local /Sunday Times

This is what the table looks like in Excel:

So far, I have tried the Fuzzy lookup (Excel add in) but it doesn't do the job exactly right. I have also tried the Levenshtein Distance method, which seems to work relatively fine once few tweaks are implemented (trim, removal of brackets, de-duplication of words in the string), however, the challenge appears when there is only one matching word in the string (despite it being a "keyword" - please see the "oldstewart" example above), as this method compares the number of changes that need to take place (edit, deletion, replacement etc.) in order to match a string to one another.

The desired result would ideally look somewhat like this, as you can see the matching percentage is lower for the above mentioned examples due to stipulated reasons. (Sorry about the pics, but I am struggling post the table in the same structure as I managed in the original post.*)

The desired output:

*Kindly note that this is not a cross-post, as I was advised to break this project into smaller pieces for better convenience. The broader picture regarding what I'm trying to achieve (incl. the function defined for the Levenshtein distance method) can be found here: String matching in VBA using a predefined function.

Any advice regarding the methodology etc. is highly appreciated - as someone may have dealt with something similar in the past.

Thank you VERY much for your help!

GSerg
  • 76,472
  • 17
  • 159
  • 346
JAYK
  • 15
  • 4

2 Answers2

2

I don't know much about this topic, but fuzzy matching in Power Query, with a threshold of 0.50, followed by a remove duplicates, seems to work for your data example. Might require some tweaking for your real data.

  • Make the two columns two different tables

  • Data/Get & Transform/From Table/Range for each table

  • Then, from the PQ UI, choose to Merge Queries

    • I chose Left OuterJoin
    • Use Fuzzy Matching
    • Threshold of 0.5
  • Select the first column and Remove Duplicates

All of the above steps can be done from the Power Query UI

Results

enter image description here

You could also do this by converting the two columns into a single table, and entering the code below into the Advanced Editor:

The code splits the table into two tables, and then uses the FuzzyJoin/Remove Duplicates method to combine them.

let
    Source = Excel.CurrentWorkbook(){[Name="Table10"]}[Content],
    tbl1 = Table.FromList(Source[Column1], Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Changed Type1" = Table.TransformColumnTypes(tbl1,{{"Column1", type text}}),
        tbl2 = Table.FromList(Source[Column2], Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Changed Type" = Table.TransformColumnTypes(tbl2,{{"Column1", type text}}),
        tbl3 = Table.FuzzyNestedJoin(#"Changed Type1","Column1",#"Changed Type","Column1","Matched",JoinKind.LeftOuter,
            [IgnoreCase = true, IgnoreSpace= false, Threshold = 0.5]),
    #"Expanded Matched" = Table.ExpandTableColumn(tbl3, "Matched", {"Column1"}, {"Matched"}),
    #"Removed Duplicates" = Table.Distinct(#"Expanded Matched", {"Column1"})
in
    #"Removed Duplicates"
Ron Rosenfeld
  • 53,870
  • 7
  • 28
  • 60
  • Thank you so much for the above, just about to give it a try! Is there any way to write a macro that does this with a click of a button or are Macros not compatible with the Power Query at all? – JAYK Jul 14 '20 at 08:37
  • @JAYK Just write your macro to click the `Refresh` – Ron Rosenfeld Jul 14 '20 at 09:38
  • Thanks for the tip! would that work even if I wanted to insert any data into column A/B and then follow the entire procedure as described above (e.g. the number of records differs by thousands) etc., however the matching technique would remain the same. Thanks a bunch – JAYK Jul 14 '20 at 10:29
  • Hi Ron, when running the code, I get the following error: Expression.Error: The column 'Column1' of the table wasn't found. Let me just describe the steps that I take: 1) I paste the data to match in column A and Column B. 2)I click on Get data from table range and select A:B as a table range. 3) I insert your code in the Advanced editor and click done.. Any idea how to define the columns so it matches your code? Thank you very much! Also, just wanted to ask what is the output for those records that do not match (as I can see that the matching records are placed next to one another)? Thank you – JAYK Jul 14 '20 at 12:33
  • @JAYK On the worksheet, what are the column names? When you see the error message in the PQ UI, there should be code exposed in the area above. Can you copy/paste that code line into a comment here? – Ron Rosenfeld Jul 14 '20 at 13:55
  • @JAYK What happens when records don't match is determined by the value of the `JoinKind` parameter. I used `JoinKind.LeftOuter`, which ensures that all rows of the first table appear in the result. But there are six different values for that parameter (do a search for PowerQuery JoinKind) and you can specify various behaviors. – Ron Rosenfeld Jul 14 '20 at 20:43
  • Thank you Ron. I have managed to get get everything working, but still find this tool a little bit insufficient to distinguish between records..for example, when I set the "sensitivity" of fuzzy match to value of 0.65, some of the matches like e.g. Mississuaga vs. Mississippi Sports OR daily mirror (Russia) vs. Daily financial times (Russia) are still picked up as a match-which is not supposed to be... when I increase the sensitivity too high, the other matches that I mentioned above (in the initial post) will be no longer picked up...is there any way to tweak this tool to cater for both? – JAYK Jul 15 '20 at 19:04
  • @JAYK Not that I know of. There are plenty of non-Excel fuzzy matching tools in the universe. Perhaps some of them are more accurate, but you'd have to investigate that. – Ron Rosenfeld Jul 16 '20 at 01:04
  • Thanks again, Ron, for all your help! I am thinking that due to diversity of input, some records may still require manual review anyway, however, this solution already saves plenty of time! BTW, before I close this topic, would it be possible to compare multiple columns using this technique? Say I would like to find a match of an item in column D within column C,B and A? Could you please suggest how to adjust the current set-up? Thanks again - it is much appreciated. – JAYK Jul 16 '20 at 08:20
  • @JAYK I'd try making columns C, B and A into a single column. There's no requirement that there be the same number of entries in column D. – Ron Rosenfeld Jul 16 '20 at 10:01
1

I had a similar issue. Use this VBA module (https://github.com/kyledeer-32/vba_fuzzymatching.git) which contains 4 fuzzy matching UDFs. I used it with your data and got these results:

enter image description here

Here is a formula view:

enter image description here

The UDFs in the image (above), are from this VBA module. The "=fuzzy_match" function partitions the strings being traversed by turning sentences into substring arrays using (" ") as the delimiter, then matches by each sub-string in the array. The "=bestword" function returns the word/sub-string that best matched your string in column1.

Note: you will have to enable the "Microsoft Scripting Runtime" library in the Visual Basic Editor Window after importing this VBA module for it to run. This can be done by:

From Excel Workbook:

  • Select Developer tab on ribbon
  • Select Visual Basic
  • Select Tools on the Toolbar
  • Select References
  • Scroll down until you see Microsoft Scripting Runtime, then check the box
  • Press OK
Kyle Deer
  • 91
  • 8