1

this is my first post and I'm something of a novice.

I'm trying to extract a list of fuzzy duplicates from and table of company names using SQLite with sqlite3.

For example, for a company like 'ZtPay' I'm currently using:

SELECT name FROM tab WHERE name LIKE 'Z_Pay');

SELECT name FROM tab WHERE name LIKE 'Zt_ay');

etc... To account for typos.

My problem is that if there is no typo then I just output the original company name. Ideally I would like only to output the original name if there was a fuzzy duplicate found by the LIKE.

I know this is very wrong but I want something along the lines of:

SELECT name FROM tab WHERE name LIKE 'Z_Pay' IF ATLEAST 2 name LIKE 'Z_Pay' 

Thanks in advance for any help you can give me.

juergen d
  • 201,996
  • 37
  • 293
  • 362

1 Answers1

0

You can determine if there is more than one name by looking at the min() and max():

SELECT name 
FROM tab 
WHERE name LIKE 'Zt_ay%' 
group by name
having min(name) <> max(name)

Alternatively, you could use count(distinct):

SELECT name 
FROM tab 
WHERE name LIKE 'Zt_ay%' 
group by name
having count(distinct name) > 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks, but I think I needed to be more specific. For example, I have these 3 entries. Complete Business Services Camplete Business Services ZTPay I only want to output Complete Business Services & Camplete Business services, i.e the Original and the fuzzy, but not ZTPay as it has no fuzzy. – Amman Singh Sohi Jul 05 '13 at 14:14
  • @AmmanSinghSohi . . . Your question (as I interpret it) says that you have the logic for the fuzzy matches and are trying to return the cases where there is a similar name. If you don't really have that logic, I'd suggest asking another question and including sample data and expected results. – Gordon Linoff Jul 05 '13 at 14:23
  • @GordronLinoff My logic's problem is that as well as returning a fuzzy and a correct record together (Complete and Camplete), it will return a correct record (ZTPay) that has no fuzzy. I would only like to see correct records if they have a fuzzy. I know this next code is wrong but it explains what I'd like IF count(Name LIKE 'C_omplete Business Services') > 1 SELECT Name FROM Real WHERE Name LIKE 'C_mplete Business Services' That way no records without fuzzys would be produced as every record with a fuzzy would have at least 2 results for the LIKE search. Thank you again for your time. – Amman Singh Sohi Jul 05 '13 at 15:38
  • @AmmanSinghSohi . . . What you describe is what these `having` clauses do. – Gordon Linoff Jul 05 '13 at 16:34
  • @GordronLinoff HAVING does not seem to work for me. For example, SELECT Name FROM tab WHERE name LIKE '_ntertek' GROUP BY name HAVING count(name) > 1; ; If I had three names, Intertek, Ontertek and Entertek, this HAVING clause would return no results. I want to say something like, SELECT Name FROM tab WHERE name LIKE '_ntertek' GROUP BY name HAVING count(name LIKE '_ntertek') > 1; ; How would I achieve this? Thanks once more for your time, it much appreciated. – Amman Singh Sohi Jul 08 '13 at 10:00
  • I realise it may be worth mentioning that there are no exact duplicates in the data, only fuzzy ones. Therefore there are no duplicate fuzzies. E.g Micrsoft, Microsaft & Microseft could be present, but Microsoft, Microsaft & Microsaft could not be. – Amman Singh Sohi Jul 08 '13 at 10:38
  • @AmmanSinghSohi: Do you want to return *all* the fuzzy names or just one (when they are more than one)? If only one, then which? – Andriy M Jul 09 '13 at 13:01
  • Sorry, I wasn't being clear on that part so posted it in another question. http://stackoverflow.com/questions/17532083/sqlite-fuzzy-duplicate-search-using-like/17541735?noredirect=1#17541735 – Amman Singh Sohi Jul 09 '13 at 14:16