Is it possible to GROUP BY treating certain characters the same?

Question

I have tables with data as such:

forename        surname
John            O'Neil
Aaron           O Neil
Peter           O-Neil
Mary            Tollmache-Tollmache
Beatrice        Tollmache Tollmache

All the data has been sanatised so they only contain characters from the Latin and Latin supplement character tables, as well as space ( ), appostrophe (') and hyphen (-).

Is it possible to treat the characters space ( ), appostrophe (') and hyphen (-) the same when performing a GROUP BY.

So, using the above table as an example, the following query:

SELECT surname, COUNT(*) AS cnt FROM tbl GROUP BY surname;

Would return:

surname                cnt
O'Neil                 3
Tollmache-Tollmache    2

UPDATE: Jim's suggestion is a useful one if you are, as I am, using this query to build other tables, but probably would not be a good idea for deployment.

Using Jim's suggestion, I ran the following tests on tables with the surname column indexed:

550,000 row table: 4.98 seconds
3.375 million row table: 27.62 seconds
12.485 million row table: 1 min 33 seconds

score 1 · Accepted Answer · answered Aug 12 '14 at 15:23

1

You can use REPLACE to convert those characters to spaces (or to apostrophes e.t.c.) and check:

SELECT surname, 
COUNT(*) AS cnt 
FROM tbl 
GROUP BY REPLACE(REPLACE(surname,"-"," "),"'"," ");

It may be worth ensuring that the data in the DB is in the format that you want rather than converting it on the fly however.

answered Aug 12 '14 at 15:23

Jim

22,354
6
52
80

Thanks for the suggestion. I need the data to remain as it is, but still treat them as the same for the purpose of the GROUP BY. I'll investigate this more, and accept your answer if it proves the best solution. – Kohjah Breese Aug 12 '14 at 15:25

Is it possible to GROUP BY treating certain characters the same?

1 Answers1