I have tables with data as such:
forename surname
John O'Neil
Aaron O Neil
Peter O-Neil
Mary Tollmache-Tollmache
Beatrice Tollmache Tollmache
All the data has been sanatised so they only contain characters from the Latin and Latin supplement character tables, as well as space ( ), appostrophe (') and hyphen (-).
Is it possible to treat the characters space ( ), appostrophe (') and hyphen (-) the same when performing a GROUP BY.
So, using the above table as an example, the following query:
SELECT surname, COUNT(*) AS cnt FROM tbl GROUP BY surname;
Would return:
surname cnt
O'Neil 3
Tollmache-Tollmache 2
UPDATE: Jim's suggestion is a useful one if you are, as I am, using this query to build other tables, but probably would not be a good idea for deployment.
Using Jim's suggestion, I ran the following tests on tables with the surname column indexed:
550,000 row table: 4.98 seconds
3.375 million row table: 27.62 seconds
12.485 million row table: 1 min 33 seconds