1

Is this possible ?

Let's say I have a list of addresses, with a lot of duplicate entries. I need to filter out all the duplicates, because some addresses have slightly different names but the same postalcode and telephone number.

First I do a GROUP BY on postalcode and telephone.

SELECT name, address, postalcode, place, phone
FROM addresses
GROUP BY postalcode, phone

But then I get random names. I would like to get the best name, that is, the name with the most entries per postalcode/phone.

So I thought of the following. Here I use the SUBSTRING_INDEX function to only get the first item in the group_concat (there are no names with the string '~~' in it):

SELECT SUBSTRING_INDEX(
         GROUP_CONCAT(DISTINCT name ORDER BY COUNT(name) DESC SEPARATOR '~~')
       , '~~', 1),
       address,
       postalcode,
       place,
       phone
FROM addresses
GROUP BY postalcode, telephone

but I get an 'invalid use of group function'.

How do I get the GROUP_CONCAT to order by the number of times the name occurs ?

Dylan
  • 9,129
  • 20
  • 96
  • 153
  • And what should happen with `address` and `place`? They might have different values too which might not be related to the returned name. – Martin Smith Aug 18 '11 at 00:16
  • Well, I use a postalcode table to check address/place from postalcode, so it's basically names I need to filter. (In Holland we have a different postalcode for each street) – Dylan Aug 18 '11 at 00:19

1 Answers1

1

Found a solution myself, with a subquery:

SELECT 
  SUBSTRING_INDEX(
    GROUP_CONCAT(DISTINCT name ORDER BY CountName DESC SEPARATOR '||')
  , '||', 1),
  address,
  postalcode,
  place,
  phone
FROM (

  SELECT name, address, postalcode, place, phone , COUNT(name) AS CountName
  FROM addresses
  GROUP BY name, postalcode, phone
  ORDER BY COUNT(name) DESC

) as a
GROUP BY postalcode, phone

I wonder if it can be done without a subquery.

Dylan
  • 9,129
  • 20
  • 96
  • 153