
Using MySQL 5.1.66-0+squeeze1-log on Debian, I get a GROUP BY result I don't understand.

If I GROUP BY data, unequal data values are combined, which doesn't make any sense to me. If I GROUP BY a hash value on the same column, SHA1(data), everything works fine and only equal values for data are combined in the group.

What is going on here? It almost seems like the GROUP BY only considers the first x characters of the column. If not that, why else could this happen? Is it maybe just a twist in my brain?

EDIT 1: An example value for data (JSON-encoded legacy format - back when I was even dumber ;) ):

{"a":[{"val":{"tcn":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"},"sic":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"}}}],"b":[{"val":{"tcn":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"},"sic":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"}}}],"0":[{"val":{"com":{"able":"2"}},"str":{"com":{"comm":"According","src":{"1":{"name":"law 256","articles":"B2\/2.11","links":"","type":""},"2":{"name":"law 298","articles":"B.19\/2.3","links":"","type":""}}}}}]}

EDIT 2: Sorry for leaving out the code; I thought it would make the question shorter and easier to follow. Obviously the opposite is the case...

SELECT
    GROUP_CONCAT(resid) AS ids
    ,data
FROM resdata
GROUP BY data

vs.

SELECT
    GROUP_CONCAT(resid) AS ids
    ,CAST(SHA1(data) AS CHAR(40)) AS hash
    ,data
FROM resdata
GROUP BY hash
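
To quantify the difference, the two queries can be wrapped in derived tables and their group counts compared (just a sketch reusing the table and column names above):

-- Number of groups when grouping on the TEXT column directly
SELECT COUNT(*) AS groups_by_data
FROM (
    SELECT GROUP_CONCAT(resid) AS ids, data
    FROM resdata
    GROUP BY data
) AS t1;

-- Number of groups when grouping on a hash of the same column
SELECT COUNT(*) AS groups_by_hash
FROM (
    SELECT GROUP_CONCAT(resid) AS ids, SHA1(data) AS hash
    FROM resdata
    GROUP BY hash
) AS t2;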
    Please show us your code. – Kermit Sep 03 '13 at 16:58
  • Please show the full SQL statement. I wouldn't be surprised if that is a result of MySQL's loose implementation of the `GROUP BY` operator: http://www.mysqlperformanceblog.com/2006/09/06/wrong-group-by-makes-your-queries-fragile/ –  Sep 03 '13 at 16:58
  • Is this question about `JSON` or about MySQL? – Kermit Sep 03 '13 at 17:03
  • @FreshPrinceOfSO: This is about MySQL's GROUP BY. @a_horse_with_no_name: The loose implementation affects the columns that you DO NOT group by, isn't that right? I DO group by `data` though. – Chonez Sep 03 '13 at 17:09
  • @Chonez Correct. See my answer. – Kermit Sep 03 '13 at 17:10

1 Answer


I finally figured it out. The problem only occurs when a GROUP_CONCAT() is present, as also discussed in GROUP_CONCAT() row count when grouping by a text field (which I only found after figuring out that it was related to the concat :s ).

ORDER BY, DISTINCT and (indirectly) GROUP_CONCAT() all depend on the max_sort_length system variable. Any query that uses these operators/functions only compares the first max_sort_length bytes of the column; in my case that was the default of 1024 bytes.
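
You can check the current limit and, if needed, raise it for your session (a sketch; 1024 is the 5.1 default, 8192 is just an illustrative value, and the variable has a server-side maximum):

-- Show the limit that ORDER BY / DISTINCT / GROUP_CONCAT() comparisons use
SHOW VARIABLES LIKE 'max_sort_length';

-- Raise it for the current session so more of the TEXT column is compared;
-- pick a value larger than your longest data value
SET SESSION max_sort_length = 8192;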

Though GROUP BY itself doesn't use ORDER BY, GROUP_CONCAT() by default applies an ORDER BY on the column that you use in the GROUP BY clause. (Thanks to Saharsh Shah, Jan 4 at 12:42)

Most of the values in my data column are much longer than max_sort_length. In my case there are 377 rows whose first 1024 bytes match those of another row while the rest differs, so DISTINCT and GROUP BY return only 2360 rows even though there are 2737 different values.
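
To see where those numbers come from, you can compare the count of truly distinct values (via a hash) with the count of distinct 1024-byte prefixes (a sketch using the default max_sort_length from above; LEFT() counts characters, which matches bytes for this plain-ASCII JSON):

SELECT
    COUNT(DISTINCT SHA1(data))       AS full_values,    -- 2737 in my table
    COUNT(DISTINCT LEFT(data, 1024)) AS prefixes_1024   -- 2360 in my table
FROM resdata;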

So be careful when grouping on a text column whose values are longer than your max_sort_length! It might not give the distinct results you are used to from INTs and shorter CHARs. DISTINCT shows the same behaviour, which gives you a false positive when you use it to check the completeness of the GROUP BY.
