Select all records where last n characters in column are not unique

Question

I have bit strange requirement in mysql. I should select all records from table where last 6 characters are not unique.

for example if I have table:

I should select row 1 and 3 since last 6 letters of this values are not unique.

Do you have any idea how to implement this? Thank you for help.

Use substring to get the last 6 characters. https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_substring — P.Salmon, Apr 17 '19 at 07:19

Ulrich Thomas Gabor · Answer 1 · 2019-04-17T07:44:32.853

1

Something like that should work:

SELECT `mytable`.*
FROM (SELECT RIGHT(`value`, 6) AS `ending` FROM `mytable` GROUP BY `ending` HAVING COUNT(*) > 1) `grouped`
INNER JOIN `mytable` ON `grouped`.`ending` = RIGHT(`value`, 6)

but it is not fast. This requires a full table scan. Maybe you should rethink your problem.

edited Apr 17 '19 at 07:44

answered Apr 17 '19 at 07:24

Ulrich Thomas Gabor

6,584
4
27
41

This query also generates an error because of a incorrect use of group by/having when `sql_mode=only_full_group_by` – Joakim Danielson Apr 17 '19 at 07:39
@JoakimDanielson Yes, you are right, I forgot the grouping. Fixed it now. – Ulrich Thomas Gabor Apr 17 '19 at 07:40
Hmm, no error now but it returns the opposite result as I understands it, the rows with unique values rather than the ones with duplicate values – Joakim Danielson Apr 17 '19 at 07:44
@JoakimDanielson Yes, noticed it too. I though he wanted the unique values, the constraint needed to be inverted, fixed that too. – Ulrich Thomas Gabor Apr 17 '19 at 07:45

FanoFN · Answer 2 · 2019-04-17T08:38:22.007

1

EDITED: I had a wrong understanding of the question previously and I don't really want to change anything from my initial answer. But if my previous answer is not acceptable in some environment and it might mislead people, I have to correct it anyhow.

SELECT GROUP_CONCAT(id),RIGHT(VALUE,6)
FROM table1
GROUP BY RIGHT(VALUE,6) HAVING COUNT(RIGHT(VALUE,6)) > 1;

Since this question already have good answers, I made my query in a slightly different way. And I've tested with sql_mode=ONLY_FULL_GROUP_BY. ;)

edited Apr 17 '19 at 08:38

answered Apr 17 '19 at 07:27

FanoFN

6,815
2
13
33

This query generates an error when I try it on dbfiddle using MySql 8, did you test it? – Joakim Danielson Apr 17 '19 at 07:34
You are mixing grouped and non-grouped columns in your select. This is not allowed in clean SQL: https://stackoverflow.com/a/5920259/3340665 MySQL allows it, but the results are nondeterministic. – Ulrich Thomas Gabor Apr 17 '19 at 07:35
@JoakimDanielson, I've tested on MySQL 4.1, MariaDB 10.1 and MariaDB 10.3. Seems to be fine – FanoFN Apr 17 '19 at 07:40
1

Well not with sql_mode=only_full_group_by. See comment above from @GhostGambler – Joakim Danielson Apr 17 '19 at 07:41
Is sql_mode=only_full_group_by set by default? I've just tested this on MySQL 5.6 and it still work – FanoFN Apr 17 '19 at 07:46
No, it is not set by default, but it should be set, because like I said, you basically get random results. – Ulrich Thomas Gabor Apr 17 '19 at 07:47
1

Ok, I understand now. AND I think I understand the question incorrectly as well – FanoFN Apr 17 '19 at 07:57
1

Now it works fine, let's see what OP thinks about the GROUP_CONCAT :) – Joakim Danielson Apr 17 '19 at 08:41

score 1 · Answer 3 · answered Apr 17 '19 at 07:33

1

I uses a JOIN against a subquery where I count the occurences of each unique combo of n (2 in my example) last chars

SELECT t.*
FROM t
JOIN (SELECT RIGHT(value, 2) r, COUNT(RIGHT(value, 2)) rc 
      FROM t 
      GROUP BY r) c ON c.r = RIGHT(value, 2) AND c.rc > 1

answered Apr 17 '19 at 07:33

Joakim Danielson

43,251
5
22
52

Comparing execution plans, your query is actually better than mine. (Still bad, because full table scan, but it is optimized better.) – Ulrich Thomas Gabor Apr 17 '19 at 07:50
@GhostGambler You can minimize the full table scan by creating an index for `RIGHT(value, 2)` if the RDBMS is capable of it, e.g., Postgres – Michael Buen Apr 17 '19 at 13:40

kiks73 · Answer 4 · 2019-04-17T09:36:23.313

1

This is what you need: a subquery to get the duplicated right(value,6) and the main query yo get the rows according that condition.

SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
    SELECT RIGHT(`value`,6)
    FROM t
    GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1);

UPDATE

This is the solution to avoid the mysql error in the case you have sql_mode=only_full_group_by

SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
    SELECT DISTINCT right_value FROM (
        SELECT RIGHT(`value`,6) AS right_value, 
               COUNT(*) AS TOT
        FROM t
        GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1)  t2
        )

Fiddle here

edited Apr 17 '19 at 09:36

answered Apr 17 '19 at 07:39

kiks73

3,718
3
25
52

Not sure how good this is compared to our approaches. In the past MySQL behaved often badly when using `IN`. Nevertheless, execution plan right now seems to be ok. Upvote. – Ulrich Thomas Gabor Apr 17 '19 at 08:00
@JoakimDanielson Where is this violated? The outer query has no group, the inner one has one, but only the grouped column is selected – Ulrich Thomas Gabor Apr 17 '19 at 08:32
@GhostGambler It is the HAVING clause that is the issue, it is not using the same aggregated function as the GROUP BY, here is my [dbfiddle](https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=d6fae3418eabe0ac851af7b9e87348d2) – Joakim Danielson Apr 17 '19 at 08:36
@JoakimDanielson Interesting. Thanks! – Ulrich Thomas Gabor Apr 17 '19 at 09:50
The new solution seems unnecessary complicated to me. – Ulrich Thomas Gabor Apr 17 '19 at 09:52
Upvoted (the first query) as your answer is the fastest and straightforward – Michael Buen Apr 17 '19 at 12:55
2

@JoakimDanielson I feel the later version of MySQL 5.7 and 8.0 are retarded. There's nothing unusual with kiks73's answer. MySQL 5.6 runs his code just fine. Only when you run it in 5.7 and 8.0 that MySQL complain. MySQL 5.7 and 8.0's behavior feels more like a bug, as his code will practically run all RDBMS, Postgres, SQL Server, Oracle, I think even Microsoft Access :D His code has no problem in MySQL 5.6: https://www.db-fiddle.com/f/j8eq3BThgkRA1rCc5JtThS/0 – Michael Buen Apr 17 '19 at 13:07
@MichaelBuen You are right, this seems to be a limitation only in later versions of MySql. Down vote removed. – Joakim Danielson Apr 17 '19 at 13:13
For completeness. MySQL 5.6: https://dbfiddle.uk/?rdbms=mysql_5.6&fiddle=524e000e568061f9288f0716e6b5ec7c Oracle 11g: https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=4fbb5e0800631c42233b19eae564c910 Postgres 11: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=524e000e568061f9288f0716e6b5ec7c SQL Server 2017: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=524e000e568061f9288f0716e6b5ec7c SQLIte 3.8: https://dbfiddle.uk/?rdbms=sqlite_3.8&fiddle=da6814f1f6a11065cd27cbb8dd995ad6 – Michael Buen Apr 17 '19 at 13:29
DB2 OK too: https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=be5b14c9ec901e1823ab6bedcbb8365f – Michael Buen Apr 17 '19 at 13:32
Instead of doing a contortion with DISTINCT, subquery and other things, you can use `any_value`. Here: `SELECT t.* FROM t WHERE RIGHT(value,2) IN (SELECT RIGHT(any_value(value),2) FROM t GROUP BY RIGHT(value,2) HAVING COUNT(*) > 1);`. That's better than the DISTINCT you use to avoid the MySQL error. Really, more than anything, your first query is correct, MySQL 5.7 and 8.0 implementation of `only_full_group_by` is broken on some things that should work – Michael Buen Apr 18 '19 at 01:49
Work-around for 8.0 without disabling `only_full_group_by` https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=506cfd908367b6b069b11d8113d64655 Use `any_value`. But that's hardly a good work-around, the query should legitimately work even without any_value as the SELECT expression is deterministic (`RIGHT(value, 2)`) since it's a mirror copy of what's in GROUP BY. Like with all RDBMS, it should work even without `any_value`. MySQL 5.7 and 8.0 are the odd ones out. – Michael Buen Apr 18 '19 at 04:35

Michael Buen · Answer 5 · 2019-04-17T13:49:47.230

1

~~Might be a fast code, as there is no counting involved.~~

Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/0

select *
from tbl outr
where not exists
(
    select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
    from tbl inr
    where 
        inr.id <> outr.id
        and right(inr.value, 6) = right(outr.value, 6)  
)

Output:

| id  | value           |
| --- | --------------- |
| 2   | aaaaaaaaaaaaaa  |
| 4   | aaaaaaaaaaaaaaB |
| 5   | Hello           |

The logic is to test other rows that is not equal to the same id of the outer row. If those other rows has same right 6 characters as the outer row, then don't show that outer row.

UPDATE

I misunderstood the OP's intent. It's the reversed. Anyway, just reverse the logic. Use EXISTS instead of NOT EXISTS

Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/3

select *
from tbl outr
where exists
(
    select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
    from tbl inr
    where 
        inr.id <> outr.id
        and right(inr.value, 6) = right(outr.value, 6)  
)

Output:

| id  | value       |
| --- | ----------- |
| 1   | abcdePuzzle |
| 3   | abcPuzzle   |

UPDATE

Tested the query. The performance of my answer (correlated EXISTS approach) is not optimal. Just keeping my answer, so others will know what approach to avoid :)

GhostGambler's answer is faster than correlated EXISTS approach. For 5 million rows, his answer takes 2.762 seconds only:

explain analyze                                   
SELECT
    tbl.*
FROM
    (
        SELECT
            RIGHT(value, 6) AS ending
        FROM
            tbl
        GROUP BY
            ending
        HAVING
            COUNT(*) > 1
    ) grouped
    JOIN tbl ON grouped.ending = RIGHT(value, 6)

My answer (correlated EXISTS) takes 4.08 seconds:

explain analyze
select *
from tbl outr
where exists
(
    select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
    from tbl inr
    where 
        inr.id <> outr.id
        and right(inr.value, 6) = right(outr.value, 6)          
)

Straightforward query is the fastest, no join, just plain IN query. 2.722 seconds. It has practically the same performance as JOIN approach since they have the same execution plan. This is kiks73's answer. I just don't know why he made his second answer unnecessarily complicated.

So it's just a matter of taste, or choosing which code is more readable select from in vs select from join

explain analyze
SELECT *
FROM tbl
where right(value, 6) in 
    (
        SELECT
            RIGHT(value, 6) AS ending
        FROM
            tbl
        GROUP BY
            ending
        HAVING
            COUNT(*) > 1
    )

Result:

Test data used:

CREATE TABLE tbl (
  id INTEGER primary key,
  value VARCHAR(20)
);

INSERT INTO tbl
  (id, value)
VALUES
  ('1', 'abcdePuzzle'),
  ('2', 'aaaaaaaaaaaaaa'),
  ('3', 'abcPuzzle'),
  ('4', 'aaaaaaaaaaaaaaB'),
  ('5', 'Hello');


insert into tbl(id, value)
select x.y, 'Puzzle'
from generate_series(6, 5000000) as x(y);

create index ix_tbl__right on tbl(right(value, 6));

Performances without the index, and with index on tbl(right(value, 6)):

JOIN approach:

Without index: 3.805 seconds

With index: 2.762 seconds

IN approach:

Without index: 3.719 seconds

With index: 2.722 seconds

edited Apr 17 '19 at 13:49

answered Apr 17 '19 at 09:06

Michael Buen

38,643
9
94
118

This is evaluated as `DEPENDENT SUBQUERY`, which is likely slower than all other solutions. – Ulrich Thomas Gabor Apr 17 '19 at 09:55
@GhostGambler Not all correlated subqueries are slow. If it to be contrasted with `COUNT` approach (think COUNT having million rows), the COUNT will continue counting regardless if it already satisfy the sameness/uniqueness against the outer table. Unlike with correlated subquery approach that uses `EXISTS/NOT EXISTS`, it will stop evaluating as soon as the criteria is met – Michael Buen Apr 17 '19 at 09:57
RDBMS can utilize the `EXISTS/NOT EXISTS` hint to short-circuit the evaluation as soon as at least one criteria is met – Michael Buen Apr 17 '19 at 10:01
1

Maybe you are right, but I guess the same optimization could trigger when checking if `COUNT > x`, since also in that case you do not actually need to compute the exact count. I guess your solution is worth to try with real data and the used MySQL version. – Ulrich Thomas Gabor Apr 17 '19 at 10:05
1

@GhostGambler You're right. `COUNT > x` approach is faster. Perhaps the table-deriving nature of the `COUNT > x` approach made the query inside materializable to a table. It's materializable since it is not correlated/dependent on the source table. And then the one-time materialized table can be JOIN'd / IN'd against the other table. Unlike the `EXISTS` approach, where each row needed to be evaluated one-by-one against the correlated query – Michael Buen Apr 17 '19 at 12:36
The `IN` approach simply does not work with `sql_mode=only_full_group_by`. – Ulrich Thomas Gabor Apr 17 '19 at 18:37
@GhostGambler `IN` approach works. It works on MySQL 5.6 https://dbfiddle.uk/?rdbms=mysql_5.6&fiddle=524e000e568061f9288f0716e6b5ec7c Oracle 11g: https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=4fbb5e0800631c42233b19eae564c910 Postgres 11: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=524e000e568061f9288f0716e6b5ec7c SQL Server 2017: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=524e000e568061f9288f0716e6b5ec7c SQLite 3.8 https://dbfiddle.uk/?rdbms=sqlite_3.8&fiddle=da6814f1f6a11065cd27cbb8dd995ad6 DB2: https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=be5b14c9ec901e1823ab6bedcbb8365f – Michael Buen Apr 18 '19 at 01:23
@GhostGambler MySQL's purpose for `sql_mode=only_full_group_by` is well-intentioned, as it prevents [non-deterministic values from selected fields](https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html) when doing aggregation. `sql_mode=only_full_group_by` works, as it prevent things that should not work(non-deterministic values), but MySQL team implementation of it is broken, as it also prevent some other things that should work. Their implementation of it in MySQL 5.7 and 8.0 is broken – Michael Buen Apr 18 '19 at 01:35
Work-around for 8.0 without disabling `only_full_group_by` https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=506cfd908367b6b069b11d8113d64655 Use `any_value`. But that's hardly a good work-around, the query should legitimately work even without `any_value` as the SELECT expression is deterministic (`RIGHT(value, 2)`) since it's a mirror copy of what's in GROUP BY. Like with all RDBMS, it should work even without `any_value`. MySQL 5.7 and 8.0 are the odd ones out. – Michael Buen Apr 18 '19 at 04:34
There's a mile between "it works" and "it conforms to the standard". You are right that in specific cases it might be irrelevant, that a selected column is actually nonaggregated, because all values are the same, but if such cases are not relaxed in the SQL standard, then one should not depend on this behavior. That is what the setting is for. The `any_value` function looks like one should never use it. It's describe differently for Microsoft SQL server and does not exist in postgresql https://stackoverflow.com/a/48425089/3340665 this makes me doubt it's a good idea to use it for this case. – Ulrich Thomas Gabor Apr 18 '19 at 08:42
@GhostGambler It's commendable that they have a setting that produces predictable results. The only problem is they didn't do a regression test of what it should prevent and what it should not prevent. This should not be prevented: `SELECT expressionThatAppearsOnGroupClauseHere GROUP BY expressionThatAppearsOnSelect`. Any sane RDBMS should not have an `any_value` functionality. I think since MySQL team can't figure out how to make their parser properly work for things that should not be prevented, that they provided an escape hatch (`any_value`) so as not to annoy the heck out of people – Michael Buen Apr 18 '19 at 09:54
Seems it is unknown if this behavior is according to the SQL standard or not: https://bugs.mysql.com/bug.php?id=90792. Seems to me like it should be easy to find out, but maybe nobody cares, because it's a rare query and there are multiple workarounds. – Ulrich Thomas Gabor Apr 18 '19 at 11:47
@GhostGambler I think this comment sums it up best: "Your argument 'All these comparisons of functions have a runtime cost and an implementation cost' doesn’t hold up. Oracle was able to implement it efficiently on much slower hardware" -- https://mysqlserverteam.com/when-only_full_group_by-wont-see-the-query-is-deterministic/#comment-11411. I feel it's just a convenient excuse for MySQL team to be pedantic to standard when it comes to that issue, when all the while all their other implementations/suggestions/work-arounds are not standard – Michael Buen Apr 18 '19 at 13:01

Michael Buen · Answer 6 · 2019-04-17T09:40:41.457

Just a bit neater code (if using MySQL 8.0). Can't guarantee the performance though

Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/1

select x.*
from 
(
    select  
        *, 
        count(*) over(partition by right(value, 6)) as unique_count
    from tbl
 ) as x
 where x.unique_count = 1

Output:

| id  | value           | unique_count |
| --- | --------------- | ------------ |
| 2   | aaaaaaaaaaaaaa  | 1            |
| 4   | aaaaaaaaaaaaaaB | 1            |
| 5   | Hello           | 1            |

UPDATE

I misunderstood OP's intent. It's the reversed. Just change the count:

select x.*
from 
(
    select  
        *, 
        count(*) over(partition by right(value, 6)) as unique_count
    from tbl
 ) as x
 where x.unique_count > 1

Output:

| id  | value       | unique_count |
| --- | ----------- | ------------ |
| 1   | abcdePuzzle | 2            |
| 3   | abcPuzzle   | 2            |

Select all records where last n characters in column are not unique

6 Answers6