34

Basically, we have one table (original table) and it is backed up into another table (backup table); thus the two tables have exactly the same schema.

In the beginning, both tables (original table and backup table) contains exactly the same set of data. After some time for some reason, I need to verify whether dataset in the original table has changed or not.

In order to do this, I have to compare the dataset in the original table against the backup table.

Let's say the original table has the following schema:

create table LemmasMapping (
   lemma1 int,
   lemma2 int,
   index ix_lemma1 using btree (lemma1),
   index ix_lemma2 using btree (lemma2)
)

How could I achieve the dataset comparison?

Update: the table does not have a primary key. It simply stores mappings between two ids.

Panos Kalatzantonakis
  • 12,525
  • 8
  • 64
  • 85
SiLent SoNG
  • 4,270
  • 4
  • 27
  • 31

13 Answers13

45

You can just use CHECKSUM TABLE and compare the results. You can even alter the table to enable live checksums so that they are continuously available.

CHECKSUM TABLE original_table, backup_table;

It doesn't require the tables to have a primary key.

Josh Davis
  • 28,400
  • 5
  • 52
  • 67
  • 10
    Attention, even if the checksums are the same, the tables might be different. – Miloš Černilovský Jul 07 '14 at 04:49
  • 3
    Apparently a feature not present in PostgreSQL. – Faheem Mitha Jul 29 '14 at 09:31
  • this should be the accepted answer because not only its elegance but also shows very high performance. I tried myself comparing with two innoDb tables with ~200k rows, CHECKSUM took ~0.8 s while using UNION resulted in around 8 s – Va1iant Oct 18 '17 at 01:34
  • 1
    CHECKSUM may serve your purposes, in practice. But I don't like it as a solution, because you're relying on a hash. Any function mapping from many-to-one will have false matches. From the user documentation on MariaDB: "Two identical tables should always match to the same checksum value; however, also for non-identical tables there is a very slight chance that they will return the same value as the hashing algorithm is not completely collision-free." – Mark Goldfain Jun 01 '19 at 17:38
  • I get a syntax error in SQL server on Azure. Any idea why? `Failed to execute query. Error: Incorrect syntax near the keyword 'TABLE'.` – Kevin Glynn Jan 07 '21 at 20:01
  • Is it possible to checksum two tables while using a `WHERE` condition? – Stevoisiak Mar 30 '22 at 20:43
36

If you do not care about one table being less than the other, but you only care about differences in the attributes:

SELECT * FROM Table1
UNION
SELECT * FROM Table2

If you get records greater than the max number of the two tables, they don't have same data.

questionto42
  • 7,175
  • 4
  • 57
  • 90
Gopal
  • 369
  • 3
  • 2
  • Couldn't it be `select count(*) from T1 union select count(*) from T2`? – nvlass Dec 13 '12 at 18:22
  • Unfortunately, no. That's just going to take the union of the counts, so if they have the same count, that will come up with only one row. – crunkchitis Apr 05 '13 at 22:49
  • When you say *records greater than any of two tables, they don't have same data* do you mean if table a has 20 rows and table b has 20 rows, if they union produces more than 20 rows then table a and table b are not identical? – gh0st Aug 25 '15 at 16:22
  • @gh0st that is how UNION works - so if you get more than actual rows number then somewhere are changes – pbaranski Jan 21 '16 at 08:33
  • I think this is a lovely use for UNION. There are some edge cases to consider here. You may need to first look at SELECT * FROM Table1 UNION SELECT * FROM Table1 (union the table with itself) to verify that no duplicate rows exist. Likewise with Table2, Depending on your situation, you may not need to worry about this, of course. – Mark Goldfain Jun 01 '19 at 17:22
  • 1
    I got this error, "The data type `text` cannot be used as an operand to the UNION, INTERSECT or EXCEPT operators because it is not comparable." Therefore, I think this method is not good enough – KY Lu Jun 17 '20 at 05:46
21

I would write three queries.

  1. An inner join to pick up the rows where the primary key exists in both tables, but there is a difference in the value of one or more of the other columns. This would pick up changed rows in original.

  2. A left outer join to pick up the rows that are in the original tables, but not in the backup table (i.e. a row in original has a primary key that does not exist in backup). This would return rows inserted into the original.

  3. A right outer join to pick up the rows in backup which no longer exist in the original. This would return rows that have been deleted from the original.

You could union the three queries together to return a single result set. If you did this you would need to add a column to indicate what type of row it is (updated, inserted or deleted).

With a bit of effort, you might be able to do this in one query using a full outer join. Be careful with outer joins, as they behave differently in different SQL engines. Predicates put in the where clause, instead of the join clause can sometimes turn your outer join into an inner join.

Panos Kalatzantonakis
  • 12,525
  • 8
  • 64
  • 85
Mike Thompson
  • 6,708
  • 3
  • 32
  • 39
  • 3
    "*Be careful with outer joins, as they behave differently in different SQL engines*" - care to explain that? I have not seen any DBMS that does things differently when an outer join is used. "*Predicates put in the where clause, instead of the join clause can sometimes turn your outer join into an inner join*" - yes of course they will if those predicates use columns from the "outer table". When comparing a null value with a constant value the rows with nulls will be discarded. –  Jul 30 '14 at 13:26
5

Most DBMSs support relational algebra's DIFFERENCE operation. In Snowflake, it is represented as MINUS. So the following code would work. The first MINUS makes sure every element in Table1 is also in Table2. The second MINUS makes sure every element in Table2 is also in Table1. If the query returns a zero, mathematically, the two tables are exactly the same.

select count(*) from 
(
 (
  SELECT * FROM Table1
  MINUS
  SELECT * FROM Table2
 )
 UNION ALL
 (
  SELECT * FROM Table2
  MINUS
  SELECT * FROM Table1
 )
)
Jie
  • 1,107
  • 1
  • 14
  • 18
3

However, You can also use the 'Except' keyword in SQL Server.

SELECT Col1, Col2, Col3, Col4, Col5 FROM Table_1
EXCEPT
SELECT Col1, Col2, Col3, Col4, Col5 FROM Table_2

If the query returns any rows: They are not Identical.

If the query returns no rows: They are Identical.

zulqadar idrishi
  • 105
  • 1
  • 11
  • if Table_2 has all Table_1 rows and one unique row, your query will return 0 rows. This looks like an invalid approach – aderesh Oct 11 '21 at 16:53
1
select count(*) 
from lemmas as original_table 
      full join backup_table using (lemma_id)
where backup_table.lemma_id is null
      or original_table.lemma_id is null
      or original_table.lemma != backup_table.lemma

The full join / check for null should cover additions or deletions as well as changes.

  • backup.id is null = addition
  • original.id is null = deletion
  • neither null = change
APC
  • 144,005
  • 19
  • 170
  • 281
Kyle Butt
  • 9,340
  • 3
  • 22
  • 15
  • Good idea, except for the fact that MySQL does not support full outer joins. –  Jul 30 '14 at 13:20
1

Try the following to compare two tables:

SELECT 'different' FROM DUAL WHERE EXISTS(
    SELECT * FROM (
        SELECT /*DISTINCT*/ +1 AS chk,a.c1,a.c2,a.c3 FROM a
        UNION ALL
        SELECT /*DISTINCT*/ +1 AS chk,b.c1,b.c2,b.c3 FROM b
    ) c
    GROUP BY c1,c2,c3
    HAVING SUM(chk)<>2
)
UNION SELECT 'equal' FROM DUAL
LIMIT 1;
serenesat
  • 4,611
  • 10
  • 37
  • 53
Ulghar
  • 11
  • 1
1

Please try the following method for determining if two tables are exactly the same, when there is no primary key of any kind and there are no duplicate rows within a table, using the below logic:

Step 1 - Test for Duplicate Rows on TABLEA

If SELECT DISTINCT * FROM TABLEA

has the same row count as

SELECT * FROM TABLEA

then go to the next step, otherwise you can't use this method...

Step 2 - Test for Duplicate Rows on TABLEB

If SELECT DISTINCT * FROM TABLEB

has the same row count as

SELECT * FROM TABLEB

then go to the next step, else you can't use this method...

Step 3 - INNER JOIN TABLEA to TABLEB on every column

If the row count of the below query has the same row count as the row counts from Steps 1 and 2, then the tables are the same:

SELECT
*

FROM
TABLEA

INNER JOIN TABLEA ON
TABLEA.column1 = TABLEB.column1
AND TABLEA.column2 = TABLEB.column2
AND TABLEA.column3 = TABLEB.column3 
--etc...for every column

Note that this method doesn't necessarily test for different data types, and probably won't work on non-joinable data types (like VARBINARY)

Feedback welcome!

Shahzad Barkati
  • 2,532
  • 6
  • 25
  • 33
Hank
  • 31
  • 5
1

1: First get count for both the tables C1 and C2. C1 and C2 should be equal. C1 and C2 can be obtained from the following query

 select count(*) from table1

if C1 and C2 are not equal, then the tables are not identical.

2: Find distinct count for both the tables DC1 and DC2. DC1 and DC2 should be equal. Number of distinct records can be found using the following query:

select count(*) from (select distinct * from table1)

if DC1 and DC2 are not equal, the tables are not identical.

3: Now get the number of records obtained by performing a union on the 2 tables. Let it be U. Use the following query to get the number of records in a union of 2 tables:

 SELECT count (*)
 FROM 
    (SELECT *
    FROM table1
    UNION
    SELECT *
    FROM table2)

You can say that the data in the 2 tables is identical if distinct count for the 2 tables is equal to the number of records obtained by performing union of the 2 tables. ie DC1 = U and DC2 = U

pkhabya
  • 333
  • 4
  • 15
0

For the lazier or more SQL-averse developer working with MS SQL Server, I would recommend SQL Delta (www.sqldelta.com) for this and any other database-diff type work. It has a great GUI, is quick and accurate and can diff all database objects, generate and run the necessary change scripts, synchronise entire databases. Its the next best thing to a DBA ;-)

I think there is a similar tool available from RedGate called SQL Compare. I believe some editions of the latest version of Visual Studio (2010) also include a very similar tool.

immutabl
  • 6,857
  • 13
  • 45
  • 76
  • UPDATE - I've been using the Data compare tools in VS2010 and imho they're not fit to like the boots of either of the products above... – immutabl May 11 '10 at 13:35
0

Easy workaround for the quick gain:

SELECT sum(col1), sum(col2) FROM table1
UNION
SELECT sum(col1), sum(col2) FROM table2

Output is like:

sum(col1) sum(col2)
11111345678 123456789101234
11111123456 123456789101234

and then check for differences by testing around:

SELECT count(col1), sum(col2) FROM table1 WHERE col1 = 0
UNION
SELECT count(col1), sum(col2) FROM table1 WHERE col1 = 0
count(col1) sum(col2)
1234 123456789
345 6543210

or by group+sum() and then subtract one table's aggregation from the other table's aggregation and order by the difference.

More details, on the PostgreSQL case which is the same for this example, see Checking whether two tables have identical content in PostgreSQL.

questionto42
  • 7,175
  • 4
  • 57
  • 90
0

Recently I did it in this way:

SELECT STRCMP(
(SELECT GROUP_CONCAT(HEX(col1),HEX(col2),HEX(col3),....) FROM table1),
(SELECT GROUP_CONCAT(HEX(col1),HEX(col2),HEX(col3),....) FROM table2)
);

This way allow also compare just one attribute (is it same) from two tables. If the data is same but different record order it will result as 'not same', then further check can be done by sort then concat.

0
  1. Ensure that both tables have the same structure (i.e., same columns and data types).
  2. Check if both tables have the same number of rows. If they do not, then the tables cannot be identical.
  3. Use a query or a tool to compare the data in each row of the tables.
  4. If the tables are small, you can manually compare the rows one by one. Otherwise, you can use a SQL query to compare the tables. One way to do this is to use the EXCEPT operator to compare the rows in both tables. For example:
SELECT * FROM Table1
EXCEPT
SELECT * FROM Table2

This will return any rows that are in Table1 but not in Table2.

  1. Also, you can use the UNION operator to compare both tables. For example:
SELECT * FROM Table1
UNION
SELECT * FROM Table2

This will return a combined set of rows from both tables with duplicates removed. If the number of rows returned by this query is equal to the number of rows in each table, then the tables contain the same data.

By comparing the results of these queries, you can determine if both tables have exactly the same data.

Happy learning!

F. Müller
  • 3,969
  • 8
  • 38
  • 49