1

I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.

I want to compare the data consistency between SQL Server and Postgres databases after the data migration.

All I can think of now is using Python Pandas: loading the tables into data frames from both SQL Server and Postgres and comparing the data frames.
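
Roughly what I have in mind (just a sketch; the connection strings and table name are placeholders):

import pandas as pd
import pyodbc
import psycopg2

# Placeholder connections to both servers
mssql = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=host;DATABASE=mydb;UID=user;PWD=secret")
pg = psycopg2.connect("host=host dbname=mydb user=user password=secret")

query = "SELECT * FROM my_table ORDER BY id"  # placeholder table and key

df_mssql = pd.read_sql(query, mssql)
df_pg = pd.read_sql(query, pg)

# True only if shapes, values and row order all match;
# dtype differences between the two drivers can make this False even for matching data
print(df_mssql.equals(df_pg))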

But the data is around 6 GB, which takes a long time to load into data frames, and the databases are hosted on a server that is not local to where I'm running the Python script. Is there any way to efficiently compare data consistency between SQL Server and Postgres?

marc_s
Underoos
  • You don't need Pandas to copy data; use SQL Server's SSIS to copy data from one database to another. It supports a lot of data sources. At the very least you can work with Postgres using the ODBC or OLEDB drivers. – Panagiotis Kanavos Nov 04 '19 at 11:36
  • SQL Server can also connect to other databases as [linked servers](https://learn.microsoft.com/en-us/sql/relational-databases/linked-servers/linked-servers-database-engine?view=sql-server-ver15). *Comparing data* though has to be done in one of the two databases. Even with a JOIN, the *remote* server will filter its data and send it to the local server for comparison – Panagiotis Kanavos Nov 04 '19 at 11:37
  • BTW what do you mean data consistency? It's not just "identical data", especially when the data types don't match, or the source or target database is missing features. Never mind Unicode and collations. – Panagiotis Kanavos Nov 04 '19 at 11:41
  • A quick&dirty way would be to bulk export the data from both databases to CSV files per table with the same format and compare them. Simple size and checksum checks are enough to identify tables with differences. After that you can proceed to line-by-line comparisons using shell tools. – Panagiotis Kanavos Nov 04 '19 at 11:43
  • Exporting and comparing would be quite tedious as there is so much data to compare. My colleagues have already migrated the data from `SQL Server` to `Postgres DB` and I'm left with the task of verifying the correctness of the data in the tables. Is there any way to compare hashes of tables on `SQLServer` and `Postgres`? – Underoos Nov 04 '19 at 11:49
  • How is this different from *exporting* the data to *Pandas* in a slow way (as query results) then trying to compare rows using a slow API? In any case you still haven't explained what `correctness` means for *you*. In a database this means that all constraints are met, all rows respect their foreign key constraints. It does *not* mean comparing original and target data. – Panagiotis Kanavos Nov 04 '19 at 11:54
  • As for hashes of tables, that's exactly why I mentioned text files. Bulk exporting locally is orders of magnitude faster than querying, transporting the data over the wire and loading dataframes. File size checks will quickly show guaranteed mismatches as a first step, while hash/checksum checks will show which tables have differences even though they have the same size. Simple scripts that calculate hashes per row can also be used to quickly find mismatched rows. – Panagiotis Kanavos Nov 04 '19 at 11:58
  • A different option, is to write a program that pulls data from *both* sources in the same order and compares them row by row, by key and possibly values, **in a streaming manner** - that means no dataframes. This way you can detect mismatches as soon as the rows are read. If the keys of the current rows are different, you know you have missing rows. If they are same, you can start comparing values, either by direct comparison or hashes. The problem with this technique is that the data streams *have* to have the same order. Another problem is handling multiple missing rows – Panagiotis Kanavos Nov 04 '19 at 12:02
  • To handle missing rows and proceed you'd have to look ahead a few rows until you find the next match, in either source. That's similar to moved block detection in file comparison tools. – Panagiotis Kanavos Nov 04 '19 at 12:03
  • 2
    Another quick check is to calculate row counts and key statistics (at least min max) for PKs for all tables in both databases. If you find tables with different counts, you know where to look for differences. – Panagiotis Kanavos Nov 04 '19 at 12:04
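
A minimal sketch of that quick check in Python (assuming pyodbc for SQL Server and psycopg2 for Postgres; the connection strings and the table/key list are placeholders, and integer keys are assumed):

import pyodbc
import psycopg2

# Placeholder connections and a placeholder list of (table, primary key) pairs
mssql = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=host;DATABASE=mydb;UID=user;PWD=secret")
pg = psycopg2.connect("host=host dbname=mydb user=user password=secret")
tables = [("customers", "customer_id"), ("orders", "order_id")]

def stats(conn, table, key):
    # One scan per table: row count plus min/max of the primary key
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*), MIN({key}), MAX({key}) FROM {table}")
    return tuple(cur.fetchone())

for table, key in tables:
    s_mssql, s_pg = stats(mssql, table, key), stats(pg, table, key)
    if s_mssql != s_pg:
        print(f"{table}: SQL Server {s_mssql} != Postgres {s_pg}")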

2 Answers

1

Yes, you can order the data by primary key and then write the data to a JSON or XML file.

Then you can run diff over the two files.

You can also run this chunked by primary key; that way you don't have to work with one huge file.

Log any diff that doesn't show as equal. If it doesn't matter what the difference is, you could also just run MD5/SHA1 on the two file chunks: if the hashes match, there is no difference; if they don't, there is.
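
For the hash variant, a minimal sketch in Python (the chunk file names are placeholders; it assumes both sides were exported with identical formatting and ordering):

import hashlib

def file_md5(path):
    # Hash the file in blocks so large chunks don't have to fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

if file_md5("mssql_orders_chunk_001.json") != file_md5("pg_orders_chunk_001.json"):
    print("chunk 001 of orders differs")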

Speaking from experience with NHibernate, what you need to watch out for is:

  • bit fields
  • text, ntext, varchar(MAX), nvarchar(MAX) fields
    (they map to varchar with no length in Postgres, by the way - encoded as UTF-8)
  • varbinary, varbinary(MAX), image (bytea vs. LOB)
  • xml
  • that every primary key's serial/identity generator is reset after you have inserted all the data in PostgreSQL (see the sketch below)

Another thing to watch out for is which time zone CURRENT_TIMESTAMP uses.
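
For the sequence reset mentioned in the last list item, a minimal sketch using psycopg2 and the usual setval/pg_get_serial_sequence idiom (connection string, table and column names are placeholders):

import psycopg2

pg = psycopg2.connect("host=host dbname=mydb user=user password=secret")  # placeholder
cur = pg.cursor()

# Point the serial sequence of my_table.id at the current maximum,
# so the next INSERT does not collide with the migrated rows
cur.execute("""
    SELECT setval(pg_get_serial_sequence('my_table', 'id'),
                  COALESCE((SELECT MAX(id) FROM my_table), 1))
""")
pg.commit()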

Note:
I'd actually run System.Data.DataRowComparer directly, without writing data to a file:

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

static void Main(string[] args)
{
    // Load the same table from SQL Server and from Postgres
    // (placeholder methods - fill in with your own data access code).
    DataTable dt1 = LoadTableFromSqlServer();
    DataTable dt2 = LoadTableFromPostgres();

    IEnumerable<DataRow> idr1 = dt1.Select();
    IEnumerable<DataRow> idr2 = dt2.Select();

    // Compare rows by column values, not by reference,
    // or plug in your own comparer if you need custom rules:
    // IEnumerable<DataRow> results = idr1.Except(idr2, new MyDataRowComparer());
    IEnumerable<DataRow> results = idr1.Except(idr2, DataRowComparer.Default);
}

Then you write all non-matching DataRows into a log file, one directory per table (if there are differences).

I don't know what Python uses in place of System.Data.DataRowComparer, though.
Since this would be a one-time task, you could also opt not to do it in Python and use C# instead (see the code sample above).

Also, if you have large tables, you could use a DataReader with sequential access to do the comparison. But if the simpler DataTable approach cuts it, it reduces the required work considerably.
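
In Python, a comparable streaming comparison (no data frames, roughly constant memory) could look like the sketch below; it assumes pyodbc and psycopg2, the table/column names are placeholders, and both queries must return rows in exactly the same order:

from itertools import zip_longest
import pyodbc
import psycopg2

mssql = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=host;DATABASE=mydb;UID=user;PWD=secret")
pg = psycopg2.connect("host=host dbname=mydb user=user password=secret")

query = "SELECT id, name, amount FROM orders ORDER BY id"  # placeholder table/columns

mssql_cur = mssql.cursor()
mssql_cur.execute(query)

pg_cur = pg.cursor(name="orders_stream")  # named cursor = server-side, streams rows
pg_cur.execute(query)

mismatches = 0
for row_mssql, row_pg in zip_longest(mssql_cur, pg_cur):
    # Normalize driver-specific row types before comparing; real data may also need
    # normalization of datetime precision, decimal scale, trailing spaces, etc.
    if row_mssql is None or row_pg is None or tuple(row_mssql) != tuple(row_pg):
        mismatches += 1
        print("mismatch:", row_mssql, row_pg)

print(mismatches, "mismatched rows")
# NB: a single missing row makes every later pair mismatch; resyncing on the key
# (the look-ahead mentioned in the comments) would be needed to handle that cleanly.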

Stefan Steiger
  • A CSV file would probably be better than either JSON or XML, unless each JSON/XML element is serialized as a single row fragment. There's no need for fidelity, just something that quickly detects differences, so there's no need to parse the line. Moved block detection in a diff tool can also handle missing rows (up to a point). Finally, row-per-line allows easy parallelization for very big tables, by splitting the table across lines – Panagiotis Kanavos Nov 04 '19 at 12:39
  • @Panagiotis Kanavos: CSV doesn't guarantee all rows are on a single line. If a text-field contains a line-break, then you shift everything by 1 line. – Stefan Steiger Nov 04 '19 at 12:42
  • JSON and XML will also have trouble with that, unless the data is encoded, which will slow things down. This can be solved by using an unusual character as field and row separator, e.g. ¦, ¤ or ¶ – Panagiotis Kanavos Nov 04 '19 at 12:43
  • @Panagiotis Kanavos: In general, the file format doesn't matter. What matters is that your file-format reader can run `foreach table, foreach row, foreach column: table1.row[column] == table2.row[column]` efficiently and RELIABLY. Actually, it might be fastest and most reliable to do it in raw ADO.NET directly on both servers. – Stefan Steiger Nov 04 '19 at 12:46
  • And with minimal parsing. Which is why I mentioned XML/JSON fragments instead of well-formed documents. Instead of parsing an entire million-row JSON/XML document, the diff tool only needs to parse a single line from each source – Panagiotis Kanavos Nov 04 '19 at 12:49
  • @Panagiotis Kanavos: I'd just do it without any files, that would seem to be the most reliable way. – Stefan Steiger Nov 04 '19 at 12:50
0

Have you considered making your SQL Server data visible within your Postgres with a Foreign Data Wrapper (FDW)?

https://github.com/tds-fdw/tds_fdw

I haven't used this FDW tool but, overall, the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively. Depending on your setup, you would also have to check whether performance is adequate.
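
If the foreign tables are in place, the comparison can then run entirely inside Postgres. A hypothetical sketch (the local/foreign table names are placeholders, and it assumes the column lists line up); note that Postgres has to pull the remote rows over the wire to evaluate this, so performance needs checking on big tables:

import psycopg2

pg = psycopg2.connect("host=host dbname=mydb user=user password=secret")  # placeholder
cur = pg.cursor()

# Pairs of (local table, foreign table exposed through tds_fdw) - placeholder names
pairs = [("orders", "mssql_orders"), ("customers", "mssql_customers")]

for local, foreign in pairs:
    # Rows present on one side but not the other, in both directions
    cur.execute(f"SELECT COUNT(*) FROM (SELECT * FROM {local} EXCEPT SELECT * FROM {foreign}) d")
    only_local = cur.fetchone()[0]
    cur.execute(f"SELECT COUNT(*) FROM (SELECT * FROM {foreign} EXCEPT SELECT * FROM {local}) d")
    only_remote = cur.fetchone()[0]
    print(local, "- only in Postgres:", only_local, "- only in SQL Server:", only_remote)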

Please report back!

Morris de Oryx