
I am looking for an efficient way of comparing data between database (MySQL) tables and CSVs. My preference is to use dataframes to compare the data and find any "missing updates" in the database.

The CSVs range from 40 MB to 3.5 GB, the tables can have a few hundred (200-300) columns, and the current max row count on the biggest table is 600 million. My biggest concern and limitation is memory utilization. I have already done this comparison on just the key columns, from both the database and the CSV, to identify missing records; that worked, but maxed out memory. To identify any updates that were not applied in the database I would need a full-row compare, which would be quite expensive memory-wise.
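Here is a rough sketch of the chunked, hash-based compare I'm considering (pandas + SQLAlchemy; the connection string, table name, and file name are placeholders, and it assumes a single integer key column `id` with identical columns and formatting on both sides):

    import hashlib

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:pass@host/db")  # placeholder DSN

    def row_digest(df: pd.DataFrame) -> pd.Series:
        """Collapse every non-key column of each row into one short digest."""
        joined = df.drop(columns=["id"]).astype(str).apply("|".join, axis=1)
        return joined.map(lambda s: hashlib.md5(s.encode()).hexdigest())

    missing_ids, stale_ids = [], []
    for chunk in pd.read_csv("extract.csv", chunksize=100_000):
        csv_side = pd.DataFrame({"id": chunk["id"], "h": row_digest(chunk)})

        # Pull only the rows matching this chunk's keys, never the whole table.
        id_list = ",".join(map(str, chunk["id"].tolist()))
        db_chunk = pd.read_sql(
            f"SELECT * FROM big_table WHERE id IN ({id_list})", engine
        )
        db_side = pd.DataFrame({"id": db_chunk["id"], "h": row_digest(db_chunk)})

        merged = csv_side.merge(db_side, on="id", how="left",
                                suffixes=("_csv", "_db"))
        missing_ids.append(merged.loc[merged["h_db"].isna(), "id"])      # not in DB
        stale_ids.append(merged.loc[merged["h_db"].notna() &
                                    (merged["h_csv"] != merged["h_db"]),
                                    "id"])                               # row differs

    missing_ids = pd.concat(missing_ids)
    stale_ids = pd.concat(stale_ids)

Only the key and one digest per row stay in memory at a time, but formatting differences (dates, floats, NULL vs empty string) would have to be normalized before hashing or every row would look changed.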

Is there a smarter way you would recommend, or have you done something similar with a custom script or a tool?

Any kind of guidance would be highly appreciated :)

smoshah
  • Why not load the CSV into your database and compare it there? You won't have memory issues there. A table of 3.5 GB and another one of 300 columns x 600 million rows is too big for memory on an average laptop. – Erfan Mar 03 '20 at 22:36
  • There is just one instance of the database, which is the production system with live data being streamed in. Any additional load can cause deadlocks. – smoshah Mar 03 '20 at 23:51
  • @smoshah Can you push the records into a temp table and do the processing in MySQL temp tables? – Santhosh Mar 04 '20 at 00:25
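
For reference, a minimal sketch of the staging-table route the comments suggest, done from Python so it stays in one script (assumes `LOAD DATA LOCAL INFILE` is enabled on the server; `big_table`, `staging_csv`, the file path, and the DSN are placeholders):

    from sqlalchemy import create_engine, text

    # local_infile must be allowed on both the client and the MySQL server.
    engine = create_engine("mysql+pymysql://user:pass@host/db",
                           connect_args={"local_infile": True})

    with engine.begin() as conn:
        # Stage the CSV in its own table so the production table is only read.
        conn.execute(text("CREATE TABLE IF NOT EXISTS staging_csv LIKE big_table"))
        conn.execute(text("""
            LOAD DATA LOCAL INFILE '/path/to/extract.csv'
            INTO TABLE staging_csv
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
            IGNORE 1 LINES
        """))

        # Keys present in the CSV but missing from the production table.
        missing = conn.execute(text("""
            SELECT s.id
            FROM staging_csv s
            LEFT JOIN big_table b ON b.id = s.id
            WHERE b.id IS NULL
        """)).fetchall()

The join for a full-row diff would compare every column (or a precomputed per-row hash) instead of just the key, and the staging table could live in a separate schema or replica to avoid loading the production instance.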

0 Answers