I am looking for an efficient way of comparing data between MySQL database tables and CSV files. My preference is to use dataframes to compare the data and find any "missing updates" in the database.
CSVs range from 40 MB to 3.5 GB, tables can have up to a few hundred (200-300) columns, and the current max row count on the biggest table is 600 million. My biggest concern and limitation is memory utilization. I have already done this comparison on just the key columns from the database and the CSV to identify missing records, which worked fine but maxed out memory. To identify any updates that were not applied to the database, I would need to do a full-row compare, which would be quite expensive memory-wise.
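For context, this is a rough sketch of the chunked, hash-based direction I am considering, so full rows never stay in memory, only (key, digest) pairs. The key column `id`, the file name `extract.csv`, the table name `big_table`, and the connection string are all placeholders, not my actual schema:

```python
import hashlib

import pandas as pd
from sqlalchemy import create_engine

KEY = "id"  # placeholder; real tables may have composite keys
CHUNK_ROWS = 500_000

def row_digests(df: pd.DataFrame) -> pd.Series:
    """Collapse each full row to a short hash so only (key, digest)
    pairs are kept, not the 200-300 original columns."""
    # Fix the column order and stringify values so both sides hash identically.
    cols = sorted(c for c in df.columns if c != KEY)
    joined = df[cols].astype(str).agg("|".join, axis=1)
    return joined.map(lambda s: hashlib.md5(s.encode()).hexdigest())

def digest_map(chunks) -> dict:
    """Build {key: digest} from an iterator of DataFrame chunks."""
    out = {}
    for chunk in chunks:
        out.update(zip(chunk[KEY], row_digests(chunk)))
    return out

# Both sources are read in chunks to bound peak memory per chunk.
csv_digests = digest_map(pd.read_csv("extract.csv", chunksize=CHUNK_ROWS))

engine = create_engine("mysql+pymysql://user:pass@host/db")
db_digests = digest_map(
    pd.read_sql("SELECT * FROM big_table", engine, chunksize=CHUNK_ROWS)
)

# Keys present in the CSV but not the database = missing records;
# matching keys with differing digests = updates not applied.
missing = csv_digests.keys() - db_digests.keys()
stale = {k for k in csv_digests.keys() & db_digests.keys()
         if csv_digests[k] != db_digests[k]}
print(f"{len(missing)} missing rows, {len(stale)} rows with unapplied updates")
```

Even the digest maps would be large at 600 million rows, so in practice I would probably have to partition on key ranges and compare one partition at a time, but this shows the general shape of what I have in mind.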
Is there a smarter approach you would recommend, or have you done something similar with a custom script or a tool?
Any kind of guidance would be highly appreciated :)