I am working on a CSV file comparator that can compare files larger than 300 MB. I have created the first version, which can compare two large CSV files in which the columns can be in any order but the rows need to be in exactly the same order, like File1:-
Header1,Header2,Header3,Header4
Value1,Value2,Value3,Value4
Value5,Value6,Value7,Value8
Value9,Value10,Value11,Value12
Value13,Value14,Value15,Value16
File2:-
Header1,Header3,Header2,Header4
Value1,Value3,Value2,Value4
Value5,Value7,Value3,Value8
Value9,Value11,Value10,Value12
Value13,Value15,Value14,Value16
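In code, the first version's comparison amounts to roughly the following (a minimal sketch, with function and parameter names of my own choosing; the real tool additionally supports explicit column-to-column mappings):

```python
import csv

def compare_ordered(file1_path, file2_path, ignore=()):
    """Compare two CSVs whose rows are in the same order but whose
    columns may be ordered differently; return the mismatching rows."""
    mismatches = []
    with open(file1_path, newline='') as f1, open(file2_path, newline='') as f2:
        reader1, reader2 = csv.reader(f1), csv.reader(f2)
        header1, header2 = next(reader1), next(reader2)
        for line_no, (row1, row2) in enumerate(zip(reader1, reader2), start=1):
            # Keying each row by its header makes the column order irrelevant.
            d1 = {h: v for h, v in zip(header1, row1) if h not in ignore}
            d2 = {h: v for h, v in zip(header2, row2) if h not in ignore}
            if d1 != d2:
                mismatches.append((line_no, d1, d2))
    return mismatches
```

On the two sample files above this reports row 2 as a mismatch (Header2 is Value6 in File1 but Value3 in File2), and reports no mismatch when Header2 is ignored.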
So what I have done here is read the two files together, iterating over the rows one after another, converting each row to an OrderedDict and comparing the two rows key by key. There is also a provision to explicitly define which columns in File1 to match against which columns in File2, and to ignore the columns that are not needed. Running this on files with around 1,000,000 rows and 60 columns takes about 30 minutes.

I have now started working on the next version of the tool, but here lies the problem: this time the rows and the columns will both be totally unordered, like File1:-
Header1,Header2,Header3,Header4
Value1,Value2,Value3,Value4
Value5,Value6,Value7,Value8
Value9,Value10,Value11,Value12
Value13,Value14,Value15,Value16
File2:-
Header1,Header3,Header2,Header4
Value13,Value15,Value14,Value16
Value5,Value7,Value3,Value8
Value1,Value3,Value2,Value4
Value9,Value11,Value10,Value12
Considering that there will be a primary key, suppose in the above case that Header4 is the primary key for both files. What I am doing is creating a dictionary that maps each Header4 value to its row number in File2, like
{'Value16':1,'Value8':2,'Value4':3,'Value12':4}
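Building that index can be sketched like this (a minimal version; the function name is mine):

```python
import csv

def build_primary_key_index(path, primary_key):
    """Map each row's primary-key value to its 1-based row number in the file."""
    index = {}
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        key_col = header.index(primary_key)  # locate the primary-key column
        for row_number, row in enumerate(reader, start=1):
            index[row[key_col]] = row_number
    return index
```

For the File2 shown above, `build_primary_key_index(path, 'Header4')` returns exactly the dictionary shown.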
So in the next step I iterate over the rows in File1, pick up the primary-key value of each row, and look it up in the dictionary to fetch the corresponding row number in File2. I then iterate over File2 to reach that specific row, create an OrderedDict for each of the two rows (one from File1, one from File2), compare them, and write the results. The whole process works as expected and gives correct comparison results.

The problem is performance: when I tried comparing files with around 1,000,000 rows and 60 columns, it took more than 24 hours to complete. To make the process a bit faster I tried loading File2 into memory before comparing, but it still takes more than 20 hours. Can anyone please help me, or suggest a more efficient algorithm that would make the whole process substantially faster?
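For clarity, the slow loop as described boils down to this sketch (names are mine). Note that it scans File2 from the top once for every row of File1 to reach the matched row:

```python
import csv
import itertools

def compare_unordered(file1_path, file2_path, primary_key, index):
    """Compare File1 rows against File2 rows matched via the primary-key index."""
    mismatches = []
    with open(file1_path, newline='') as f1:
        reader1 = csv.reader(f1)
        header1 = next(reader1)
        key_col = header1.index(primary_key)
        for row1 in reader1:
            row_number = index.get(row1[key_col])
            if row_number is None:
                # Primary key present in File1 but missing from File2.
                mismatches.append((row1[key_col], None, None))
                continue
            # Re-read File2 up to the matched row; this pass is repeated
            # for every single row of File1.
            with open(file2_path, newline='') as f2:
                reader2 = csv.reader(f2)
                header2 = next(reader2)
                row2 = next(itertools.islice(reader2, row_number - 1, None))
            d1 = dict(zip(header1, row1))
            d2 = dict(zip(header2, row2))
            if d1 != d2:
                mismatches.append((row1[key_col], d1, d2))
    return mismatches
```

On the sample files above this flags the row with primary key Value8 (Header2 differs), matching what the first version reports for the same data.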
PS:- The system I am using for testing is an i5 8th Gen with 8 GB RAM and Windows 10. The whole development is done in Python 3.7.