
I am working on creating a CSV file comparator which can compare files that are more than 300 MB. I have created the first version, which can compare two large CSV files in which the columns can be in any order but the rows need to be in exactly the same order, like File1:-

Header1,Header2,Header3,Header4
Value1,Value2,Value3,Value4
Value5,Value6,Value7,Value8
Value9,Value10,Value11,Value12
Value13,Value14,Value15,Value16

File2:-

Header1,Header3,Header2,Header4
Value1,Value3,Value2,Value4
Value5,Value7,Value3,Value8
Value9,Value11,Value10,Value12
Value13,Value15,Value14,Value16

So what I have done here is: I read the two files together, iterate over the rows one after another, convert each row to an ordered dict, and compare the results based on the keys and values. It also has a provision to explicitly define which columns in File1 should match which columns in File2, and which columns to ignore. Running this on a system and comparing files with around 1,000,000 rows and 60 columns takes 30 minutes (a simplified sketch of this first version appears after the examples below).

I have now started working on the next version of the tool, but here lies the problem: in this case the rows and the columns will be totally unordered, like File1:-

Header1,Header2,Header3,Header4
Value1,Value2,Value3,Value4
Value5,Value6,Value7,Value8
Value9,Value10,Value11,Value12
Value13,Value14,Value15,Value16

File2:-

Header1,Header3,Header2,Header4
Value13,Value15,Value14,Value16
Value5,Value7,Value3,Value8
Value1,Value3,Value2,Value4
Value9,Value11,Value10,Value12
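
As mentioned above, here is a simplified sketch of what the first version does, where the rows are in the same order in both files and only the columns may be reordered (the file names are illustrative, and the column-mapping and ignore-list options are left out):

import csv

# First version: rows arrive in the same order, columns may be shuffled.
# csv.DictReader keys every value by its header, so column order stops mattering.
with open('File1.csv', newline='') as f1, open('File2.csv', newline='') as f2:
    reader1 = csv.DictReader(f1)
    reader2 = csv.DictReader(f2)
    for rownum, (row1, row2) in enumerate(zip(reader1, reader2), start=1):
        diffs = {col: (row1[col], row2.get(col))
                 for col in row1 if row1[col] != row2.get(col)}
        if diffs:
            print(f'Row {rownum}: {diffs}')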

Considering that there will be a primary key (suppose in the above case Header4 is the primary key for both files), what I am doing is creating a dictionary that has the Header4 value as the key and the corresponding row number in File2 as the value, like

{'Value16':1,'Value8':2,'Value4':3,'Value12':4}

So in the next step I iterate over the rows in File1, pick up the primary-key value for each row, look it up in the dictionary, and fetch the corresponding row number. Then I iterate over File2 to reach that specific row, create an ordered dict for both rows (one from File1 and one from File2), compare them, and write the results. The whole process works as expected and gives correct comparison results. But when I tried comparing files that have around 1,000,000 rows and 60 columns, it took more than 24 hours to complete. To make the process a bit faster I tried loading File2 into memory and then comparing, but it still takes more than 20 hours. Can anyone please help me or suggest a more efficient algorithm that would make the whole process substantially faster? For reference, a simplified sketch of my current approach is below.
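
Here is roughly what the current (slow) version does, reduced to the essentials (file names and the result handling are illustrative):

import csv

PRIMARY_KEY = 'Header4'  # primary key column present in both files

# Pass 1: map each primary-key value to its row number in File2
pk_to_rownum = {}
with open('File2.csv', newline='') as f2:
    for rownum, row in enumerate(csv.DictReader(f2), start=1):
        pk_to_rownum[row[PRIMARY_KEY]] = rownum

# Pass 2: for every row in File1, locate the matching row in File2 and compare
with open('File1.csv', newline='') as f1:
    for row1 in csv.DictReader(f1):
        target = pk_to_rownum.get(row1[PRIMARY_KEY])
        if target is None:
            continue  # primary key not present in File2
        # re-scan File2 until the target row number is reached (the slow part)
        with open('File2.csv', newline='') as f2:
            for rownum, row2 in enumerate(csv.DictReader(f2), start=1):
                if rownum == target:
                    diffs = {col: (row1[col], row2.get(col))
                             for col in row1 if row1[col] != row2.get(col)}
                    # write diffs to the result file here
                    break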

PS: The system I am using for testing is an i5 8th Gen with 8 GB RAM, running Windows 10. The full development is done in Python 3.7.

Arijit Das
  • One option I think is SQL Server Express: you can import all the CSVs into it and then create a join to compare the data. – Michael Seltene Jan 31 '19 at 19:44
  • @davedwards the problem statements are quite different; there the files are highly ordered in terms of rows and columns, but here it is completely different – Arijit Das Jan 31 '19 at 20:32

2 Answers


Then I iterate over File2 to reach that specific row, create an ordered dict for both rows (one from File1 and one from File2), compare them, and write the results.

This is what is eating your time. I think it is making your solution take quadratic time.

What you can do is create two dictionaries, one from each file.

Header1,Header2,Header3,Header4
Value1,Value2,Value3,Value4
Value5,Value6,Value7,Value8
Value9,Value10,Value11,Value12
Value13,Value14,Value15,Value16

So ^ becomes

{ 
Value4 : (1, Value1,Value2,Value3,Value4), 
Value8: (2, Value5,Value6,Value7,Value8), ... 
} 

Do the same for the second file as well.

Now you can iterate over one dict and look up each row's contents in the other dict by its primary-key value. The row contents are already in the dict, so you don't have to crawl to that line in the file. Since the files are ~300 MB each, the whole thing should fit in the 8 GB of memory that you have.
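
A minimal sketch of that idea, assuming Header4 is the primary key (file names are illustrative):

import csv

PRIMARY_KEY = 'Header4'

def load_by_key(path):
    """Read a CSV into a dict: primary-key value -> (row number, row as a dict)."""
    rows = {}
    with open(path, newline='') as f:
        for rownum, row in enumerate(csv.DictReader(f), start=1):
            rows[row[PRIMARY_KEY]] = (rownum, row)
    return rows

file1 = load_by_key('File1.csv')
file2 = load_by_key('File2.csv')

for key, (rownum1, row1) in file1.items():
    match = file2.get(key)
    if match is None:
        print(f'{key}: missing in File2')
        continue
    rownum2, row2 = match
    # compare column by column; neither row order nor column order matters now
    diffs = {col: (row1[col], row2.get(col))
             for col in row1 if row1[col] != row2.get(col)}
    if diffs:
        print(f'{key} (File1 row {rownum1}, File2 row {rownum2}): {diffs}')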

Hope this helps.

Pratik Deoghare

If you are concerned about bigger files taking up too much space in memory, you can break your problem into two steps: first prepare a temporary file that has all records in the same positions, matched by primary key, then simply compare File1 with the temp file instead of File2.

Do as below:

  1. Make sure both files are sorted using the primary keys.
  2. Start with File1 and check the first record's primary key. Keep iterating over File2 until you reach a matching primary key in File2, or File2's primary key is greater than File1's primary key.
  3. If the primary keys match, write File2's record to a temp file. If no match is found and File1's primary key is less than File2's primary key, write an empty row to the temp file. Then go to the next record in File1.
  4. Do steps 2 and 3 until File1 is exhausted. This way you will get File1 and a temp file with either the same primary keys on the same rows, or an empty row where there is no match.
  5. Simply compare File1 and the temp file, reading one record at a time.
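
A minimal sketch of steps 2-4, assuming both files have already been sorted by the primary-key column (Header4 here) and that file names are illustrative:

import csv

PRIMARY_KEY = 'Header4'

# Build a temp file whose rows line up with File1, matched by primary key.
# Both inputs are assumed to be sorted ascending by the primary key.
with open('File1_sorted.csv', newline='') as f1, \
     open('File2_sorted.csv', newline='') as f2, \
     open('temp.csv', 'w', newline='') as out:

    reader1 = csv.DictReader(f1)
    reader2 = csv.DictReader(f2)
    writer = csv.DictWriter(out, fieldnames=reader2.fieldnames)
    writer.writeheader()

    row2 = next(reader2, None)
    for row1 in reader1:
        key1 = row1[PRIMARY_KEY]
        # advance File2 until its key catches up with File1's key
        # (string comparison here; adjust if the keys are numeric)
        while row2 is not None and row2[PRIMARY_KEY] < key1:
            row2 = next(reader2, None)
        if row2 is not None and row2[PRIMARY_KEY] == key1:
            writer.writerow(row2)  # match found: keep File2's record
        else:
            writer.writerow({col: '' for col in reader2.fieldnames})  # no match: empty row

# Now File1_sorted.csv and temp.csv can be compared one record at a time.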