
I am working on a feature that will allow users to upload two CSV files, write rules to compare the rows, and output the result to a file.

Both files can have any number of columns, and the column names are not fixed either.

Currently, I read the files into two separate in-memory arrays and compare the rows based on the condition given in each rule.

This works for smaller files, but for large ones the comparison takes a lot of time and memory.

Is there a better approach where a DB can be used to store and query this schema-less data?
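To illustrate what I have in mind, here is a minimal sketch assuming Python and SQLite; the file names (file1.csv, file2.csv) and the compare.db path are placeholders, and the input is assumed to be comma-separated:

```python
import csv
import sqlite3

def load_csv(conn, path, table):
    # Build the table from the CSV header, so the schema is not fixed
    # up front; every value is stored as TEXT and cast inside queries.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        # executemany consumes the reader lazily, so the whole file
        # never sits in memory at once.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()

conn = sqlite3.connect("compare.db")  # on-disk DB instead of in-memory arrays
load_csv(conn, "file1.csv", "file1")
load_csv(conn, "file2.csv", "file2")
```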

Example Data:

File1
type id  date       amount
A    1   12/10/2005 500
B    2   12/10/2005 500

File2
type id  date       amount
A    1   12/10/2005 500
B    2   12/10/2005 500
A    1   12/10/2005 500

Rule1  File1.type == File2.type && File1.amount == File2.amount

Rule2  File1.id == GroupBy(File2.id) && File1.amount == File2.TotalAmount

The match condition will be: Rule1 OR Rule2
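Assuming the data is loaded into SQLite as in the sketch above, both rules map to plain SQL, and "Rule1 OR Rule2" becomes a UNION of the two result sets. The column names (type, id, amount) come from the example data:

```python
# Rule1: direct match between File1 and File2 on type and amount.
rule1 = """
    SELECT f1.rowid FROM file1 AS f1
    JOIN file2 AS f2
      ON f1.type = f2.type
     AND CAST(f1.amount AS REAL) = CAST(f2.amount AS REAL)
"""

# Rule2: group File2 by id and compare File1.amount with the
# per-id total (File2.TotalAmount in the rule above).
rule2 = """
    SELECT f1.rowid FROM file1 AS f1
    JOIN (SELECT id, SUM(CAST(amount AS REAL)) AS total_amount
          FROM file2 GROUP BY id) AS g
      ON f1.id = g.id
     AND CAST(f1.amount AS REAL) = g.total_amount
"""

# "Rule1 OR Rule2": the UNION keeps every file1 row matched by either rule.
matched_rowids = [r[0] for r in conn.execute(rule1 + " UNION " + rule2)]
```

With indexes on the join columns (e.g. `CREATE INDEX idx_f2_type ON file2(type)`), the joins stay fast, and SQLite does the work on disk rather than in RAM.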

Smogger
  • What type of data is it? Can it be sorted or grouped by some criteria? – LittlePanic404 May 06 '22 at 08:48
  • Can you include the code you're using to read the CSV files? – isaactfa May 06 '22 at 08:51
  • @LittlePanic404 Yes, the data can be sorted by date. The dataframe is simple, with a few columns like type, refId, date, amount. – Smogger May 06 '22 at 08:53
  • Then you could try splitting the array into subarrays to make it faster. You could also use the pandas module; it handles big data and can read CSV files directly. I hope my comment helped you. – LittlePanic404 May 06 '22 at 11:22
  • Reading from CSV is not the concern. Even if I split into subarrays, everything still ends up in memory and hence OOM. (See the chunked-loading sketch after this thread.) – Smogger May 06 '22 at 13:28
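For the memory concern raised in the comments, here is a chunked variant of the loader, assuming pandas; the chunk size is arbitrary and the file names are placeholders:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("compare.db")

# Stream each CSV into SQLite in fixed-size chunks, so memory use is
# bounded by the chunk size rather than the file size.
for path, table in [("file1.csv", "file1"), ("file2.csv", "file2")]:
    for chunk in pd.read_csv(path, chunksize=100_000):
        chunk.to_sql(table, conn, if_exists="append", index=False)
```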

0 Answers