I am new to Python and am trying to compare two large CSV files (300 million rows and 50 columns). How can I do this in pandas, if pandas is a good option for it? The input and expected output are given below.

file 1:

key,field1,field2,field3
001,belgium,1000,123.56
002,usa,200,345.65
003,canada,3000,675.00

file 2:

key,field1,field2,field3
001,belgium,500,0
002,usa,200,345.65
004,Brazil,2500,458.00

Output (with comparison indicators):
(S - same value, C - value changed, O - value changed from nonzero to zero, D - record deleted in new file, N - record newly added in new file)

Output expected:

key,field1,field2,field3
001,S,C,O
002,S,S,S
003,D,D,D
004,N,N,N
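
To make the rules above concrete, here is a minimal pandas sketch of the comparison on the sample data. The function name `compare_frames` is hypothetical, and the sketch assumes that keys are unique within each file, that both files share the same columns, and that everything fits in memory. That last assumption will not hold for 300 million rows, so the real files would have to be processed in pieces (for example, partitioned by key) or with an out-of-core tool.

```python
import io

import pandas as pd


def compare_frames(old, new, key="key"):
    """Return one indicator (S, C, O, D or N) per key and field.

    Assumes unique keys and identical column names in both frames.
    """
    old = old.set_index(key)
    new = new.set_index(key)
    cols = list(old.columns)

    # Align both frames on the sorted union of keys; missing rows become NaN.
    keys = old.index.union(new.index)
    a = old.reindex(keys)
    b = new.reindex(keys)[cols]

    # Start with "changed" everywhere, then overwrite the more specific cases.
    out = pd.DataFrame("C", index=keys, columns=cols)
    out = out.mask(a.eq(b), "S")

    # "O": the old value is a nonzero number and the new value is zero.
    a_num = a.apply(pd.to_numeric, errors="coerce")
    b_num = b.apply(pd.to_numeric, errors="coerce")
    zeroed = a_num.notna() & a_num.ne(0) & b_num.eq(0)
    out = out.mask(zeroed, "O")

    # Whole-row indicators for keys present in only one file.
    out.loc[~keys.isin(new.index), :] = "D"
    out.loc[~keys.isin(old.index), :] = "N"
    return out.rename_axis(key).reset_index()


file1 = """key,field1,field2,field3
001,belgium,1000,123.56
002,usa,200,345.65
003,canada,3000,675.00
"""
file2 = """key,field1,field2,field3
001,belgium,500,0
002,usa,200,345.65
004,Brazil,2500,458.00
"""

# dtype=str keeps leading zeros in the key ("001" rather than 1).
old = pd.read_csv(io.StringIO(file1), dtype=str)
new = pd.read_csv(io.StringIO(file2), dtype=str)

result = compare_frames(old, new)
print(result.to_csv(index=False))
```

Values are compared as strings here, so `675.00` and `675` would count as changed. Comparing the numeric columns as numbers instead is a design choice that depends on the data.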
Soumendra
karthik
  • Welcome to SO. Please take the time to read [ask] and the other links found on that page. [Pandas has excellent documentation](http://pandas.pydata.org/pandas-docs/stable/index.html). Just dive in, follow the examples, adapt to your requirements. There is also the [Standard Library module `difflib`](https://docs.python.org/3/library/difflib.html#module-difflib). – wwii May 20 '18 at 16:36
  • Try reading about the csv library in Python; you can start here: https://stackoverflow.com/questions/14091387/creating-a-dictionary-from-a-csv-file – messy212 May 20 '18 at 16:54
  • Does the row order matter? – Anton vBR May 20 '18 at 16:55

0 Answers