I have two 5 GB CSV files with 10 columns each. I need to perform update/insert (upsert) logic and generate a final CSV by comparing both files.

How can I do this in Python with pandas?

If you have any alternative solutions for this job, let me know.

  • If you have the compute power and memory, you can do it. If not, we need to know what upsert/insert logic you want in order to suggest a better solution. – Irfanuddin Feb 07 '22 at 08:57
  • I need to update the company database collection that exists in MongoDB. The new company data is in CSV and JSON format. I want to update the company DB collection using the new CSV file. If a domain name is the same, I want to update the fields that exist in the new CSV file. If the domain is missing, then insert the complete row. – raju kancharla Feb 07 '22 at 09:12
  • A better approach would be to perform this operation in Mongo directly. Keep the old database as it is, read the new CSV file in chunks, compare each chunk with the old DB, and perform an update query. – Irfanuddin Feb 07 '22 at 09:35
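The chunked approach suggested in the last comment might look roughly like the sketch below. It assumes the key column is named `domain`, that empty cells in the new CSV should not overwrite existing values, and that `collection` is a pymongo collection object; the function names `build_upsert_specs` and `upsert_csv` and the chunk size are hypothetical, chosen only for illustration.

```python
import io

import pandas as pd


def build_upsert_specs(chunk, key="domain"):
    """Turn one DataFrame chunk into (filter, update) pairs for an upsert."""
    specs = []
    for row in chunk.to_dict(orient="records"):
        if pd.isna(row[key]):
            continue  # no key value, nothing to match on
        # Keep only fields that have a value in the new file, so that
        # empty cells do not overwrite data already in the collection.
        fields = {k: v for k, v in row.items() if k != key and pd.notna(v)}
        if not fields:
            continue  # assumption: rows carrying only a key are skipped
        specs.append(({key: row[key]}, {"$set": fields}))
    return specs


def upsert_csv(path, collection, key="domain", chunksize=50_000):
    """Stream the CSV in chunks and send one bulk upsert per chunk."""
    from pymongo import UpdateOne  # imported lazily; needed only when writing

    for chunk in pd.read_csv(path, chunksize=chunksize):
        ops = [UpdateOne(flt, upd, upsert=True)
               for flt, upd in build_upsert_specs(chunk, key)]
        if ops:
            collection.bulk_write(ops, ordered=False)


# Small in-memory demo of the pure part (no database required).
sample = io.StringIO("domain,name,city\nb.com,Beta Inc,\nc.com,Gamma,Oslo\n")
for chunk in pd.read_csv(sample, chunksize=1):
    print(build_upsert_specs(chunk))
```

Only one chunk of the new file is held in memory at a time, and the existing collection is never loaded into Python. An index on the key field (for example via `collection.create_index("domain")`) would likely be needed for the per-row lookups to stay fast.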

1 Answer

Try using the isin() method or the merge() method to compare the two CSV files.

import pandas as pd

csv1 = pd.read_csv("file1.csv")
csv2 = pd.read_csv("file2.csv")

# Compare using isin(): keep rows of csv1 that also appear, column for column, in csv2
result = csv1[csv1.apply(tuple, axis=1).isin(csv2.apply(tuple, axis=1))]
print(result)

# Compare using merge(): keep rows that appear in only one of the two files
result2 = csv1.merge(csv2, indicator=True, how="outer").loc[lambda v: v["_merge"] != "both"]
print(result2)

To update or insert into CSV files, check out the following link: Updating Values in csv files
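The snippets above only compare the two files. For the upsert itself, one possible pandas-only sketch uses `combine_first`, which prefers values from the new data and falls back to the old data where the new value is missing. It assumes both files share a key column named `domain` and that both fit in memory, which may not hold for 5 GB files; the helper name `upsert_frames` and the sample data are hypothetical.

```python
import io

import pandas as pd


def upsert_frames(old, new, key="domain"):
    """Update rows of `old` with non-null values from `new`; add new keys."""
    # If a key is duplicated within a file, keep its last occurrence.
    old_k = old.drop_duplicates(subset=key, keep="last").set_index(key)
    new_k = new.drop_duplicates(subset=key, keep="last").set_index(key)
    # New values win; gaps in the new data are filled from the old data.
    return new_k.combine_first(old_k).reset_index()


# Tiny in-memory stand-ins for the two CSV files.
old = pd.read_csv(io.StringIO(
    "domain,name,city\na.com,Alpha,Paris\nb.com,Beta,Rome\n"))
new = pd.read_csv(io.StringIO(
    "domain,name,city\nb.com,Beta Inc,\nc.com,Gamma,Oslo\n"))

final = upsert_frames(old, new)
print(final)
# final.to_csv("final.csv", index=False) would write the combined file.
```

Here `b.com` gets its new name but keeps its old city because that cell is empty in the new file, `c.com` is inserted, and `a.com` is left unchanged.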

  • This is correct, but the OP has a constraint: their files are ~5 GB each. – Irfanuddin Feb 07 '22 at 09:36
  • Thanks for your response. Would merge be an optimal approach? Currently, I have a 5 GB collection in MongoDB with approximately 9M records. The new data is about 3 GB, approximately 1M records. How long would it take to execute? I have some ideas: converting the 3 GB of records to JSON and iterating over the 1M records using an index on the domain name might execute faster, and MongoDB has a built-in upsert feature as well. If I use a GPU, will merge/isin work faster, even in the future as the data gradually grows? – raju kancharla Feb 07 '22 at 10:18
  • For your case, merge is not advisable. As I mentioned in the comments under the question, you can leverage Mongo's capabilities. – Irfanuddin Feb 08 '22 at 20:00