Memory management related issue while comparing files using pandas

Question

I have below python code which compares source system file with the target system file. Both the files are comma separated but can come with extension of .txt or .csv. However, the files will be comma separated. The code tries to list out unique records from source file (which are not in target file) and target file (which are not in source file). This code works well when the file size is less than 1GB. But for some reconciliation, the entire table data was sent which has million of records causing the file size to be around 4-5GB. As a result, when I execute the code in Pycharm, I get memory utilization error and the Pycharm is closed. Any insight on how to overcome the memory issue? I tried changing the heap size to 8096MB but even then I am getting the same issue.

File Layout (for both source system and target system):

12340012,01,78978912,12345678912345
1213456789,02,56789012,1234567890121

Here field 1 can be up to 10 chars, field 2 up to 2 chars, field 3 up to 8 chars, field 4 up to 14 chars

Python Code:

#load data to Dframe Object
print("Load started for DataFrame - Target File", datetime.datetime.now())
df_target=pd.read_csv(TargetFile, names=["CardholdNumber", "MemberNo", "CrdSponsorNo"],
                       dtype={"CardholdNumber": str, "MemberNo": str, "CrdSponsorNo": str})
print("Load ended for DataFrame - Target File", datetime.datetime.now())
print('//')


print("Load started for Header in DataFrame - Target File", datetime.datetime.now())
df_rxods10.insert(0, "Source", "rods10")
print("Load ended for Header in DataFrame - Target File", datetime.datetime.now())
print('//')


print("Load started for DataFrame - Source File", datetime.datetime.now())
df_source=pd.read_csv(SourceFile, names=["CardholdNumber", "MemberNo", "CrdSponsorNo"],
                      dtype={"CardholdNumber": str, "MemberNo": str, "CrdSponsorNo": str})
print("Load ended for DataFrame - Source File", datetime.datetime.now())
print('//')


print("Load started for Header in DataFrame - Source File", datetime.datetime.now())
df_prod10.insert(0, "Source", "prod10")
print("Load ended for Header in DataFrame - Source File", datetime.datetime.now())
print('//')


#vertical con
print("DataFrame Concatenation started", datetime.datetime.now())
df=pd.concat([df_target,df_source],axis=0)
print("DataFrame Concatenation ended", datetime.datetime.now())
print('//')


print("DataFrame sorting started", datetime.datetime.now())
df1=df.sort_values(by=["CardholdNumber", "MemberNo", "CrdSponsorNo"])
print("DataFrame sorting ended", datetime.datetime.now())
print('//')


print("DataFrame drop_duplicate started", datetime.datetime.now())
newdf = df1.drop_duplicates(subset=["CardholdNumber", "MemberNo", "CrdSponsorNo"], keep=False)
print("DataFrame drop_duplicate ended", datetime.datetime.now())
print('//')


print("DataFrame write to output file started", datetime.datetime.now())
newdf.to_csv(diffFile)
print("DataFrame write to output file ended", datetime.datetime.now())
print('//')

I tried changing the heap size to 8096MB but even then getting the same memory related issue

I would start by reframing the solution. I would stream read the rows from each file and store the row hashes (keys) and row indexes (values) in two dictionaries. Then you can use the set differences of the keys to determine rows in one file that are not in the other. The unique keys with row indexes allows you to get the full rows from the original files. — JonSG, May 19 '23 at 13:35
@JonSG - I am new to hashes, so can you please help on how hashes can be created? — Rahul Dubey, Jul 05 '23 at 12:53

score 0 · Answer 1 · answered Jul 05 '23 at 14:29

This might help get you started. We are going to use the hashlib.sha256() of each row to construct a key and then use it to determine unique rows in each dataset.

Hopefully we only need to have a dictionary of these hashes from dataset 1 as the high water mark.

import random
import string
import hashlib
import csv

## ---------------------------
## Create a bunch of random test rows
## ---------------------------
data = [
    [
        "".join(random.choice(string.digits) for _ in range(10)),
        "".join(random.choice(string.digits) for _ in range(2)),
        "".join(random.choice(string.digits) for _ in range(8)),
        "".join(random.choice(string.digits) for _ in range(14)),
    ]
    for _ in range(100)
]
## ---------------------------

## ---------------------------
## create test data files
## ensuring that each file has at least one distinct row.
## ---------------------------
with open("data_a.txt", "w", newline="") as file_out:
    csv.writer(file_out).writerows(data[5:])

with open("data_b.csv", "w", newline="") as file_out:
    csv.writer(file_out).writerows(data[:-5])
## ---------------------------

## ---------------------------
## Read the first file and use the hash of the row
## as a dictionary key with the value of the index of that row
## ---------------------------
distinct_a = {}
with open("data_a.txt", "r") as file_in:
    for index, row in enumerate(file_in):
        key = hashlib.sha256(row.encode()).digest()
        distinct_a[key] = index

## ---------------------------
## Read the second file and hash the row.
## if that hash is in the dictionary remove that key
## otherwise it is distinct to the second file
## ---------------------------
distinct_b = []
with open("data_b.csv", "r") as file_in:
    for row in file_in:
        key = hashlib.sha256(row.encode()).digest()
        if key in distinct_a:
            del distinct_a[key]
        else:
            distinct_b.append(row)
## ---------------------------

## ---------------------------
## re-read the first file to get the distict
## rows by index. At this point the dictionary
## should only have entries distinct to it
## ---------------------------
with open("data_a.txt", "r") as file_in:
    distinct_a = [
        line
        for index, line
        in enumerate(file_in)
        if index in distinct_a.values()
    ]
## ---------------------------

print("Distinct rows of data_a.txt")
for row in distinct_a:
    print(f"\t{row.strip()}")

print("Distinct rows of data_b.csv")
for row in distinct_b:
    print(f"\t{row.strip()}")

That should give you something like:

Distinct rows of data_a.txt
        1334346746,66,62164494,71677917539944
        3565821459,34,49828768,96013202928584
        1092276254,71,02236156,81649689378691
        5227992584,84,65952571,24866977701386
        5246876395,96,83509545,66553005639782

Distinct rows of data_b.csv
        3116989960,09,95653733,23178941469168
        2693792822,43,68213056,47212574630967
        7087320938,80,47462357,99362176025434
        9527865125,68,52772900,55461281823636
        0039797355,42,40483637,14285794991838

Memory management related issue while comparing files using pandas

1 Answers1