-2

I have two files with 50 million rows each, each about 1.75 GB in size. I am unable to load them into Google Colab or on my computer to run a Python script to find the set difference (A - B). My computer and the Colab notebook crash when I try to load the data.

How do I proceed further to extract the required information?

  • What columns and data-types are in your CSVs? Load the data into a DB and perform the difference there? – Iain Shelvington Jan 28 '22 at 20:22
  • @IainShelvington There is only one column in each CSV file, containing a string of 10 characters – Incompleteness Jan 28 '22 at 20:26
  • 2
    I'm curious, are these 10 characters mostly multi-byte (non-ASCII)? `50M rows * (10 chars + line-break)/row` is between 550M - 600M chars per CSV. That'd be 5-6 bytes per char to get to a 3GB CSV. Even if the file were UTF-32 encoded, I don't understand how the files could be that big. What kind of characters are present? – Zach Young Jan 28 '22 at 20:56
  • @Zach Young, on re-checking, each row had a string of 32 chars and each CSV came to 1.75 GB. – Incompleteness Jan 29 '22 at 00:11
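As a minimal sketch of the database approach suggested in the first comment: load both columns into SQLite (bundled with Python) and let the database compute the difference on disk. The file names, table names, and the assumption of single-column CSVs with no header row are placeholders, not from the question.

import csv
import sqlite3

def load(conn, table, path):
    # Stream the CSV into a table row by row so the whole file
    # never has to sit in memory at once.
    conn.execute(f"CREATE TABLE {table} (value TEXT)")
    with open(path, newline="") as f:
        conn.executemany(
            f"INSERT INTO {table} (value) VALUES (?)",
            ((row[0],) for row in csv.reader(f) if row),
        )
    conn.commit()

conn = sqlite3.connect("diff.db")  # on-disk database, not RAM
load(conn, "a", "fileA.csv")
load(conn, "b", "fileB.csv")

# A - B: values present in fileA but not in fileB
with open("difference.csv", "w") as out:
    for (value,) in conn.execute(
        "SELECT value FROM a EXCEPT SELECT value FROM b"
    ):
        out.write(value + "\n")
conn.close()

Because the tables live in an on-disk file, memory use stays modest regardless of input size.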

3 Answers

2

If you cannot load the files into memory, you can iterate over file B, compute a hash of each line and store it in a Python set. Then iterate over the lines of file A, computing hashes the same way and keeping only the lines not present in the set. It will run slowly, but it should run (as long as it is not a single-line 3 GB file).

import hashlib

b_hashes = set()
with open('fileB','rb') as fb:
    for line in fb:
        b_hashes.add(line)   # if lines are short (<32 chars)
        #b_hashes.add(hashlib.md5(line).hexdigest()) #otherwise

with open('final_file.txt','wb') as f:
    with open('fileA','rb') as fa:
        for line in fa:
            if line not in b_hashes: # if lines are short
            #if hashlib.md5(line).hexdigest() not in b_hashes:  #otherwise
                f.write(line)
Maciej Wrobel
  • As someone who is perhaps not as well versed in hashing as you are, I see you're just adding the hashes to the set. To my understanding, if one adds a string to a set, collisions are handled. What happens here in the event of a collision? Could you add an edit to motivate *why* you're storing the hashes rather than just adding the strings to the set, and explaining what happens if a collision occurs? – Kraigolas Jan 28 '22 at 20:35
  • 1
    because at the time I wrote the answer there was no hint that each row is short - and for anything bigger than 32 chars the md5 sum will be shorter and could reduce the structure size. Indeed, for rows this short it is more efficient to just add the lines themselves. Edited the answer to match your point. – Maciej Wrobel Jan 28 '22 at 20:38
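To illustrate the trade-off discussed in this comment (worthwhile only when rows are longer than a digest), here is a sketch that stores the raw 16-byte MD5 digest instead of the 32-character hex string; it is not part of the original answer, and like any hashing approach it accepts the theoretical risk of collisions between distinct lines.

import hashlib

b_hashes = set()
with open('fileB', 'rb') as fb:
    for line in fb:
        b_hashes.add(hashlib.md5(line).digest())  # 16 raw bytes per line

with open('final_file.txt', 'wb') as f, open('fileA', 'rb') as fa:
    for line in fa:
        if hashlib.md5(line).digest() not in b_hashes:
            f.write(line)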
0

Just use a while loop and load the lines one by one:

file1 = open("file1.csv")
file2 = open("file2.csv")
output = open("difference.csv", "w")  # output file name is just an example

last1line = False
while not last1line:
    line1 = file1.readline()  # important, readline not readlines!!
    last1line = len(line1) == 0

    found = False
    last2line = False
    while not last2line:
        line2 = file2.readline()
        last2line = len(line2) == 0
        # compare here (skip the empty string that marks end of file)
        if not last2line and line1.rstrip("\r\n") == line2.rstrip("\r\n"):
            found = True
    # keep lines of file1 that were never found in file2
    if line1 and not found:
        output.write(line1)
    file2.seek(0)  # rewind file2 before processing the next line of file1

file1.close()
file2.close()
output.close()
Sagitario
0

Rling can do this for you extremely fast; in fact, it is the fastest method I know of. If you have the memory, it can do it in seconds or less. It recommends double the input file size in memory, so in your case about 3.5 GB. There is also a no-memory option that creates a database on your PC instead, but that way is slower.

https://github.com/Cynosureprime/rling

You can use it like this:

./rling file1 difference.txt file2

or on Windows:

rling.exe file1 difference.txt file2

x34c4