3

I have data in a .txt file that looks like this (let's name it "myfile.txt"):

28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank'~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@#29760145'~'0'~'The Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@#

I have tried a couple of ways to convert this .txt into a .csv. One of them was using the csv library, but since I like pandas a lot, I used the following:

import pandas as pd
import time

# Time at the start of the program is noted
start = time.time()

# We set the path where our file is located and read it
path = r'myfile.txt'
with open(path, 'r') as f:
    content = f.read()

# We replace undesired strings and introduce a line break
content_filtered = content.replace("#@#@#", "\n").replace("'", "")

# We read everything into columns with the separator "~"
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')],
                  columns=['a', 'b', 'c', 'd'])

# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index=None)

end = time.time()

# Total time taken to write the file
print("Execution time in seconds: ", (end - start))

This takes about 35 seconds to process for a 300 MB file, which is performance I can accept. However, when I try to do the same for a much larger file, 35 GB in size, it produces a MemoryError.

I also tried the csv library, putting everything into a list first and afterward writing it out to a CSV:

import csv

# split_content is the same nested list built in the pandas snippet above
split_content = [x.split('~') for x in content_filtered.split('\n')]

# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)

The results were similar: not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files into .csv, in the range of 35 GB and above?

I'd be happy to read any suggestions you may have, thanks in advance!

Aquiles Páez
  • 1
    Your data seems to have 4 columns, but you only name 3? Your code actually fails with an error for exactly that reason - did you forget a column label, or did you forget an instruction to throw one column out? Or were you perhaps wanting to set the first as the index? – Grismar Dec 02 '21 at 23:06
  • @Grismar indeed, that's a typo on my end, I edited the snippet in my question. Thank you! – Aquiles Páez Dec 02 '21 at 23:10
  • Seems like your question is misnamed: the real problem is the memory error message. Processing 35GiB is never going to be fast. – DisappointedByUnaccountableMod Dec 02 '21 at 23:43
  • @balmy That's one of the issues yeah, but mostly I'd like to know if there are more efficient ways of doing what I'm trying to achieve. Of course, nothing will ever read 35GB of data in the blink of an eye, but if I can find a more efficient way, I'll surely use that one. – Aquiles Páez Dec 02 '21 at 23:50
  • Do any of the answers below work for you? I see you had some trouble running the first, were you able to work that out? This was a fun challenge… thanks for posting! – Zach Young Dec 06 '21 at 19:07

3 Answers

3

Since your code just does straight-up replacement, you could read through all the data sequentially and detect the parts that need replacing as you go:

def process(fn_in, fn_out, columns):
    new_line = b'#@#@#'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns)+'\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)


process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])

That does it pretty quickly. You may be able to improve performance further by reading in chunks, but it is already quite fast as it is.

This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise the speed of reading the file.

Edit: there was a big mistake in an edge case, which I realised after re-reading; it's fixed now.
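
If you do want to try reading in chunks, here is a rough sketch of the same state machine applied to blocks instead of single read(1) calls; process_chunked and chunk_size are just illustrative names, and it sidesteps the walrus operator in case your environment flags it:

def process_chunked(fn_in, fn_out, columns, chunk_size=1024 * 1024):
    # Same replacement logic as above, but the bytes are pulled in large
    # blocks and the output is buffered per block.
    new_line = b'#@#@#'
    with open(fn_in, 'rb') as f_in, open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns) + '\n').encode())
        i = 0
        out = bytearray()
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                break
            for b in chunk:  # iterating over bytes yields ints
                if b == new_line[i]:
                    # keep matching the record separator
                    i += 1
                    if i == len(new_line):
                        # matched entirely, emit just a newline
                        out += b'\n'
                        i = 0
                    continue
                elif i > 0:
                    # the match failed partway, emit what was held back
                    out += new_line[:i]
                    i = 0
                if b == ord("'"):
                    pass  # drop the quotes
                elif b == ord('~'):
                    out += b','
                else:
                    out.append(b)
            # flush this block's output
            f_out.write(out)
            out.clear()


process_chunked('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])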

Grismar
  • Thanks for this approach, but I'm not sure if I'm using it right. Spyder, the IDE that I use for Python, highlights this line as invalid syntax: while b := f_in.read(1): and I'm not sure why; I thought this was a valid operator for Python 3.9. Can you suggest an alternative so I can edit this line and try your approach? Thanks in advance. – Aquiles Páez Dec 02 '21 at 23:48
  • Try again with parentheses around it (I updated it) - if that doesn't work, you can just `while True:` followed by `b = f_in.read(1)` and `if not b: break`. I'm using 3.10, but you're right it should work on 3.9 as well. I think the syntax checker for Spyder may just be a bit off there... – Grismar Dec 03 '21 at 00:05
  • Thanks, I attempted the suggestion, but it seems to still be getting stuck at b = f_in.read(1). Perhaps the file is simply too large, and regardless of what we do to write the CSV, this method doesn't compensate because reading takes time too? – Aquiles Páez Dec 03 '21 at 00:13
  • The only reason I can think of for it getting stuck if there's something really odd in the file itself. Does it write a part of the output and then get stuck? Or does it get stuck right at the start? Have you tried stepping through it with a debugger? – Grismar Dec 03 '21 at 00:22
  • 1
    @AquilesPáez, does it work on the small sample you provided in your question? I just ran this code on the sample and it works. – Zach Young Dec 03 '21 at 01:50
3

I took your sample string and made a sample file by multiplying that string by 100 million (something like your_string * int(1e8)...) to get a test file that is 31 GB.

Following @Grismar's suggestion of chunking, I made the following, which processes that 31 GB file in ~2 minutes, with peak RAM usage that depends on the chunk size.

The complicated part is keeping track of the field and record separators, which are multiple characters long and will at some point span a chunk boundary and thus be truncated.

My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written out, and the partial becomes the beginning of (and should be completed by) the next chunk:

CHUNK_SZ = 1024 * 1024

FS = "'~'"
RS = '#@#@#'

# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['#@#@', '#@#', '#@', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS =  PARTIAL_FSES + PARTIAL_RSES 

f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')

f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break

    # Any previous partial separator, plus new chunk
    line += chunk

    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''

    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break

    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))

    # Add partial back, to be completed next chunk
    line = final_partial


# Clean up
f_in.close()
f_out.close()
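
To sanity-check the boundary handling, you can point the script at the small sample from the question and use a tiny CHUNK_SZ (say, 8) so the separators are guaranteed to get split across chunks, then peek at the first few lines of the output:

with open('out.csv') as f:
    for _ in range(3):
        print(f.readline().rstrip())

That should print the a,b,c,d header followed by the first two records.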
Zach Young
  • I will definitely study your solution; efficient file I/O is something I clearly underestimated. Thanks for this! – Aquiles Páez Dec 10 '21 at 01:26
  • You're welcome. And yeah, 30+ GB is "a lot", and I'm new to these kinds of problems, too. I'm curious how often you will need to process a file like this… will it be an ongoing thing for you? – Zach Young Dec 10 '21 at 03:14
  • Ahm, to be fair I'm not quite sure; it could be once a month or once every six months or so. It's my first time ever working with files this large, and what I get from your comment is that it's not that common, right? – Aquiles Páez Dec 12 '21 at 13:30
  • 1
    The size is not uncommon, these days. The format with the multi-character delimiters is nothing I’ve seen before, and it makes processing trickier… do you know what program/process created it… why it isn’t just a real CSV to start with? – Zach Young Dec 12 '21 at 18:33
  • 1
    I see, then it's just my inexperience talking, but I'm glad to learn. And no, I'm not sure which program or process created it; I asked, but the person that provided the file to me just received it like that. – Aquiles Páez Dec 13 '21 at 22:40
2

Just to share an alternative way, based on convtools (table docs | github). This solution is faster than the OP's, but ~7 times slower than Zach's (Zach's works with str chunks, while this one works with row tuples, reading via csv.reader).

Still, this approach may be useful, as it allows you to tap into stream processing and work with the columns: rearrange them, add new ones, etc.

from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table

def get_rows(filename):
    with open(filename, "r") as f:
        # stream the file record by record, splitting on the record separator
        for row in split_buffer(f, "#@#@#"):
            # drop the quote characters so "~" alone separates the fields
            yield row.replace("'", "")

Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)
westandskif