
I am trying to split a very large text file (approximately 150 gigabytes) into several smaller text files (approximately 10 gigabytes each).

My general process will be:

# iterate over the file one line at a time
# accumulate the current batch as a string
# keep a running count that tracks the size of the accumulated batch
--> # when that count says the batch has reached the target size: (this is where I am unsure)
        # write the batch to file

I have a rough metric for deciding when a batch is complete (when the desired batch size is reached), but I am not so clear on how often I should write to disk within a given batch. For example, if my batch size is 10 gigabytes, I assume I will need to write iteratively rather than hold the entire 10 gigabyte batch in memory. I obviously do not want to write more often than I have to, as that could be quite expensive.
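Roughly, I am imagining something like the sketch below (untested; the file names and the two size thresholds are just placeholders):

# rough sketch only: accumulate lines in a small in-memory buffer, flush the
# buffer to the current output file whenever it passes flush_threshold, and
# start a new output file whenever the running total passes batch_size
batch_size = 10 * 10**9        # placeholder: target size per output file (~10 GB)
flush_threshold = 64 * 2**20   # placeholder: in-memory buffer limit (~64 MiB)

buffer, buffered, written, part = [], 0, 0, 0
outfile = open(f"batch-{part}.txt", "wb")      # placeholder output name
with open("big_input.txt", "rb") as infile:    # placeholder input name
    for line in infile:
        buffer.append(line)
        buffered += len(line)
        if buffered >= flush_threshold:        # flush the small buffer to disk
            outfile.write(b"".join(buffer))
            written += buffered
            buffer, buffered = [], 0
            if written >= batch_size:          # current batch is big enough
                outfile.close()
                part += 1
                outfile = open(f"batch-{part}.txt", "wb")
                written = 0
    outfile.write(b"".join(buffer))            # flush whatever is left over
outfile.close()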

Do y'all have any rough calculations or tricks you like to use to figure out when to write to disk for a task such as this, e.g. batch size vs. available memory, or something like that?

  • Why bother buffering? Can't you just write one line at a time? – Ilya Jun 15 '20 at 18:27
  • Wouldn't writing each line be extremely expensive? – Mason Edmison Jun 15 '20 at 18:30
  • [mmap](https://docs.python.org/3/library/mmap.html) it and let the OS figure it out. – sbabbi Jun 15 '20 at 18:31
  • You can open in binary with a larger `buffering` parameter, like perhaps 4 meg. Then do a line-by-line read/write. You'll get good performance through the operating system disk cache. You can go faster with DIRECTIO, but that's not so easy with Python. – tdelaney Jun 15 '20 at 18:32
  • @sbabbi I have never used mmap before, could you elaborate a little? – Mason Edmison Jun 15 '20 at 18:33
  • A single line write goes into a local process buffer. When it reaches the "buffering" count (I'm not sure of its default, but it's pretty small) the buffer is flushed to the operating system cache. The OS cache pushes to the file system in the background as more memory is needed. The limiting factor is the disk I/O. Line-by-line writes are okay. – tdelaney Jun 15 '20 at 18:34
  • @tdelaney Thanks! This type of thing is a bit new for me. Could you provide a simple example? – Mason Edmison Jun 15 '20 at 18:37
  • Line by line will work, but why waste time reading all those lines? Why don't you bulk read+write large chunks up to slightly less than your desired batch size, e.g. 10megs at a time, then read/write a line (or perhaps two) to make sure the split is on a line end? – DisappointedByUnaccountableMod Jun 15 '20 at 19:11
  • By read/write large chunks do you mean setting the "buffering" arg to 10 * int(1e7)? – Mason Edmison Jun 15 '20 at 19:21
  • Well, you could set the buffering to a large value, but the key thing is to read+write not lines but large chunks that you don't need to split into lines, e.g. 10megs at a time, until just less than your split size; then do a readline to get to the next line ending (or perhaps two lines if you're worried about syncing with the binary content correctly), write that additional little bit, then close the output file and move on to the next file. – DisappointedByUnaccountableMod Jun 15 '20 at 21:49
  • I posted my code to do this as an answer - for me/my SSD storage, it's I/O bound and ~5x faster than line-by-line :-) – DisappointedByUnaccountableMod Jun 18 '20 at 09:00

3 Answers


I used a slightly modified version of this for parsing a 250GB JSON file. I choose how many smaller files I need (number_of_slices), then I find the positions where to slice the file (I always look for a line end). Finally, I slice the file with file.seek and file.read(chunk):

import os
import mmap


FULL_PATH_TO_FILE = 'full_path_to_a_big_file'
OUTPUT_PATH = 'full_path_to_a_output_dir' # where sliced files will be generated


def next_newline_finder(mmapf):
    # advance byte by byte from the current position and return the offset
    # just past the next newline
    while True:
        current = mmapf.read_byte()
        if current == ord('\n'):  # or whatever line-end symbol
            return mmapf.tell()


# find positions where to slice a file
file_info = os.stat(FULL_PATH_TO_FILE)
file_size = file_info.st_size
positions_for_file_slice = [0]
number_of_slices = 15  # say you want to slice the big file into 15 smaller files
size_per_slice = file_size//number_of_slices

with open(FULL_PATH_TO_FILE, "r+b") as f:
    mmapf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    slice_counter = 1
    while slice_counter < number_of_slices:
        pos = size_per_slice*slice_counter
        mmapf.seek(pos)
        newline_pos = next_newline_finder(mmapf)
        positions_for_file_slice.append(newline_pos)
        slice_counter += 1

# create ranges for the found positions (from, to); the last slice runs to end of file
positions_for_file_slice = list(
    zip(positions_for_file_slice, positions_for_file_slice[1:] + [file_size]))


# do the actual slicing of the file; note that each slice is read fully into
# memory, so for multi-gigabyte slices you may want to copy in smaller blocks
with open(FULL_PATH_TO_FILE, "rb") as f:
    for i, position_pair in enumerate(positions_for_file_slice):
        read_from, read_to = position_pair
        f.seek(read_from)
        chunk = f.read(read_to-read_from)
        with open(os.path.join(OUTPUT_PATH, f'dummyfile{i}.json'), 'wb') as chunk_file:
            chunk_file.write(chunk)
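
As a quick sanity check (untested sketch reusing the constants and variables above), the slices should add up to exactly the size of the original file:

# untested sanity check: every byte of the input should land in exactly one slice
sliced_total = sum(
    os.path.getsize(os.path.join(OUTPUT_PATH, f'dummyfile{i}.json'))
    for i in range(len(positions_for_file_slice)))
assert sliced_total == file_size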

Assuming your large file is simple unstructured text (i.e. this is no good for structured text like JSON), here's an alternative to reading every single line: read large binary bites of the input file until you are at your chunksize, then read a couple of lines so the split lands on a line ending, close the current output file, and move on to the next.

I compared this with the line-by-line approach using @tdelaney's code adapted to the same chunksize as my code. That code took 250s to split a 12GiB input file into 6x2GiB chunks, whereas this took ~50s, so maybe five times faster, and it looks like it's I/O bound on my SSD, running >200MiB/s read and write, where the line-by-line version was running 40-50MiB/s read and write.

I turned buffering off because there's not a lot of point. The bite size and the buffering setting may be tunable to improve performance; I haven't tried any other settings, as for me it seems to be I/O bound anyway.

import time

outfile_template = "outfile-{}.txt"
infile_name = "large.text"
chunksize = 2_000_000_000
MEB = 2**20   # mebibyte
bitesize = 4_000_000 # the size of the reads (and writes) working up to chunksize

count = 0

starttime = time.perf_counter()

infile = open(infile_name, "rb", buffering=0)
outfile = open(outfile_template.format(count), "wb", buffering=0)

while True:
    byteswritten = 0
    while byteswritten < chunksize:
        bite = infile.read(bitesize)
        # check for EOF
        if not bite:
            break
        outfile.write(bite)
        byteswritten += len(bite)
    # check for EOF
    if not bite:
        break
    # top up to the next line ending so the split lands on a line boundary
    for i in range(2):
        line = infile.readline()
        # check for EOF
        if not line:
            break
        outfile.write(line)
    # check for EOF
    if not line:
        break
    outfile.close()
    count += 1
    print(count)
    outfile = open(outfile_template.format(count), "wb", buffering=0)

outfile.close()
infile.close()

endtime = time.perf_counter()

elapsed = endtime-starttime

print( f"Elapsed= {elapsed}" )

NOTE: I haven't exhaustively tested that this doesn't lose data; although there's no evidence that it does, you should validate that yourself.

It might be useful to add some robustness by checking, when at the end of a chunk, how much data is left to read, so you don't end up with the last output file being 0-length (or shorter than bitesize).
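
Something along these lines might do it (untested sketch; infile, infile_name, outfile_template, count and bitesize are the names from the code above):

import os

# untested sketch: rotate to a new output file only if enough input remains to
# make the next file worthwhile, so the run doesn't end on a tiny or
# zero-length final file
def should_rotate(infile, infile_name, min_tail):
    bytes_left = os.path.getsize(infile_name) - infile.tell()
    return bytes_left >= min_tail

# in the loop above, guard the rotation with e.g.:
#   if should_rotate(infile, infile_name, bitesize):
#       outfile.close()
#       count += 1
#       outfile = open(outfile_template.format(count), "wb", buffering=0)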

HTH barny


Here is an example of line-by-line writes. The files are opened in binary mode to avoid the line-decode step, which takes a modest amount of time and can skew character counts; for instance, UTF-8 encoding may use multiple bytes on disk for a single Python character.
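
A quick illustration of the character-versus-byte difference:

s = "naïve"
print(len(s))                   # 5 characters in Python
print(len(s.encode("utf-8")))   # 6 bytes on disk: 'ï' needs two bytes in UTF-8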

4 meg is a guess at the buffering value. The idea is to get the operating system to read more of the file at once, reducing seek times. Whether this works, and what the best number to use is, is debatable, and will be different for different operating systems. I found 4 meg made a difference... but that was years ago and things change.

outfile_template = "outfile-{}.txt"
infile_name = "infile.txt"
chunksize = 10_000_000_000
MEB = 2**20   # mebibyte

count = 0
byteswritten = 0
infile = open(infile_name, "rb", buffering=4*MEB)
outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)

try:
    for line in infile:
        # rotate to a new output file once the current one exceeds chunksize
        if byteswritten > chunksize:
            outfile.close()
            byteswritten = 0
            count += 1
            outfile = open(outfile_template.format(count), "wb", buffering=4*MEB)
        outfile.write(line)
        byteswritten += len(line)
finally:
    infile.close()
    outfile.close()
tdelaney