Here's how I would do it without reading all the data into memory and without too much complexity:
First, compute the maximum line length in the file (using binary mode):
with open(inputfile, 'rb') as source:
    max_line_len = max(len(line) for line in source)  # lengths include the trailing "\n" (assumes the file ends with a newline)
Then write another file to disk with the correct padding, so each line has exactly the same size (in total this needs at least twice the original file's disk space, possibly much more, but since you don't have the memory, disk is the trade-off). Count the lines at the same time:
with open(inputfile, 'rb') as source, open(outputfile, 'wb') as dest:
    for count, line in enumerate(source):
        dest.write(line + b"*" * (max_line_len - len(line)))  # write the padded line
You just created a bigger file, but now every line has exactly the same length. Note that we padded after the linefeed, which will be useful later. Viewed as raw text, the output would look something like this (if max_line_len were 20, for instance):
the first line
****the second line
***another line
******
(I'm not sure of the exact number of stars, but you get the idea. Note that the padding character doesn't matter as long as it isn't \n.)
This means you can seek to the start of any line with a simple multiplication: line n starts at byte offset n * max_line_len, just like a fixed-size record file or a database.
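To make that concrete, here is a minimal sketch of such a random access (the helper name read_nth_line is mine, purely for illustration):

def read_nth_line(padded_path, n, max_line_len):
    # jump straight to line n, without scanning the n lines before it
    with open(padded_path, 'rb') as f:
        f.seek(n * max_line_len)
        return f.read(max_line_len).split(b"\n")[0]  # drop the "\n" and the padding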
Now generate the list of line indexes and shuffle it:
import random

indexes = list(range(count + 1))  # count + 1 lines were written above
random.shuffle(indexes)
Now iterate over this shuffled list of indexes, seek to the corresponding offset, read one fixed-size chunk, and split on the linefeed (this is where padding after the linefeed pays off) to discard the padding:
with open(outputfile, 'rb') as source:
    for idx in indexes:
        source.seek(idx * max_line_len)  # jump straight to the record for line idx
        random_line = source.read(max_line_len).decode().split("\n")[0]
        print(random_line)  # or store to another file
I haven't tested this, but it should work as long as you have enough disk space. Of course, it is very wasteful if you have one very long line while the rest are short.
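For reference, here is the whole thing assembled into one function. This is an untested sketch along the same lines; the function name shuffle_lines, the tempfile scratch file, and writing the result to a file (instead of printing) are my additions, and it assumes the input is non-empty and ends with a newline:

import os
import random
import tempfile

def shuffle_lines(inputfile, shuffled_file):
    # pass 1: find the longest line (length includes the trailing "\n")
    with open(inputfile, 'rb') as source:
        max_line_len = max(len(line) for line in source)

    # pass 2: write a scratch copy where every line is padded to max_line_len
    fd, padded = tempfile.mkstemp()
    os.close(fd)
    with open(inputfile, 'rb') as source, open(padded, 'wb') as dest:
        for count, line in enumerate(source):
            dest.write(line + b"*" * (max_line_len - len(line)))

    # pass 3: read the fixed-size records back in shuffled order
    indexes = list(range(count + 1))
    random.shuffle(indexes)
    with open(padded, 'rb') as source, open(shuffled_file, 'wb') as dest:
        for idx in indexes:
            source.seek(idx * max_line_len)
            dest.write(source.read(max_line_len).split(b"\n")[0] + b"\n")

    os.remove(padded)  # the padded copy was only scratch space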