
The binary data file contains ~1B values in total; in string form the content looks like this: '50.134|50.135|180.453|180.473|191.001|191.001...3000.3453'.

Query: find offsets (indices) of the first value x1 >= 200.03 and the last value x2 <= 200.59.
Read: read the values between x1 and x2, ~1k values.

Ideally, the querying and reading shouldn't take longer than 200ms. The file cannot be held in memory; it lives on disk (or even AWS S3).

What I have come up with so far: the file is split into chunks (e.g. 5 MB). Each chunk's first and last values are stored in an index that is used to find the chunks relevant to a query. Next, those chunks are read into memory and an in-memory search is performed.
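
Roughly, my current idea as a (hypothetical, untested) Python sketch, assuming the values are fixed-size 8-byte little-endian doubles sorted in ascending order (all names and sizes below are just placeholders):

import bisect
import struct

VALUE_SIZE = 8                        # assuming 8-byte little-endian doubles
CHUNK_SIZE = 5 * 2**20                # 5 MB chunks
VALUES_PER_CHUNK = CHUNK_SIZE // VALUE_SIZE

def build_chunk_index(path):
    # (first_value, last_value) of every chunk, built once up front
    index = []
    with open(path, "rb") as f:
        f.seek(0, 2)
        n_values = f.tell() // VALUE_SIZE
        for start in range(0, n_values, VALUES_PER_CHUNK):
            end = min(start + VALUES_PER_CHUNK, n_values) - 1
            f.seek(start * VALUE_SIZE)
            first = struct.unpack("<d", f.read(VALUE_SIZE))[0]
            f.seek(end * VALUE_SIZE)
            last = struct.unpack("<d", f.read(VALUE_SIZE))[0]
            index.append((first, last))
    return index

def query(path, chunk_index, x1, x2):
    # find the chunks that can overlap [x1, x2], read them, then search in memory
    firsts = [first for first, _ in chunk_index]
    lasts  = [last for _, last in chunk_index]
    lo_chunk = bisect.bisect_left(lasts, x1)    # first chunk whose last value is >= x1
    hi_chunk = bisect.bisect_right(firsts, x2)  # one past the last chunk whose first value is <= x2
    with open(path, "rb") as f:
        f.seek(lo_chunk * CHUNK_SIZE)
        raw = f.read(max(hi_chunk - lo_chunk, 0) * CHUNK_SIZE)
    values = struct.unpack("<%dd" % (len(raw) // VALUE_SIZE), raw)
    return values[bisect.bisect_left(values, x1):bisect.bisect_right(values, x2)]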

I'd be glad to hear about how others would approach the problem.

Thanks for your help!

intsco
  • Does this answer your question? [Lazy Method for Reading Big File in Python?](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python) – Olvin Roght Apr 20 '20 at 12:11
  • How is the file sorted? Is it sorted by a single "column"? If it is, which one? – boechat107 Apr 20 '20 at 12:20
  • Does every "binary" entry in the file have the same size? If so, you can use a binary search algorithm by accessing entries at location index*entrySize. If not, you could create an in-memory index of chunks by reading the file once and keeping a list of every Nth entry's value and position (then use bisect_right and bisect_left to find the byte range that contains your value range) – Alain T. Apr 20 '20 at 13:09
  • @boechat107 I've described the simplest case where records with only one field x exist. But the same solution should work for an extended case where each record contains multiple fields x, y, z but the file is sorted by x. – intsco Apr 20 '20 at 16:34
  • @AlainT. Yes, entries have the same size but I was concerned about the total delay of reading N random locations where N can go up to 32. Using an index of locations of every Nth entry is a sound idea. – intsco Apr 20 '20 at 16:47
  • @OlvinRoght Not quite, I don't need to read the file in small chunks. I need to find and read a small slice of it. And it should be really fast. – intsco Apr 20 '20 at 16:56
  • @intsco--since the file is in ascending order, how about performing a binary search to find the right chunk? The binary search algorithm is used to position the reads in the file, as in this [example](http://www.grantjenks.com/wiki/random/python_binary_search_file_by_line). – DarrylG Apr 20 '20 at 17:28
  • @intsco, you have to read the file in small chunks and perform the search on each chunk. – Olvin Roght Apr 20 '20 at 17:52

1 Answer


Here is an example (pseudo code, not tested) of how you could build a partial index of entries in your binary file that will allow you to access subranges efficiently, without loading the whole file into memory and while maximizing sequential reads:

import bisect

recordSize  = 16   # size in bytes of one record in the file
chunkSize   = 1024 # groups of 1K records (in number of records)
chunkIndex  = []   # value of the first record of each chunk

with open("testFile", "rb") as binaryFile:

    # build the partial index (chunkIndex) - only done once
    binaryFile.seek(0, 2)
    fileSize = binaryFile.tell()
    for position in range(0, fileSize, chunkSize * recordSize):
        binaryFile.seek(position)
        record     = binaryFile.read(recordSize)
        # use your own record/binary format conversion here
        chunkValue = int.from_bytes(record[:4], byteorder="little", signed=False)
        chunkIndex.append(chunkValue)


    # to access a range of records with values between A and B
    # (A and B are the query's lower and upper bounds):
    firstChunk = max(bisect.bisect_left(chunkIndex, A) - 1, 0)  # start one chunk before the first chunk whose first value is >= A
    position   = firstChunk * chunkSize * recordSize
    binaryFile.seek(position)
    while True:
        records = binaryFile.read(recordSize * chunkSize)  # sequential read
        if not records:
            break                                          # reached end of file
        for i in range(0, len(records), recordSize):
            record = records[i:i + recordSize]
            # use your own record/binary format conversion here
            value = int.from_bytes(record[:4], byteorder="little", signed=False)
            if value < A: continue
            if value > B: break
            # process record here ...
        if value > B: break

You will need to play with the value of chunkSize to find a sweet spot that balances load time/memory usage against data access time. Since your ranges will not always fall on chunk boundaries, in the worst case you may end up reading records that you don't want and having to skip over them; on average you will read chunkSize/2 unnecessary records. This is where the difference in performance between sequential and random access can pay off.

On a network drive, random access is impacted by latency, while sequential access is a function of bandwidth. In other words, more requests mean more round trips to the server (latency), and reading bigger chunks requires more packets (bandwidth).

If you are using an HDD (or network drive), sequential reads of multiple adjacent records will tend to be much faster (at least 20x) than random accesses, and you should get some benefit from this partial indexing.
However, if your file is on an internal SSD, then a standard binary search directly in the file (without memory indexing) will perform faster.

With 1 billion records, a straight binary search for the first record's position would require about 30 seek/read operations (2^30 > 1B). If you keep 16M entries in the chunk index, each chunk corresponds to 64 records, and those 16 million in-memory keys save you 24 of the 30 seek/read operations that a straight binary search would need. This comes at the cost of 32 (on average) unnecessary sequential reads.

You may also choose to implement a hybrid of the two approaches to minimize disk access (i.e. use the partial index to find the chunk range, then a binary search to pinpoint the exact position of the first record within the starting chunk). This would require only 6 seek/read operations to pinpoint the first record in the 64-record range indicated by the in-memory partial index.
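
For illustration, a rough (untested) sketch of that hybrid, reusing the recordSize, chunkSize and chunkIndex names from the code above (read_value and find_first_record are hypothetical helper names, and the record decoding is again a placeholder):

import bisect

def read_value(binaryFile, recordNumber, recordSize=16):
    # seek to a single record and decode its key
    # (use your own record/binary format conversion here)
    binaryFile.seek(recordNumber * recordSize)
    record = binaryFile.read(recordSize)
    return int.from_bytes(record[:4], byteorder="little", signed=False)

def find_first_record(binaryFile, chunkIndex, A, recordSize=16, chunkSize=1024):
    binaryFile.seek(0, 2)
    totalRecords = binaryFile.tell() // recordSize
    # in-memory step: the first record with value >= A lies in chunk j-1,
    # or is the very first record of chunk j
    j  = bisect.bisect_left(chunkIndex, A)      # first chunk whose first value is >= A
    lo = max(j - 1, 0) * chunkSize
    hi = min(j * chunkSize + 1, totalRecords)
    # on-disk step: lower-bound binary search inside that small window
    # (roughly 6 seek/reads when the window is ~64 records)
    while lo < hi:
        mid = (lo + hi) // 2
        if read_value(binaryFile, mid, recordSize) < A:
            lo = mid + 1
        else:
            hi = mid
    return lo   # record number of the first record with value >= A (totalRecords if none)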

In both approaches, once you've found the first record, the rest of the range is read sequentially from there until you reach the end of the range or the end of the file. If you expect to read the same records more than once, you may be able to optimize further by keeping a cache of record ranges that you've read before and using it to get the data without going back to disk (e.g. by skipping over reads of records that are already in the cache when reading sequentially).
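
As a simple (hypothetical) illustration of such a cache, keyed by chunk number (chunkCache and read_chunk are made-up names; in practice you would bound it with an LRU eviction policy):

chunkCache = {}   # chunk number -> raw bytes of that chunk

def read_chunk(binaryFile, chunkNumber, recordSize=16, chunkSize=1024):
    # return one chunk's raw bytes, hitting the disk only on a cache miss
    if chunkNumber not in chunkCache:
        binaryFile.seek(chunkNumber * chunkSize * recordSize)
        chunkCache[chunkNumber] = binaryFile.read(recordSize * chunkSize)
    return chunkCache[chunkNumber]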

Alain T.