4

I have a data file with 100,000+ lines. Each line contains just two fields, a key and a value, separated by a comma, and all keys are unique. I want to look up a value by key in this file. Loading it into a map is out of the question because that consumes too much memory (the code will run on an embedded device), and I don't want a DB involved. What I do so far is to preprocess the file on my PC, i.e., sort the lines, and then use binary search on the preprocessed file, like below:

public long findKeyOffset(RandomAccessFile raf, String key)
            throws IOException {
        int blockSize = 8192; // binary search over fixed-size blocks, then scan lines inside the final block
        long fileSize = raf.length();
        long min = 0;
        long max = (long) fileSize / blockSize;
        long mid;
        String line;
        while (max - min > 1) {
            mid = min + (long) ((max - min) / 2);
            raf.seek(mid * blockSize);
            if (mid > 0)
                line = raf.readLine(); // probably a partial line
            line = raf.readLine();
            String[] parts = line.split(",");
            if (key.compareTo(parts[0]) > 0) {
                min = mid;
            } else {
                max = mid;
            }
        }
        // find the right line
        min = min * blockSize;
        raf.seek(min);
        if (min > 0)
            line = raf.readLine();
        while (true) {
            min = raf.getFilePointer();
            line = raf.readLine();
            if (line == null)
                break;
            String[] parts = line.split(",");
            if (key.compareTo(parts[0]) <= 0) // stop at the first line whose key is >= the search key
                break;
        }
        raf.seek(min);
        return min;
    }

I think there are better solutions than this. Can anyone give me some enlightenment?

jfly
  • How about using a constant time sort algorithm? – Prashant Sep 26 '17 at 11:55
  • *"Loading it to a map is out of question as that consumes too much memory [...] What I do so far is to preprocess the file in my PC, i.e., sort the lines, then use binary search like below"* If your device has enough memory to sort the file content it also has enough memory to hold it in a map. – Timothy Truckle Sep 26 '17 at 12:01
  • @TimothyTruckle I sort it on my PC, then copy it to the device. – jfly Sep 26 '17 at 12:35
  • Store the data in a database suited for embedded devices e.g. https://www.sqlite.org/whentouse.html instead of in a flat file and use the database API to retrieve the data? – toongeorges Sep 26 '17 at 12:47
  • @toongeorges thanks for the tip, but I don't take a DB into account for now. Maybe I can use the DB's algorithm to index the file for later queries. – jfly Sep 26 '17 at 12:53
  • Is the file changing - that is, are lines removed/added to it from time to time? If it is immutable, it makes it much easier to create optimisations. – Assafs Sep 26 '17 at 14:23
  • @Assafs it's immutable; you can preprocess it into any format as long as no data is lost. – jfly Sep 26 '17 at 14:30
  • @jfly, I wrote a simple algorithm to optimise the retrieve value by key function. If you think it's worthwhile, would you consider accepting it? – Assafs Sep 26 '17 at 14:52
  • When you say "don't want DB involved", what kind of DB are you mentioning? Because what you say makes sense for full-blown DBMS such as MariaDB, but it would seem like a file-only DB such as BoltDB or kyotocabinet would suit your use case well. – SirDarius Sep 26 '17 at 15:15
  • @SirDarius Yeah, these k/v tools are cool, but my purpose is so simple, only query from an immutable text file, so I want to KISS. – jfly Sep 26 '17 at 15:20

3 Answers

3

Data is immutable and keys are unique (as mentioned in the comments on the question).

A simple solution: write your own hashing code to map each key to a line number.

This means: skip the sorting, and instead write your data to the file in the order that your hashing algorithm dictates.

When a key is queried, you hash the key, get the specific line number, and then read the value.

In theory, you have an O(1) solution to your problem.


Ensure that the hashing algorithm has few collisions, but depending on your exact case, a few collisions should be fine. Example: 3 keys map to the same line number, so you write all three of them on the same line, and when any of the collided keys is searched, you read all 3 entries from that line and do a linear (aka O(3), aka constant-time in this case) search over that one line.
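
A minimal sketch of that idea in Java, in the spirit of the question's RandomAccessFile code. It assumes the PC-side preprocessing wrote the data as fixed-width slots of SLOT_SIZE bytes, with colliding entries sharing a slot separated by ';'; the constants, layout and class name are illustrative assumptions, not a finished implementation:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class HashedLookup {
    // Assumed layout produced on the PC: NUM_SLOTS fixed-width slots of
    // SLOT_SIZE bytes each, padded with spaces; the slot index of a key is
    // (key.hashCode() & 0x7fffffff) % NUM_SLOTS.
    static final int SLOT_SIZE = 64;
    static final int NUM_SLOTS = 131072;

    public static String findValue(RandomAccessFile raf, String key) throws IOException {
        int slot = (key.hashCode() & 0x7fffffff) % NUM_SLOTS;
        raf.seek((long) slot * SLOT_SIZE);

        byte[] buf = new byte[SLOT_SIZE];
        raf.readFully(buf);
        String line = new String(buf, StandardCharsets.UTF_8).trim();

        // A slot may hold a few colliding entries: "k1,v1;k2,v2"
        for (String entry : line.split(";")) {
            String[] parts = entry.split(",", 2);
            if (parts.length == 2 && parts[0].equals(key))
                return parts[1];
        }
        return null; // key not present
    }
}

The fixed-width slots waste a little space, but they are what make the hash-to-byte-offset mapping possible without keeping any index in memory on the device.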

displayName
  • Yeah, that's what I thought of before: hash into the file just like `HashMap` does in memory. I googled it, but all the results were about hashing whole files; still, this method must already be in use by others. – jfly Sep 26 '17 at 15:16
  • @jfly: I didn't google your problem - it was only intuitive to me. Now, instead of putting the binary search code in your embedded device, you would have to write the hash based searching code. File should be the same size because data in the file is unchanged. And you clearly cannot do better than O(1) in time and space as is the case with this hash based solution. – displayName Sep 26 '17 at 15:21
  • Yeah, this reminds me of the hash-table collision handling I studied in school. Time flies! – jfly Sep 26 '17 at 16:16
2

An easy algorithm to optimise performance for your specific constraints:

  1. Let n be the number of lines in the original, immutable, sorted file.
  2. Let k < n be a number (we'll discuss the ideal value later).
  3. Divide the file into k files with approximately the same number of lines in each (so each file has n/k lines). The files will be referred to as F1...Fk. If you prefer to keep the original file intact, just treat F1...Fk as line ranges within the file, cutting it into segments.
  4. Create a new file called P with k lines, where line i holds the first key of Fi.
  5. When looking for a key, first binary search over P in O(log k) to find which file/segment (F1...Fk) you need to go to, then go to that file/segment and search within it.
  6. If k is big enough, the size of Fi (n/k lines) will be small enough to load into a HashMap and retrieve the value in O(1). If that is still not practical, do a binary search in O(log(n/k)).

The total search will be O(log k) + O(log(n/k)); in total that is still O(log n), but it needs far fewer seeks into the big file than your original solution, and it drops to O(log k) + O(1) whenever Fi fits in a HashMap.

I would suggest finding a k that is big enough to let you load a specific Fi file/segment into a HashMap, yet not so big that P itself takes up too much space on your device. The most balanced k is sqrt(n) (for n = 100,000 that is roughly 316 segments of about 316 lines each), which makes the solution run in O(log(sqrt(n))); for much larger n the P file may itself get big. If you can find a k that allows you to load both P and Fi into HashMaps for O(1) retrieval, that would be the best solution.
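
A rough Java sketch of steps 4–6, assuming the preprocessing step also produced a small offsets.txt holding the starting byte offset of each segment; the file names and formats here are made-up illustrations:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class TwoLevelLookup {
    public static String find(String key) throws IOException {
        // index.txt is P: one "first key" per line, in segment order.
        // offsets.txt: byte offset where each segment starts in the sorted data file.
        List<String> firstKeys = Files.readAllLines(Paths.get("index.txt"), StandardCharsets.UTF_8);
        List<String> offsets = Files.readAllLines(Paths.get("offsets.txt"), StandardCharsets.UTF_8);

        // Binary search P: the last segment whose first key is <= the search key.
        int lo = 0, hi = firstKeys.size() - 1, seg = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstKeys.get(mid).compareTo(key) <= 0) { seg = mid; lo = mid + 1; }
            else hi = mid - 1;
        }

        // Read only that segment of the data file (or load it into a HashMap).
        long start = Long.parseLong(offsets.get(seg).trim());
        long end = seg + 1 < offsets.size() ? Long.parseLong(offsets.get(seg + 1).trim()) : Long.MAX_VALUE;
        try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r")) {
            raf.seek(start);
            String line;
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts[0].equals(key))
                    return parts[1];
            }
        }
        return null; // not found
    }
}

Only P (and at most one segment at a time) ever needs to be in memory, which is what keeps the footprint small on the device.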

Assafs
1

What about this?

#include <iostream>
#include <fstream>
#include <string>
#include <boost/algorithm/string.hpp>
#include <vector>

using namespace std;

int main(int argc, char *argv[])
{
    ifstream f(argv[1],ios::ate);
    if (!f.is_open())
        return 0;
    string key(argv[2]),value;

    int max = f.tellg(); // opened with ios::ate, so tellg() gives the file size
    int min = 0,mid = 0;
    string s;
    while(max-min>1)
    {
        mid = min + (max - min )/2;
        f.seekg(mid);
        f >> s; // probably lands mid-line, so this token is only a partial line
        std::vector<std::string> strs;

        if (!f)
        {
            break;
        }
        if (mid)
        {
            f >> s; // read the next, complete "key,value" token
        }
        boost::split(strs, s, boost::is_any_of(","));
        int comp = key.compare(strs[0]);
        if ( comp < 0)
        {
            max = mid;
        }
        else if (comp > 0)
        {
            min = mid;
        }
        else
        {
            value = strs[1];
            break;
        }
    }

    // If the loop ended without an exact hit (which also happens when the key
    // is on the very first line, since the partial-line skip above can never
    // return it), scan the few remaining lines starting from min.
    if (value.empty())
    {
        f.clear();
        f.seekg(min);
        if (min)
        {
            f >> s; // skip the partial line at min
        }
        while (f >> s)
        {
            std::vector<std::string> strs;
            boost::split(strs, s, boost::is_any_of(","));
            int comp = key.compare(strs[0]);
            if (comp == 0)
            {
                value = strs[1];
                break;
            }
            if (comp < 0)
            {
                break; // passed where the key would be: it is not present
            }
        }
    }

    cout<<"key "<<key;
    if (!value.empty())
    {
        cout<<" found! value = "<<value<<endl;
    }
    else
    {
        cout<<" not found..."<<endl;
    }

    f.close();
    return 0;
}