
I have a text file containing gigabytes of integer triples:

357, 1325, 7085
448, 952, 1073
459, 555, 2091
756, 765, 925
765, 925, 3485
792, 1560, 3315
952, 1073, 1105
975, 1073, 1105
990, 1950, 2146

My task is to find quadruples (a, b, c, d) out of two triples (a, b, c) and (b, c, d). In other words, we need to find lines in that text file where the last two elements of one line are the first two elements of another line (occurring later in that text file). In the example above, such a quadruple would be:

448, 952, 1073, 1105

The quadruples found need to be written into another file. The following Python code does the job and works correctly:

import sys
with open(sys.argv[1], 'r', encoding='utf-8') as f:
    m = {}  # maps (b, c) to all triples ending in (b, c)
    a = []  # all triples, in file order
    for line in f:
        if not line.strip():
            continue
        ns = list(map(int, line.split(',')))
        a.append(ns)
        key = tuple(ns[1:])
        if key not in m:
            m[key] = []
        m[key].append(ns)
    # For each triple (b, c, d), emit a quadruple for every triple (a, b, c).
    for ns in a:
        for e2 in m.get(tuple(ns[:-1]), []):
            print(', '.join(str(e) for e in (e2 + ns[-1:])))
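
The quadruples end up in a second file by redirecting stdout, e.g. `python3 quadruples.py input.txt > quadruples.txt` (the script name here is just an example).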

But for larger files it does not scale. Currently I have to process a file that is 14 GB large. How can we speed up that algorithm? Switching the programming language would be an option (C++, for example, has proven to be highly performant). If it makes sense to switch, I would be very grateful for a corresponding code snippet.

  • any chance to read this into a MySQL (or other database) table, index it and then query (can be multi-step)? – Roger Feb 02 '22 at 13:46
  • Unfortunately not, since the data generation is part of a math/data science pipeline that has no access to relational databases. – Eldar Sultanow Feb 02 '22 at 13:47
  • Yes - absolutely correct. – Eldar Sultanow Feb 02 '22 at 13:49
  • sort by the second column and the matches will be consecutive. –  Feb 02 '22 at 13:52
  • Is this something you need to do multiple times per file or just once per file? – klutt Feb 02 '22 at 13:52
  • Yes correct - it is one large file as input and we need one (not so large) file containing the resulting quadruples as output. – Eldar Sultanow Feb 02 '22 at 13:56
  • @SembeiNorimaki I thought so too, but that's wrong – klutt Feb 02 '22 at 14:37
  • In what range are the numbers? Is it possible to preprocess the data in a different script? Like str to int in a script (that could be done in C++ if you want to) and then just read a long array of ints in Python – Finn Feb 02 '22 at 14:44
  • Where does the text file come from? Would it be possible to integrate this in the generation process? – klutt Feb 02 '22 at 15:13
  • These numbers are already large and they are still becoming larger, even up to `2^64` and more. – Eldar Sultanow Feb 02 '22 at 15:14
  • If you can run Python, you do have access to a relational database - sqlite3 is part of the standard library. An efficient approach is likely to let sqlite store the numbers in a more compact form and build an index (see the sketch after these comments). Edit: Probably the large numbers are a problem, sqlite only does up to 8-byte numbers. – Yann Vernier Feb 02 '22 at 15:19
  • Is the first column known to be sorted? – Yann Vernier Feb 03 '22 at 07:33
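
To illustrate the sqlite3 idea mentioned in the comments, here is a minimal sketch, assuming the values fit into SQLite's 8-byte integers; the table name and the in-memory database are made up for the example:

import sqlite3
import sys

# In-memory for illustration; use a file path for the real 14 GB input.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)')
with open(sys.argv[1], encoding='utf-8') as f:
    con.executemany('INSERT INTO t VALUES (?, ?, ?)',
                    ([int(x) for x in line.split(',')] for line in f if line.strip()))
# Index the (a, b) prefix so the self-join below can use it.
con.execute('CREATE INDEX idx_ab ON t (a, b)')
# Pair every triple (a, b, c) with every triple (b, c, d).
query = ('SELECT x.a, x.b, x.c, y.c '
         'FROM t AS x JOIN t AS y ON y.a = x.b AND y.b = x.c')
for quad in con.execute(query):
    print(', '.join(map(str, quad)))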

3 Answers


You don't need to convert to numbers; you can just keep them as strings. Then you get something like this. Could you check if it's faster?

lines = '''357, 1325, 7085
448, 952, 1073
459, 555, 2091
756, 765, 925
765, 925, 3485
792, 1560, 3315
952, 1073, 1105
975, 1073, 1105
990, 1950, 2146'''.split("\n")

dic = {}
for line in lines:
    new = line.split(", ")
    # Do this line's first two numbers match the last two of an earlier line?
    check = new[0] + ", " + new[1]
    if check in dic:
        print(dic[check] + ", " + new[2])

    # Remember this line under its last two numbers for later lines to match.
    # Note: if several lines end with the same pair, only the last one is kept.
    dic[new[1] + ", " + new[2]] = line
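
To run this on the actual file, one would stream the lines instead of embedding them; a sketch along the same lines, assuming the ", " separator is exact and taking input and output file names from the command line:

import sys

dic = {}
with open(sys.argv[1], encoding='utf-8') as f, \
     open(sys.argv[2], 'w', encoding='utf-8') as out:
    for line in f:
        line = line.strip()
        if not line:
            continue
        new = line.split(", ")
        check = new[0] + ", " + new[1]
        if check in dic:
            out.write(dic[check] + ", " + new[2] + "\n")
        dic[new[1] + ", " + new[2]] = line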
Nineteendo
  • Thank you for this answer - I will try this approach. But note: the file already has 424019514 lines and its size is 14 GB (it will take some time :-). – Eldar Sultanow Feb 02 '22 at 15:22

The following code should hopefully be very fast: apart from one initial sort, the algorithm is linear in the number of triples. It uses std::unordered_map to store and look up ranges of candidate tuples, and this map has O(1) average search time.

Run the code as ./program input_file output_file; if the arguments are not provided, then input.txt is used as input and output.txt as output.

Important note: at the start of main() I write an example text file. You should remove this file-writing block of code, otherwise it MAY overwrite an existing file (although I tried to make it so that an existing file is not overwritten)! This file is written only as an example, so that all visitors of StackOverflow can run the program straight away and see the results.

Try it online!

#include <cstdint>
#include <fstream>
#include <string>
#include <iostream>
#include <sstream>
#include <algorithm>
#include <vector>
#include <array>
#include <unordered_map>
#include <tuple>
#include <filesystem>
#include <cstdlib>
#include <chrono>
#include <cmath>

int main(int argc, char ** argv) {
    using u8 = uint8_t;
    using i64 = int64_t;
    using u64 = uint64_t;
    
    std::string fname(argc >= 2 ? argv[1] : "input.txt");
    
    {
        std::ifstream fin(fname);
        if (!fin.is_open()) {
            std::ofstream f(fname);
            std::string text = R"(
                357, 1325, 7085
                448, 952, 1073
                459, 555, 2091
                756, 765, 925
                765, 925, 3485
                792, 1560, 3315
                952, 1073, 1105
                975, 1073, 1105
                990, 1950, 2146
            )";
            f << text;
        }
    }
    
    std::ifstream f(fname);
    if (!f.is_open()) {
        std::cout << "Failed to open file '" << fname << "'." << std::endl;
        return -1;
    }
    auto const gtb = std::chrono::high_resolution_clock::now();
    auto Time = [gtb]() -> double {
        return std::llround(std::chrono::duration_cast<std::chrono::duration<double>>(
            std::chrono::high_resolution_clock::now() - gtb).count() * 1000.0) / 1000.0;
    };
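    // All parsed (a, b, c) triples from the input file.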
    std::vector<std::array<u64, 3>> v;
    double tb = 0;
    {
        u64 const file_size = std::filesystem::file_size(fname);
        std::string text(file_size, ' ');
        f.read((char*)text.data(), text.size());
        u64 prev = 0;
        tb = Time();
        for (size_t icycle = 0;; ++icycle) {
            if ((icycle & ((1ULL << 24) - 1)) == 0)
                std::cout << "read " << (icycle >> 20) << " M " << (Time() - tb) << " sec, " << std::flush;
            u64 next = text.find('\n', prev);
            if (next == std::string::npos)
                next = file_size;
            u64 const
                first_comma = text.find(',', prev),
                second_comma = text.find(',', first_comma + 1);
            v.push_back({});
            std::array<u64, 3> poss = {prev, first_comma + 1, second_comma + 1};
            for (size_t i = 0; i < 3; ++i) {
                char * pend = nullptr;
                auto const val = std::strtoll(text.c_str() + poss[i], &pend, 10);
                if (val == 0) {
                    v.pop_back();
                    break;
                }
                v.back()[i] = val;
            }
            if (next >= file_size)
                break;
            prev = next + 1;
        }
    }
    /*
    for (size_t i = 0;; ++i) {
        if (i % 100'000 == 0)
            std::cout << "Read Line " << i / 1'000 << " K, " << std::flush;
        std::string line;
        std::getline(f, line);
        std::stringstream ss;
        ss.str(line);
        std::array<u64, 3> a{};
        char comma = 0;
        ss >> a[0] >> comma >> a[1] >> comma >> a[2];
        if (!f)
            break;
        if (a[2] == 0)
            continue;
        v.push_back(a);
    }
    std::cout << std::endl;
    */
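    // Sort so that all triples sharing the same (a, b) prefix become contiguous.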
    std::sort(v.begin(), v.end(),
        [](auto const & x, auto const & y) -> bool {
            return x < y;
        });
    
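    // FNV-1a hash over the raw bytes of an (a, b) tuple, for the unordered_map below.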
    struct Hasher {
        static u64 FnvHash(void const * data, size_t size, u64 prev = u64(-1)) {
            // http://www.isthe.com/chongo/tech/comp/fnv/#FNV-param
            u64 constexpr
                fnv_prime = 1099511628211ULL,
                fnv_offset_basis = 14695981039346656037ULL;
            
            u64 hash = prev == u64(-1) ? fnv_offset_basis : prev;
            
            for (size_t i = 0; i < size; ++i) {
                hash ^= ((u8*)data)[i];
                hash *= fnv_prime;
            }
            
            return hash;
        }
        
        size_t operator () (std::tuple<u64, u64> const & x) const {
            return FnvHash(&x, sizeof(x));
            //auto const h0 = h_(std::get<0>(x)); return ((h0 << 13) | (h0 >> (sizeof(h0) * 8 - 13))) + h_(std::get<1>(x));
        }
    };
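    // Maps each (a, b) prefix to the half-open range [start, end) of rows in the sorted vector v.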
    std::unordered_map<std::tuple<u64, u64>, std::tuple<size_t, size_t>, Hasher> m;
    std::tuple<u64, u64> prev = std::make_tuple(v.at(0)[0], v.at(0)[1]);
    size_t start = 0;
    tb = Time();
    for (size_t i = 0; i < v.size(); ++i) {
        if ((i & ((1ULL << 24) - 1)) == 0)
            std::cout << "map " << (i >> 20) << " M " << (Time() - tb) << " sec, " << std::flush;
        auto const next = std::make_tuple(v[i][0], v[i][1]);
        if (prev == next)
            continue;
        m[prev] = std::make_tuple(start, i);
        prev = next;
        start = i;
    }
    m[prev] = std::make_tuple(start, v.size());
    std::ofstream fout(argc >= 3 ? argv[2] : "output.txt");
    size_t icycle = 0;
    tb = Time();
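    // For every triple (a, b, c), look up the rows starting with (b, c) and emit quadruples.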
    for (auto const & a: v) {
        if ((icycle & ((1ULL << 24) - 1)) == 0)
            std::cout << "find " << (icycle >> 20) << " M " << (Time() - tb) << " sec, " << std::flush;
        ++icycle;
        auto const it = m.find(std::make_tuple(a[1], a[2]));
        if (it == m.end())
            continue;
        for (size_t i = std::get<0>(it->second); i < std::get<1>(it->second); ++i)
            fout << a[0] << ", " << a[1] << ", " << a[2] << ", " << v[i][2] << std::endl;
    }
    return 0;
}

input.txt:

357, 1325, 7085
448, 952, 1073
459, 555, 2091
756, 765, 925
765, 925, 3485
792, 1560, 3315
952, 1073, 1105
975, 1073, 1105
990, 1950, 2146

output.txt:

448, 952, 1073, 1105
756, 765, 925, 3485
Arty
  • It also runs a risk of false positives and negatives due to overflows, as an OP comment noted the numbers may be larger than `u64` can hold. – Yann Vernier Feb 03 '22 at 07:22
  • @YannVernier Actually, it was I who wrote the original program that generated the tuples for the OP, and the OP created this question regarding further processing of those tuples. So I can confirm that the tuples contain no number above 2^36 for now, and will never exceed 2^48 for sure, because 2^48 computations would take ages. Probably it was a typo by the OP in his question that the tuples exceed 2^64; probably he wanted to say that the squares of those numbers exceed 2^64, which is true. Anyway, if we ever need to process tuples bigger than 2^64, then GCC/Clang have the `__int128` type (a 128-bit integer). – Arty Feb 03 '22 at 08:21

Maybe using numpy is faster?

from io import StringIO

import numpy as np

S = StringIO(
    """357, 1325, 7085
448, 952, 1073
459, 555, 2091
756, 765, 925
765, 925, 3485
792, 1560, 3315
952, 1073, 1105
975, 1073, 1105
990, 1950, 2146"""
)

s = np.loadtxt(S, delimiter=",", dtype=int)

for row in s:
    # Rows whose last two numbers equal this row's first two numbers.
    t = s[(s[:, 1:] == row[:2]).all(axis=1)]
    for match in t:
        print(*match, row[-1], sep=", ")
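
To process the real file, np.loadtxt can be given the file name instead of a StringIO object, e.g. `s = np.loadtxt("input.txt", delimiter=",", dtype=int)` (the file name is just an example). Note that the loop above still compares every row against all rows, so it does one full scan of the array per line.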
AsukaMinato