1

Do you have an idea how to optimize the following function while still using std::sort(). It sorts the suffixes of a text to create a suffix array. I think the problem is in the compare function as not much more can be done for the rest. compare is basically the lexicographical < operator for a string.

I added the function I use to test the time in the main() and the segment is now reproducible. I hope the comments make it easier to understand the logic. The function works fine but is too slow because of the multiple ifs. Construction time is now around 101 ms (on my CPU), but we aim at around 70.

// compile with: g++ -std=c++17 -Wall -pedantic -fsanitize=address  -O3 test.cpp -o test
#include <string>
#include <vector>
#include <iostream>
#include <chrono>
#include <algorithm>

/// Build suffix array from text.
/// Input: the text (might be empty)
/// Output: a suffix array (sorted). The variable 'sa' is cleared before it's filled and returned.
void construct(std::vector<uint32_t>& sa, const std::string& text) {

    sa.clear();
    unsigned tsize = text.size();

    for (unsigned i = 0; i < tsize; ++i) {
        sa.push_back(i);
    }

    // with this function we compare the suffix at position 'a' to the suffix at position 'b' in the text.
    // we do this letter by letter calling compare recursively
    std::function<bool(uint32_t, uint32_t)> compare = [&](uint32_t a, uint32_t b) {

        if (a>tsize) return true;                 //if we reach the end of the vector on a it means a<b
        else if (b>tsize) return false;            // if we reach the end on b it means b>a
        else if (text[a] < text[b]) return true;    // case a<b
        else if (text[a] == text[b]) return compare(a + 1, b + 1); //if a and b are the same call compare on the next letters in both suffixes
        else return false;                                  // only the case b>a is left

    };

    std::sort(sa.begin(), sa.end(), compare);
}

// main tests the construction speed
int main(){

    std::vector<uint32_t> sa;

    srand(0);
    std::string big(100000, ' '); // 'large' random text
    for (auto& c : big) c = rand() % 128;
    // time the construction
    auto begin = std::chrono::steady_clock::now();
    construct(sa, big);
    auto end = std::chrono::steady_clock::now();

    size_t time = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();

    std::cout << "Construction time: " << time << "\n";

}
5kobrat
  • 45
  • 9
  • I am having a hard time understanding your compare logic. What are the contents of `sa` and `text`? It will be best if you can post a [mcve], with the sample input and expected output. – R Sahu Nov 24 '20 at 19:15
  • first line of your comparison function should be `if (a>=tsize) return true;` - you're going off the end of the string. second line similar. But if you want any other help you're gonna have to provide more information - as suggested by R Sahu above. It isn't even clear to me whether you want to _optimize_ or _correct_ it as we have no input or output ... – davidbak Nov 24 '20 at 19:27
  • Add, just before filling the `sa` vector: `sa.reserve(tsize);` Otherwise, the internal memory of the vector must be relocated many times when calling `push_back()` – Ripi2 Nov 24 '20 at 20:09

2 Answers2

0

I haven't profiled it, but - depending on how clever your compiler is - I think a non-recursive formulation might be more efficient:

std::function<bool(uint32_t, uint32_t)> compare = [&](uint32_t a, uint32_t b) {
        do
        {
            if (a>=tsize) return true;
            else if (b>=tsize) return false;
            else if (text[a] < text[b]) return true;
            else if (text[a] > text[b]) return false;
            a += 1;
            b += 1;
        } while (true);
    };
0

The code in compare compares strings according to their characters, in lexicographical order. This is very similar to how numbers are compared - most significant bits/bytes, if different, force the result of the comparison, and if equal, defer the result to less significant bits/bytes. So, if your strings are usually long, you can compare 8 bytes at once, using uint64_t.

To do that, you must pad your strings (with zero-valued bytes), to prevent out-of-bounds reads (alternatively, do the 8-byte comparison only when far from the string's end, otherwise do the original 1-byte comparison). Also, if your system is little-endian (most likely), you have to reverse bytes in your uint64_t numbers.

anatolyg
  • 26,506
  • 9
  • 60
  • 134