Counting merge sort with three permutations of 1, 2, ..., n, working in O(n log n)

Question

I am looking for an algorithm running in O(n log n), based probably on merge sort with counting. Which will give the number of such pairs in 3 strings(being permutations of the string 1, 2, 3, ..., n) that one number of the pair follows the other number of the pair(infinitely consecutive) in each of these 3 strings in the same order.

For example, for the strings:

1 2 4 3

4 1 2 3

4 3 1 2

There are only two pairs(1 and 2, 4 and 3) so the algorithm should return 2 because these are the only pairs that occur in the same order of succession in each of these 3 strings

And for example, for the strings:

1 2 3 4 5

4 2 1 5 3

3 4 5 2 1

It should return one because there is only one such pair 4 and 5

It is easy to solve this problem in O(n^2) complexity. But I am looking for a faster solution. I tried to find some dependency according to how many changes it takes to sort a given list, but with poor results

I'm giving my O(n^2) solution below:

// O(n ^ 2)

#include <iostream>
#include <vector>

using namespace std;

int main()
{
        int n;
        cin >> n;
        
        vector<int> ax(n);
        for (int i = 0; i < n; i++)
        {
            int tmp;
            cin >> tmp;
            tmp--;
            ax[tmp] = i;
        }
        
        vector<int> bx(n);
        for (int i = 0; i < n; i++)
        {
            int tmp;
            cin >> tmp;
            tmp--;
            bx[tmp] = i;
        }
        
        vector<int> cx(n);
        for (int i = 0; i < n; i++)
        {
            int tmp;
            cin >> tmp;
            tmp--;
            cx[tmp] = i;
        }
        
        long long r = 0;
        
        for (int i = 0; i < n - 1; i++)
            for (int j = i + 1; j < n; j++)
                r += (ax[i] < ax[j]) == (bx[i] < bx[j]) && (bx[i] < bx[j]) == (cx[i] < cx[j]);
                
        cout << r << "\n";
    return 0;
}

I think this counting of changes in list may help:

#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

typedef long long ll;

int main()
{
    int n;
    cin >> n;
    vector<int> a(n); 
    vector<int> t(n);
    for (int i = 0; i < n; i++)
    {
        cin >> a[i];
        a[i]--;
        t[a[i]] = i;
    }


    ll result = 0;
    int l = 1;
    while (l < n) // O(log n)
    {
        vector<int> p((n + l - 1) / 2 / l); 
        for (int i = 0; i < n; i++)
        {
            if (t[i] / l % 2 == 0 && t[i] / l / 2 < p.size())
                p[t[i] / l / 2]++;
            if (t[i] / l % 2 == 1)
                result += p[t[i] / l / 2];
        }

        l *= 2;
    }

    cout << result << "\n";

    return 0;
}

Why did you say *'based probably on merge sort"*? Did you get some hint indicating that or so? And why is O(n log n) the goal? — Kelly Bundy, Apr 14 '23 at 22:20
Yes I got a hint that the solution should use merge sorta. And O(n log n) because i think faster solution is impossible, but if you have any fast solutions be free to answer — Leon, Apr 16 '23 at 10:04
I actually meant the opposite: what makes you think that O(n log n) is possible? — Kelly Bundy, Apr 16 '23 at 10:30
Merge sort time complexity is O(n log n) and now when i tested yours O(n log^2 n) solutions it's still says it's too slow, but it's possible that my O(n log^2 n) solution is not the most optimal or should be something in between — Leon, Apr 16 '23 at 10:53
Your implementation doesn't look like O(n log² n) but like O(n²). I suspect a well-done actual O(n log² n) C++ solution will be fast enough. — Kelly Bundy, Apr 16 '23 at 11:09
Yes, sorry my implementation was actually O(n^2), now i'm trying to correct it — Leon, Apr 16 '23 at 11:33

Matt Timmermans · Accepted Answer · 2023-04-16T20:33:04.647

You can solve this in O(N log² N) by dividing the first list in half to create a sublist of first-half numbers and a sublist of second-half numbers. Then:

Separate the other two lists as well into first-half and second-half versions, where the first-half version contains the first-half numbers in the same order, and the second-half version contains the second-half numbers in the same order.
Count all the valid pairings of a first-half number with a second-half number. You can do this in O(N log N) time.
Recurse separately on the first-half numbers and second-half numbers to count the pairings within each sublist.

All that remains, then, is to figure out how to do step (2). To aid in visualization, I will refer each number as a point, its position in list 2 as "x", and its position in list 3 as "y".

So, to find valid pairs of first-half and second-half points, first sort the points in each half by y, and initialize an empty order statistic tree that sorts points by x.

Then iterate through all the points in y order (i.e., the original list 3):

For each first-half point, add it to the order statistic tree; and
For each second-half point, retrieve the number of points in the tree with smaller x, and add it to the total.

Note that for each bunch of points you count, they all occur before the second-half point in list1, because they're first-half points. They all occur before the second-half point in list3, because the came earlier in y-order. And they all occur before the second-half point in list 2, because that's what we used the order statistic tree for.

Here's a somewhat-optimized C++ implementation with tests + timings:

#include <iostream>
#include <vector>
#include <stdint.h>
#include <algorithm>
#include <chrono>

void partitionSection(std::vector<int>& A, size_t pos, size_t len, int splitVal, std::vector<int>& scratch) {
    scratch.clear();
    size_t d = 0;
    for (size_t i = 0; i < len; ++i) {
        int v = A[pos + i];
        if (v >= splitVal) {
            scratch.push_back(v);
        }
        else {
            A[pos + d] = v;
            ++d;
        }
    }
    size_t s = 0;
    for (;d < len; ++d, ++s) {
        A[pos + d] = scratch[s];
    }
}


uint64_t pairsIn3_helper(
    std::vector<int>& X, std::vector<int>& Y,
    std::vector<int> scratch,
    const int pos, const int len
) {
    if (len < 2) {
        return 0;
    }
    const int mid = len >> 1;
    uint64_t count = 0;

    //Use scratch as a compact order statistic tree.
    //It stores the number of items in each slot, followed by the
    //number of items in each pair of slots, etc.
    int levelStart = 0, levelLen = len;
    while (levelLen > 0) {
        levelStart += levelLen;
        levelLen >>= 1;
    }
    scratch.assign(levelStart + len, 0);
    const size_t xmapStart = levelStart;

    // Add a lookup for x coord by z coord
    for (int x = 0;x < len;++x) {
        scratch[xmapStart + X[x + pos] - pos] = x;
    }

    for (int y = 0;y < len;++y) {
        int z = Y[y + pos] - pos;
        int x = scratch[xmapStart + z];
        if (z < mid) {
            // add val to the tree
            levelStart = 0;
            levelLen = len;
            while (levelLen > x) {
                scratch[levelStart + x] += 1;
                x >>= 1;
                levelStart += levelLen;
                levelLen >>= 1;
            }
        }
        else {
            // count values < x
            levelStart = 0;
            levelLen = len;
            while (levelLen) {
                if (x & 1) {
                    count += scratch[levelStart + x - 1];
                }
                x >>= 1;
                levelStart += levelLen;
                levelLen >>= 1;
            }
        }
    }
    partitionSection(X, pos, len, pos + mid, scratch);
    partitionSection(Y, pos, len, pos + mid, scratch);
    count += pairsIn3_helper(X, Y, scratch, pos, mid);
    count += pairsIn3_helper(X, Y, scratch, pos + mid, len - mid);
    return count;
}

uint64_t pairsIn3(const std::vector<int>& X, const std::vector<int>& Y, std::vector<int>& Z) {
    if (Z.size() < 2) {
        return 0;
    }
    std::vector<int> scratch(Z.size() + 2, 0);

    // replace each integer with its Z-position
    for (size_t i = 0; i < Z.size(); ++i) {
        scratch[Z[i]] = (int)i;
    }
    std::vector<int> newX, newY;
    for (size_t i = 0; i < Z.size(); ++i) {
        newX.push_back(scratch[X[i]]);
        newY.push_back(scratch[Y[i]]);
    }
    return pairsIn3_helper(newX, newY, scratch, 0, Z.size());
}

uint64_t slow(const std::vector<int>& X, const std::vector<int>& Y, std::vector<int>& Z) {
    uint64_t count = 0;
    for (int z2 = 1; z2 < Z.size(); ++z2) {
        int val2 = Z[z2];
        size_t x2 = std::find(X.begin(), X.end(), val2) - X.begin();
        size_t y2 = std::find(Y.begin(), Y.end(), val2) - Y.begin();
        for (int z1 = 0; z1 < z2; ++z1) {
            int val1 = Z[z1];
            size_t x1 = std::find(X.begin(), X.end(), val1) - X.begin();
            size_t y1 = std::find(Y.begin(), Y.end(), val1) - Y.begin();
            if (x1 < x2 && y1 < y2) {
                ++count;
            }
        }
    }
    return count;
}


int main()
{
    auto sizes = std::vector<int>({ 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000, 2000000, 3000000});
    for (auto N : sizes) {
        std::cout << "N=" << N << std::endl;
        std::vector<int> X, Y, Z;
        for (int i = 1; i <= N; ++i) {
            X.push_back(i);
            Y.push_back(i);
            Z.push_back(i);
        }
        std::random_shuffle(X.begin(), X.end());
        std::random_shuffle(Y.begin(), Y.end());
        std::random_shuffle(Z.begin(), Z.end());
        if (N < 10000) {
            auto t0 = std::chrono::high_resolution_clock::now();
            auto slowCount = slow(X, Y, Z);
            auto t1 = std::chrono::high_resolution_clock::now();
            auto slowMs = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() * 0.001;
            std::cout << "slow impl found " << slowCount << " in " << slowMs << " ms" << std::endl;
        }
        auto t2 = std::chrono::high_resolution_clock::now();
        auto fastCount = pairsIn3(X, Y, Z);
        auto t3 = std::chrono::high_resolution_clock::now();
        auto fastMs = std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count() * 0.001;
        std::cout << "fast impl found " << fastCount << " in " << fastMs << " ms" << std::endl;
    }
}

And results on a Ryzen 5600x. For lists of 200,000 elements, it takes 137ms:

N=1000
slow impl found 121406 in 129.437 ms
fast impl found 121406 in 0.414 ms
N=2000
slow impl found 505122 in 939.681 ms
fast impl found 505122 in 1.177 ms
N=5000
slow impl found 3205125 in 13789.7 ms
fast impl found 3205125 in 2.544 ms
N=10000
fast impl found 12642459 in 5.007 ms
N=20000
fast impl found 50166685 in 10.621 ms
N=50000
fast impl found 313241083 in 29.631 ms
N=100000
fast impl found 1252007335 in 63.935 ms
N=200000
fast impl found 5013227960 in 137.03 ms
N=500000
fast impl found 31346510805 in 383.266 ms
N=1000000
fast impl found 124806065267 in 826.671 ms
N=2000000
fast impl found 499940410615 in 1782.94 ms
N=3000000
fast impl found 1124759267734 in 2924.1 ms

Aside: I'm very happy to be able to provide this answer to your question, because only a few days ago, I provided essentially the same answer to a completely different question, and had to delete it when I realized that it was completely inappropriate! Strange coincidence. — Matt Timmermans, Apr 14 '23 at 22:53
Could you please help me understand why we need the order statistic tree? If I understand correctly, for step (2), we're trying to match for each element that appears in all three lists in the second-half, all elements that appear in all three lists in the first-half. Why can't we just count the number of elements that appear in all three lists on the first-half, and the number of elements that appear in all three lists on the second-half, and multiply those two counts? — גלעד ברקן, Apr 15 '23 at 11:54
@גלעדברקן Lets call the first list position z. In step 2, we're matching high-z points against low-z points, but there is no restriction on x or y. So a first-half number A might occurs at positions (9,10,1), while a second-half number B occurs at (10,11,12). (A,B) is a valid result, even though is both are in the second half of lists 2 and 3. — Matt Timmermans, Apr 15 '23 at 12:08
If we do step 2 in quadratic instead of linearithmic time (like I did on my answer's `list` version), the overall runtime is quadratic, right? — Kelly Bundy, Apr 15 '23 at 12:11
@KellyBundy Yes... but there are easier ways to get quadratic time. If you don't want to bother with an order statistic tree, you can do the subdivision trick again, this time reducing 2 dimensions to 1. — Matt Timmermans, Apr 15 '23 at 12:13
@MattTimmermans Quadratic yes, but also ***fast*** quadratic? At least in Python, that might not be so easy. Note the speed difference between the two quadratic solutions I have. — Kelly Bundy, Apr 15 '23 at 12:19
I implemented and tested both solutions in C++, but unfortunately both are too slow: for list lengths up to 200,000 , they work in a little over 5s, and I still need a faster solution, like i said probably working in O(n log n) time — Leon, Apr 16 '23 at 10:00
@Leon Barely faster than my non-optimized Python implementation? That seems odd. Maybe you didn't implement it well? I have some optimization ideas, but I doubt you actually need them. What's the goal, how fast do you need it? — Kelly Bundy, Apr 16 '23 at 10:25
I don't know how fast it should work, but i know it's too slow. I implemented this solution in few different ways in c++, i think the most optimal is this because i used Sorted List: https://gist.github.com/l299l/e20fc17603937591a72f13c971532b32 — Leon, Apr 16 '23 at 10:43
I have private tester, from some people, but i can't set it public and even i don't know tests or what is time that will pass tests, only think i know is the tests is up to 200,000 — Leon, Apr 16 '23 at 10:49
@Leon You made `left` a vector, so `find` takes linear time. Use a hash set instead. And probably better use a hash map for `findx` instead of a map. — Kelly Bundy, Apr 16 '23 at 10:52
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253176/discussion-between-leon-and-kelly-bundy). — Leon, Apr 16 '23 at 10:53

Kelly Bundy · Answer 2 · 2023-04-15T13:36:04.380

A Python implementation of Matt's answer. The one using SortedList. The one using list is O(n²) instead of O(n log² n) because insertions take linear time, but it's still fast in practice and doesn't use a third-party package.

from itertools import combinations
import random
from sortedcollections import SortedList
from time import perf_counter as time
from bisect import bisect 

def naive(A, B, C):
    return len(set.intersection(*(
        set(combinations(s, 2))
        for s in [A, B, C]
    )))


def Matt_SortedList(A, B, C):
    if len(A) < 2:
        return 0
    m = len(A) // 2
    left = set(A[:m])
    findx = {b: x for x, b in enumerate(B)}
    xs = SortedList()
    count = 0
    for y, c in enumerate(C):
        x = findx[c]
        if c in left:
            xs.add(x)
        else:
            count += xs.bisect(x)
    count += Matt_SortedList(
        A[:m],
        [b for b in B if b in left],
        [c for c in C if c in left]
    )
    count += Matt_SortedList(
        A[m:], 
        [b for b in B if b not in left],
        [c for c in C if c not in left]
    )
    return count


def Matt_list(A, B, C):
    if len(A) < 2:
        return 0
    m = len(A) // 2
    left = set(A[:m])
    findx = {b: x for x, b in enumerate(B)}
    xs = []
    count = 0
    for y, c in enumerate(C):
        x = findx[c]
        i = bisect(xs, x)
        if c in left:
            xs.insert(i, x)
        else:
            count += i
    count += Matt_list(
        A[:m],
        [b for b in B if b in left],
        [c for c in C if c in left]
    )
    count += Matt_list(
        A[m:], 
        [b for b in B if b not in left],
        [c for c in C if c not in left]
    )
    return count


funcs = naive, Matt_SortedList, Matt_list

n = 1000
for _ in range(3):
    A, B, C = (
        random.sample(range(1, n+1), n)
        for _ in range(3)
    )
    for f in funcs:
        t = time()

        print(f(A, B, C), f'{(time()-t)*1e3:5.1f} ms ', f.__name__)
    print()

Outputs and times of three tests with random lists of length 1,000:

129600 780.8 ms  naive
129600  22.7 ms  Matt_SortedList
129600  14.2 ms  Matt_list

122571 741.4 ms  naive
122571  22.1 ms  Matt_SortedList
122571  13.7 ms  Matt_list

120176 698.5 ms  naive
120176  22.3 ms  Matt_SortedList
120176  13.8 ms  Matt_list

The fast solutions on lengths from 100,000 to 1,000,000:

100,000:
1252085208 4321.7 ms  Matt_SortedList
1252085208 2970.1 ms  Matt_list

200,000:
4988226746 10.18 s  Matt_SortedList
4988226746  7.80 s  Matt_list

500,000:
31262040091 28.40 s  Matt_SortedList
31262040091 27.09 s  Matt_list

1,000,000:
125081735840 63.35 s  Matt_SortedList
125081735840 80.81 s  Matt_list

@MattTimmermans `insert` takes linear time, `add` takes logarithmic time. Largest I tried was 500,000 and the `list` version was still slightly faster there (28.5 vs 29.5 seconds). While `insert` is linear, it has a very small constant factor, since that's essentially just a memmove. So apparently at these sizes, the n log² n part of the runtime still dominates the runtime, shadowing the n² part. — Kelly Bundy, Apr 15 '23 at 13:15
@MattTimmermans I added times for lengths up to 1,000,000 now, where you can see the `list` version's quadratic runtime beginning to show. — Kelly Bundy, Apr 15 '23 at 13:37

score 1 · Answer 3 · answered Apr 16 '23 at 20:41

It's other O(n log^2 n) solution. This uses segment trees. I have tested answers with O(n^2) solution, it works.

#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

class tree
{
  public:
    explicit tree(int n)
    {
        this->n = n;

        int size = 1;
        while (size < n)
            size *= 2;
        size = size * 2 - 1;
        t.resize(size);
    }

    void add(int i)
    {
        i = leaf(i);

        while (i != 0)
        {
            t[i]++;
            i = parent(i);
        }
        t[0]++;
    }
    int get(int i)
    {
        if (i == -1)
            return 0;

        if (i == n - 1)
            return t[0];

        int result = 0;

        int r = leaf(i) + 1;

        while (r > 0)
        {
            if (r % 2 == 0)
                result += t[r - 1];

            r = parent(r);
        }

        return result;
    }

  private:
    int n;
    vector<int> t;

    int leaf(int i) const
    {
        return i + (t.size() + 1) / 2 - 1;
    }
    static int parent(int i)
    {
        return (i - 1) / 2;
    }
};

typedef long long ll;

vector<int> ct;

bool op(int a, int b)
{
    return ct[a] < ct[b];
}

int main()
{
    int n;
    cin >> n;

    vector<int> a(n);
    vector<int> at(n);
    for (int i = 0; i < n; i++)
    {
        cin >> a[i];
        a[i]--;
        at[a[i]] = i;
    }
    vector<int> b(n);
    for (int i = 0; i < n; i++)
    {
        cin >> b[i];
        b[i]--;
    }

    ct.clear();
    ct.resize(n);
    for (int i = 0; i < n; i++)
    {
        int h;
        cin >> h;
        h--;
        ct[h] = i;
    }

    ll result = 0;
    int level = 1;

    while (level < n)
    {
        vector<tree> p;
        p.reserve((n + level - 1) / 2 / level);
        for (int i = 0; i < (n + level - 1) / 2 / level; i++)
            p.emplace_back(level);

        vector<int> m(n);
        for (int i = 0; i < p.size(); i++)
        {
            int l = i * level * 2;
            int r = i * level * 2 + level;

            while (r < i * level * 2 + level * 2 && r < n)
            {
                if (!op(a[l], a[r]) || l >= i * level * 2 + level)
                {
                    m[r] = l - 1 >= i * level * 2 ? (l - 1) % (level * 2) : -1;
                    r++;
                }
                else
                    l++;
            }
        }

        for (int i = 0; i < n; i++)
        {
            int x = b[i];

            if (at[x] / level % 2 == 0 && at[x] / level / 2 < p.size())
            {
                p[at[x] / level / 2].add(at[x] % (level * 2));
            }

            if (at[x] / level % 2 == 1)
            {
                int h = m[at[x]];
                if (h != -1)
                    h %= level;

                result += p[at[x] / level / 2].get(h);
            }
        }

        for (int i = 0; i < p.size(); i++)
            sort(a.begin() + i * level * 2, min(a.begin() + i * level * 2 + level * 2, a.end()), op);
        for (int i = 0; i < n; i++)
            at[a[i]] = i;

        level *= 2;
    }

    cout << result << "\n";

    return 0;
}

Counting merge sort with three permutations of 1, 2, ..., n, working in O(n log n)

3 Answers3