The problem I'm working on requires processing several queries on an array (the size of the array is less than 10^4, and the largest element is certainly less than 10^9).
A query consists of two integers, and one must find the total number of subarrays that contain these two integers an equal number of times. There may be up to 5 * 10^5 queries.
For instance, given the array [1, 2, 1] and the query 1 2, there are two subarrays with equal counts of 1 and 2, namely [1, 2] and [2, 1].
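To make the statement concrete, here is a minimal brute-force sketch (the function name is mine, purely for illustration) that reproduces the example above. Note that it also counts subarrays in which both values appear zero times as "equal"; the example doesn't disambiguate that edge case, so I'm assuming it counts.

```python
def equal_count_subarrays(arr, x, y):
    """Count subarrays of arr containing x and y equally often (O(n^2) brute force)."""
    total = 0
    n = len(arr)
    for i in range(n):
        cx = cy = 0  # occurrences of x and y in arr[i..j]
        for j in range(i, n):
            if arr[j] == x:
                cx += 1
            elif arr[j] == y:
                cy += 1
            if cx == cy:
                total += 1
    return total

# The example: [1, 2] and [2, 1] are the two qualifying subarrays.
print(equal_count_subarrays([1, 2, 1], 1, 2))  # → 2
```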
My initial approach was to use dynamic programming to construct a map such that memo[i][j] = the number of times the value i appears in the array up to index j. I would use this the same way one uses prefix sums, except that frequencies accumulate instead of sums.
Constructing this map takes O(n^2). For each query, I do O(1) processing per interval and increment the answer. This leads to a complexity of O((q + 1) * n(n - 1) / 2), where q is the number of queries; that is, O(q * n^2), and I also want to emphasize that daunting constant factor.
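For reference, the approach I just described might look like the sketch below (build_memo and answer_query are names I made up for this post). The build loops over every distinct value at every index, hence O(n^2) in the worst case, and each query then scans all O(n^2) intervals with an O(1) prefix-count check per interval.

```python
def build_memo(arr):
    """memo[v][j] = occurrences of v in arr[0..j-1]; O(n * distinct) to build."""
    n = len(arr)
    memo = {v: [0] * (n + 1) for v in set(arr)}
    for j, a in enumerate(arr):
        for v in memo:
            memo[v][j + 1] = memo[v][j] + (1 if a == v else 0)
    return memo

def answer_query(memo, n, x, y):
    """O(n^2) scan over all subarrays, O(1) count comparison per subarray."""
    zeros = [0] * (n + 1)  # in case x or y never occurs in the array
    cx = memo.get(x, zeros)
    cy = memo.get(y, zeros)
    total = 0
    for l in range(n):
        for r in range(l + 1, n + 1):  # subarray arr[l..r-1]
            if cx[r] - cx[l] == cy[r] - cy[l]:
                total += 1
    return total

arr = [1, 2, 1]
memo = build_memo(arr)
print(answer_query(memo, len(arr), 1, 2))  # → 2
```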
After some rearranging, I'm trying to figure out whether there's a way to determine the frequency of each element in every subarray. I strongly suspect this problem involves segment trees, but I've struggled to come up with a proper model, and this was the only idea I had. However, that approach doesn't seem useful here, considering the cost of merging nodes that hold such a large amount of information, not to mention the memory overhead.
How can this be solved efficiently?