Total numbers having frequency k in a given range

Question

How can I find how many distinct numbers occur exactly k times in a range (l, r) of a given array? There are 10^5 queries of the form l, r, and each query depends on the previous query's answer: before a query is evaluated, both l and r are shifted by the previous answer (modulo n), and l and r are swapped if l > r. Note that 0 <= a[i] <= 10^9, and the array has n = 10^5 elements.

My Attempt:

n,k,q = map(int,input().split())
a = list(map(int,input().split()))
ans = 0
for _ in range(q):
    l,r = map(int,input().split())
    l+=ans
    l%=n
    r+=ans
    r%=n
    if l>r:
        l,r = r,l
    d = {}
    for i in a[l:r+1]:
        try:
            d[i]+=1
        except KeyError:
            d[i] = 1
    curr_ans = 0
    for i in d.keys():
        if d[i]==k:
            curr_ans+=1
    ans = curr_ans
    print(ans)

Sample Input:
5 2 3
7 6 6 5 5
0 4
3 0
4 1

Sample Output:
2
1
1

Welcome to Stack Overflow. Please, include the code (or algorithm description) for your attempt, and explain what is the issue with it. Also, if you can add some example of input and expected output that would be really helpful to fully understand the problem you are trying to solve. — jdehesa, May 01 '19 at 11:51
Not a significant change but you can simplify the second half of your algorithm if you add `from collections import Counter` at the beginning and do `d = Counter(a[l:r+1]); ans = sum(1 for v in d.values() if v == k)` — jdehesa, May 01 '19 at 12:29
@jdehesa Thanks! But what you did is just code optimization. Any data structure or algorithm that can reduce the **Time complexity** will be of great help. — Mohan Singh, May 01 '19 at 12:40
@MohanSingh Yes, no, I understand that, it's just a simplification of the code (it's barely "optimization", except `Counter` may be a bit faster than a loop). I'm trying to think of a data structure supporting this but I'm not sure yet. — jdehesa, May 01 '19 at 12:43
Do you have an idea of how many unique numbers are there in the array? — jdehesa, May 01 '19 at 13:18
@jdehesa There is no mention of unique numbers in the question. But we can count them easily using `len(set(a))`. — Mohan Singh, May 01 '19 at 13:24
Btw in your code it seems `l` and `r` change according to the last `ans` value, is that right? That doesn't match with the example input and output I think (I assume `l` and `r` values are always relative to the beginning of the array?) — jdehesa, May 01 '19 at 13:29
No, they are perfectly correct. Initially `ans = 0`. For query 1: `l = (l+ans)%n = (0+0)%5 = 0, r = (r+ans)%n = (4+0)%5 = 4`, so now `ans = 2`. For query 2: `l = (l+ans)%n = (3+2)%5 = 0, r = (r+ans)%n = (0+2)%5 = 2`, so now `ans = 1`. For query 3: `l = (l+ans)%n = (4+1)%5 = 0, r = (r+ans)%n = (1+1)%5 = 2`, so now `ans = 1`. — Mohan Singh, May 01 '19 at 13:51

score 0 · Answer 1 · answered May 01 '19 at 14:28

If the number of distinct values in the array is not too large, you may consider storing a table of running counts: for each unique value, an array as long as the input (plus one leading zero entry) that holds how many times the value has appeared up to each position. For a query you then subtract the counts at the start of the range from the counts at its end, and count how many values have a difference of exactly k:

def range_freq_queries(seq, k, queries):
    n = len(seq)
    c = freq_counts(seq)
    result = [0] * len(queries)
    offset = 0
    for i, (l, r) in enumerate(queries):
        result[i] = range_freq_matches(c, offset, l, r, k, n)
        offset = result[i]
    return result

def freq_counts(seq):
    s = {v: i for i, v in enumerate(set(seq))}
    counts = [None] * (len(seq) + 1)
    counts[0] = [0] * len(s)
    for i, v in enumerate(seq, 1):
        counts[i] = list(counts[i - 1])
        j = s[v]
        counts[i][j] += 1
    return counts

def range_freq_matches(counts, offset, start, end, k, n):
    start, end = sorted(((start + offset) % n, (end + offset) % n))
    return sum(1 for cs, ce in zip(counts[start], counts[end + 1]) if ce - cs == k)

seq = [7, 6, 6, 5, 5]
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries(seq, k, queries))
# [2, 1, 1]

You can do it faster with NumPy, too. Since each result depends on the previous one, you have to loop over the queries in any case, but you can use Numba to speed that loop up considerably:

import numpy as np
import numba as nb

def range_freq_queries_np(seq, k, queries):
    seq = np.asarray(seq)
    c = freq_counts_np(seq)
    return _range_freq_queries_np_nb(seq, k, queries, c)

@nb.njit  # This is not necessary but will make things faster
def _range_freq_queries_np_nb(seq, k, queries, c):
    n = len(seq)
    offset = np.int32(0)
    out = np.empty(len(queries), dtype=np.int32)
    for i, (l, r) in enumerate(queries):
        l = (l + offset) % n
        r = (r + offset) % n
        l, r = min(l, r), max(l, r)
        out[i] = np.sum(c[r + 1] - c[l] == k)
        offset = out[i]
    return out

def freq_counts_np(seq):
    uniq = np.unique(seq)
    seq_pad = np.concatenate([[uniq.max() + 1], seq])
    comp = seq_pad[:, np.newaxis] == uniq
    return np.cumsum(comp, axis=0)

seq = np.array([7, 6, 6, 5, 5])
k = 2
queries = [(0, 4), (3, 0), (4, 1)]
print(range_freq_queries_np(seq, k, queries))
# [2 1 1]

Let's compare it with the original algorithm:

from collections import Counter

def range_freq_queries_orig(seq, k, queries):
    n = len(seq)
    ans = 0
    counter = Counter()
    out = [0] * len(queries)
    for i, (l, r) in enumerate(queries):
        l += ans
        l %= n
        r += ans
        r %= n
        if l > r:
            l, r = r, l
        counter.clear()
        counter.update(seq[l:r+1])
        ans = sum(1 for v in counter.values() if v == k)
        out[i] = ans
    return out

Here is a quick test and timing:

import random
import numpy as np

# Make random input
random.seed(0)
seq = random.choices(range(1000), k=5000)
queries = [(random.choice(range(len(seq))), random.choice(range(len(seq))))
           for _ in range(20000)]
k = 20
# Input as array for NumPy version
seq_arr = np.asarray(seq)
# Check all functions return the same result
res1 = range_freq_queries_orig(seq, k, queries)
res2 = range_freq_queries(seq, k, queries)
print(all(r1 == r2 for r1, r2 in zip(res1, res2)))
# True
res3 = range_freq_queries_np(seq_arr, k, queries)
print(all(r1 == r3 for r1, r3 in zip(res1, res3)))
# True

# Timings
%timeit range_freq_queries_orig(seq, k, queries)
# 3.07 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries(seq, k, queries)
# 1.1 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit range_freq_queries_np(seq_arr, k, queries)
# 265 ms ± 726 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Obviously the effectiveness of this depends on the characteristics of the data. In particular, the counts table has one row per position and one column per unique value, so if there are few repeated values (that is, many distinct ones) the time and memory cost of building it approaches O(n²).
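
As a rough illustration of that cost, the sketch below estimates the size of the counts table. The helper `counts_table_bytes` is hypothetical, and the 8 bytes per entry is an assumption (64-bit integer counts); the actual element size depends on the platform and the NumPy version.

```python
def counts_table_bytes(n, unique_values, itemsize=8):
    """Bytes needed for a table with (n + 1) rows and one column per unique value.

    itemsize=8 is an assumption (64-bit integer counts).
    """
    return (n + 1) * unique_values * itemsize

# Benchmark-sized input: 5000 elements, at most 1000 distinct values (about 40 MB)
print(counts_table_bytes(5000, 1000))
# Worst case from the question: n = 10^5 with every element distinct (about 80 GB)
print(counts_table_bytes(10**5, 10**5))
```

Under these assumptions the table is comfortable for the benchmark input but far too large when nearly all of the 10^5 elements are distinct.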

In `freq_counts_np`, you cannot make a 2D array `comp` because it will exceed the memory limits, as the maximum element can be 10^9 and the maximum n = 10^5. So a 2D array of size (10^5)*(10^9) will exceed the memory limits. — Mohan Singh, May 01 '19 at 21:18
@MohanSingh But the number of "columns" in the table will be equal to the number of **unique** values in the array, which cannot be larger than the array itself. In the worst case you would have a table with size (10^5)*(10^5). This is still too large but the assumption is there will be repeated values so there would not be as many unique numbers. — jdehesa, May 01 '19 at 22:12
@MohanSingh You could also make the table "by pieces", taking a maximum number of columns at a time and iterating over it. It takes less memory but more iterations. But if you really have very few repetitions it may not work well in any case. — jdehesa, May 01 '19 at 22:15
Yes, absolutely. That's why there should be another approach to this problem. Maybe using a segment tree or square-root decomposition. — Mohan Singh, May 01 '19 at 22:23
Using Segment Tree you can try keeping a map of frequency at every node. But for this also time complexity will be high when the values are distinct. Please correct me if I am wrong. — atishaya11, May 02 '19 at 08:01

Dave · Answer 2 · 2019-05-02T13:40:29.307

Let's say the input array is A, |A|=n. I'm going to assume that the number of distinct elements in A is much smaller than n.

We can divide A into sqrt(n) segments each of size sqrt(n). For each of these segments, we can calculate a map from element to count. Building these maps takes O(n) time.

With that preprocessing done, we can answer each query by adding together all the maps wholly contained in (l,r), of which there are at most sqrt(n), then adding any extra elements (or going one segment over and subtracting), also sqrt(n).

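If there are d distinct elements (d rather than k, to avoid confusion with the target frequency k in the question), combining the maps takes O(sqrt(n) * d) time per query. Each segment map holds at most sqrt(n) entries, so even if every element of A is distinct the cost is bounded by O(sqrt(n) * sqrt(n)) = O(n) per query in the worst case.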

You can keep track of the elements that have the desired count while combining the hashes and extra elements.
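
A minimal sketch of the decomposition described above, offered as an illustration, not as this answer's own code. The function names `build_blocks`, `count_with_freq` and `answer_queries` are hypothetical, the offset handling is copied from the question's code, and for simplicity the final count is taken after merging instead of being tracked incrementally.

```python
from collections import Counter
from math import isqrt

def build_blocks(seq):
    """Split seq into blocks of about sqrt(n) items and count each block."""
    size = max(1, isqrt(len(seq)))
    blocks = [Counter(seq[i:i + size]) for i in range(0, len(seq), size)]
    return size, blocks

def count_with_freq(seq, size, blocks, l, r, k):
    """Number of distinct values occurring exactly k times in seq[l..r], inclusive."""
    total = Counter()
    i = l
    while i <= r:
        if i % size == 0 and i + size - 1 <= r:
            # A whole block lies inside the range: merge its precomputed counts
            total.update(blocks[i // size])
            i += size
        else:
            # Leftover element at either end of the range
            total[seq[i]] += 1
            i += 1
    return sum(1 for c in total.values() if c == k)

def answer_queries(seq, k, queries):
    """Answer the queries in order, shifting each by the previous answer."""
    n = len(seq)
    size, blocks = build_blocks(seq)
    out, ans = [], 0
    for l, r in queries:
        l, r = sorted(((l + ans) % n, (r + ans) % n))
        ans = count_with_freq(seq, size, blocks, l, r, k)
        out.append(ans)
    return out

print(answer_queries([7, 6, 6, 5, 5], 2, [(0, 4), (3, 0), (4, 1)]))
# [2, 1, 1]
```

Each query touches at most about sqrt(n) whole blocks plus fewer than 2 * sqrt(n) leftover elements, which matches the bound discussed above.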
