Number of substrings of a string with given count of each character

Question

Given a string s and an integer k, we need to find the number of substrings in which all the different characters occur exactly k times.

Example: s = "aabbcc", k = 2 Output : 6

The substrings [aa, bb, cc, aabb, bbcc and aabbcc] each contain only characters with frequency exactly 2.

The approach I can think of is to traverse all substrings, maintaining the character frequencies of the current substring, and to increment the result whenever every character in it has frequency k. This has a worst-case complexity of O(n*n), where n is the length of the string s.
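For reference, the quadratic approach described above could be sketched in Python as follows. The function name `count_substrings_brute_force` and the early exit once a character exceeds k are illustrative choices, not part of the original question.

```python
from collections import defaultdict


def count_substrings_brute_force(s: str, k: int) -> int:
    """Count substrings in which every distinct character occurs exactly k times."""
    n = len(s)
    result = 0
    for start in range(n):
        freq = defaultdict(int)  # character counts for s[start..end]
        exactly_k = 0            # distinct characters whose count equals k
        for end in range(start, n):
            c = s[end]
            freq[c] += 1
            if freq[c] == k:
                exactly_k += 1
            elif freq[c] == k + 1:
                # c now occurs more than k times; extending further
                # from this start can never become valid again.
                break
            if exactly_k == len(freq):
                result += 1
    return result


print(count_substrings_brute_force("aabbcc", 2))  # 6, as in the question
```

Tracking `exactly_k` avoids rescanning the frequency table for every substring, so the whole thing stays at O(n*n) in the worst case.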

Is there any better approach for this problem?

What about this rule: "in which all the different characters occurs exactly k times."? In case of `s=abab` is `abab` valid for `k=2` because each character is there 2 times? Or same characters have to be next to each other? — libik, Aug 14 '20 at 13:20
Yes, "abab" will be one such valid sub string for s = "abab" and k=2. — codeplay_, Aug 14 '20 at 13:23

גלעד ברקן · Answer 1 · 2020-08-22T23:08:40.870

We can solve this in O(n * log(size_of_alphabet)). Let f(i) represent the number of valid substrings ending at the ith character. Then:

f(i) ->
  1 + f(j - 1)
  
where j is the rightmost index smaller
than or equal to i where s[j..i] is a
valid substring and (j - 1) is inside
the current window. Call s[j..i] the
"minimal" valid substring ending at
index i.

An invariant for our window is that if a character is seen k + 1 times, we move the left bound just past that character's leftmost instance in the window. This guarantees that any two substrings in a string of concatenated, valid substrings in the current window cannot have a shared character, and thus remain a valid concatenation.

Each time we reach the kth instance of a character c, the rightmost index j smaller than or equal to i for which s[j..i] is a valid substring must lie to the right of every character in the window whose count is less than k. To find the rightmost such index, we may also need to move past valid neighbouring substrings already seen in the window.

To find that index, we can maintain an indexed max-heap that stores the rightmost instance of each distinct character in our window whose count is currently less than k, prioritised by index, such that our j is always to the right of the heap's root (or the heap is empty). The heap is indexed, which allows us to remove specific elements in O(log(size_of_alphabet)).
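Python's standard `heapq` has no removal by key, so an indexed max-heap has to be written by hand. The following is only a minimal sketch of such a structure; the class name `IndexedMaxHeap` and its method names are hypothetical and not taken from the answer.

```python
class IndexedMaxHeap:
    """Max-heap of (priority, key) pairs with O(log n) removal by key."""

    def __init__(self):
        self._items = []  # binary heap stored as a list of (priority, key)
        self._pos = {}    # key -> current position in self._items

    def __len__(self):
        return len(self._items)

    def peek(self):
        """Return the (priority, key) pair with the largest priority."""
        return self._items[0]

    def _swap(self, a, b):
        items = self._items
        items[a], items[b] = items[b], items[a]
        self._pos[items[a][1]] = a
        self._pos[items[b][1]] = b

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._items[i][0] <= self._items[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._items)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._items[child][0] > self._items[largest][0]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def push(self, key, priority):
        """Insert key, replacing any existing entry for the same key."""
        if key in self._pos:
            self.remove(key)
        self._items.append((priority, key))
        self._pos[key] = len(self._items) - 1
        self._sift_up(len(self._items) - 1)

    def remove(self, key):
        """Remove the entry for key by moving the last item into its slot."""
        i = self._pos.pop(key)
        last = self._items.pop()
        if i < len(self._items):
            self._items[i] = last
            self._pos[last[1]] = i
            self._sift_up(i)
            self._sift_down(i)


heap = IndexedMaxHeap()
for index, ch in enumerate("acb"):
    heap.push(ch, index)   # key = character, priority = rightmost index
heap.remove("b")           # e.g. b reached its kth instance
print(heap.peek())         # (1, 'c')
```

Keying the heap by character keeps it at no more than size_of_alphabet entries, which is where the log(size_of_alphabet) factor comes from.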

We also keep the right and left boundary indexes of valid minimal substrings already seen in the window. We can use a double-ended queue for that, giving O(1) updates, since a valid substring can appear to the right of another or envelop existing ones. And we keep a hashmap of the left boundaries for O(1) lookup.

Additionally, we must keep a count of each distinct character in the window to maintain our invariant (no count above k), and each character's leftmost index in the window for the valid substring precondition.

Procedure:

for each index i in s:
  let c be the character s[i]
  
  if s[i] is the (k+1)th instance of c in the window:
    move the left bound of the window
    just past the leftmost instance of
    c in the window, removing all
    elements in the heap whose rightmost
    instance we passed while updating
    our window; and adding to the heap
    the rightmost instance of characters
    whose count has fallen below k
    as we move the left bound of
    the window. If the boundary moves
    past the left bound of valid minimal
    substrings, remove their boundaries
    from the queue, and their left bound
    from the hashmap.
    
  if s[i] is the kth instance of c:
    remove the previous instance of c
    from the heap.
    if the leftmost instance of c in the
    window is to the right of the heap
    root:
      if (root_index + 1) is the
      left bound of a valid minimal
      substring in our queue:
        we must be adding to the right
        of all of them, so add a new
        valid minimal substring, starting
        at the next index after the
        rightmost of those that ends
        at i
      otherwise:
        add a new valid minimal substring,
        starting at (root_index + 1)
        and ending at i
    
  otherwise:
    remove the previous instance of c
    in the heap and insert this one.

For example:

01234567
acbbaacc  k = 2

0 a  heap: (0 a)

1 c  heap: (1 c) <- (0 a)

2 b  heap: (2 b) <- (1 c) <- (0 a)

3 b  kth instance, remove (2 b)
     heap: (1 c) <- (0 a)
     leftmost instance of b is to the
     right of the heap root.
     check root + 1 = 2, which points
     to a new valid substring, add the
     substring to the queue
     queue: (2, 3)
     result: 1 + 0 = 1
     
4 a  kth instance, remove (0 a)
     heap: (1 c)
     queue: (2, 3)
     result: 1
     leftmost instance of a is left
     of the heap root so continue
     
5 a  (k+1)th instance, move left border
     of the window to index 1
     heap: (1 c)
     queue: (2, 3)
     result: 1
     
     (5 a) is now the kth instance of
     a and its leftmost instance is to
     the right of the heap root.
     check root + 1 = 2, which points
     to a valid substring in the queue,
     add new substring to queue
     heap: (1 c)
     queue: (2, 3) -> (4, 5)
     result: 1 + 1 + 1 = 3
     
6 c  kth instance, remove (1 c)
     heap: empty
     add new substring to queue
     queue: (1) -> (2, 3) -> (4, 5) -> (6)
     (for simplicity, the queue here
     is not labeled; labels may be needed
     for the split intervals)
     result: 3 + 1 + 0 = 4
     
7 c  (k+1)th instance, move left border
     of the window to index 2, update queue
     heap: empty
     queue: (2, 3) -> (4, 5)
     result: 4
     
     (7 c) is now the kth instance of c
     heap: empty
     add new substring to queue
     queue: (2, 3) -> (4, 5) -> (6, 7)
     result: 4 + 1 + 2 = 7
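The per-index increments in the trace above can be cross-checked with a small brute-force helper. This is only a verification sketch, not the O(n * log(size_of_alphabet)) procedure itself, and the name `valid_endings` is made up for illustration.

```python
from collections import Counter


def valid_endings(s: str, k: int) -> list:
    """Return f, where f[i] is the number of valid substrings ending at index i."""
    f = []
    for i in range(len(s)):
        freq = Counter()
        count = 0
        # Extend the substring s[j..i] leftwards one character at a time.
        for j in range(i, -1, -1):
            freq[s[j]] += 1
            if freq[s[j]] > k:
                break  # extending further left cannot fix an excess count
            if all(v == k for v in freq.values()):
                count += 1
        f.append(count)
    return f


f = valid_endings("acbbaacc", 2)
print(f)       # [0, 0, 0, 1, 0, 2, 1, 3]
print(sum(f))  # 7
```

The values agree with the recurrence: f(5) = 1 + f(3) = 2, f(6) = 1 + f(0) = 1 and f(7) = 1 + f(5) = 3, for a total of 7.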

While amazing in detail, I'm surprised you didn't apply the overwhelming optimization I explain in my answer. — Fattie, Sep 23 '20 at 12:52

score 0 · Answer 2 · answered Aug 14 '20 at 13:12


The length of any such substring simply has to be an exact multiple of k. This slashes the depth of the search.

(Indeed, it can only be k multiplied by one of the integers from 1 up to the count of distinct characters.)
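One way this observation might be used, sketched here as an assumption rather than as this answer's own code, is to slide a fixed-size window of length d * k over s for each possible number d of distinct characters. The function name `count_substrings_by_length` is hypothetical.

```python
def count_substrings_by_length(s: str, k: int) -> int:
    """Count valid substrings by trying only window lengths d * k."""
    n = len(s)
    result = 0
    for d in range(1, len(set(s)) + 1):
        length = d * k
        if length > n:
            break
        freq = {}      # character counts inside the current window
        exactly_k = 0  # characters whose count in the window equals k
        for i, c in enumerate(s):
            # Add s[i] to the window.
            before = freq.get(c, 0)
            if before == k:
                exactly_k -= 1
            freq[c] = before + 1
            if freq[c] == k:
                exactly_k += 1
            # Drop the character that just left the window.
            if i >= length:
                out = s[i - length]
                if freq[out] == k:
                    exactly_k -= 1
                freq[out] -= 1
                if freq[out] == k:
                    exactly_k += 1
            # d characters with count k fill all d * k positions,
            # so the window contains nothing else.
            if i >= length - 1 and exactly_k == d:
                result += 1
    return result


print(count_substrings_by_length("aabbcc", 2))  # 6
```

Each window length takes one O(n) pass, so this sketch runs in O(n * size_of_alphabet) overall.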

Fattie

Well sure, in the trivial case of k=1/all different chars. But I can assure you, this is "the" optimization for this problem. No matter what algorithm you adopt, this optimization will slash the order. – Fattie Aug 14 '20 at 13:26
