3

what is the elegant way for finding how many times a given string is a substring of another string using scala?

The below test cases that should make clear what are the requirements:

import org.scalatest.FunSuite

class WordOccurrencesSolverTest extends FunSuite {

  private val solver = new WordOccurrencesSolver()

  test("solve for a in a") {
    assert(solver.solve("a", "a") === 1)
  }

  test("solve for b in a") {
    assert(solver.solve("b", "a") === 0)
  }

  test("solve for a in aa") {
    assert(solver.solve("a", "aa") === 2)
  }

  test("solve for b in ab") {
    assert(solver.solve("b", "ab") === 1)
  }

  test("solve for ab in ab") {
    assert(solver.solve("ab", "ab") === 1)
  }

  test("solve for ab in abab") {
    assert(solver.solve("ab", "abab") === 2)
  }

  test("solve for aa in aaa") {
    assert(solver.solve("aa", "aaa") === 2)
  }
}

Here is my solution to the problem of which I am not particularly proud:

class WordOccurrencesSolver {

  def solve(word: String, text: String): Int = {
    val n = word.length
    def solve(acc: Int, word: String, sb: String): Int = sb match {
      case _ if sb.length < n => acc
      case _ if sb.substring(0, n) == word => solve(acc + 1, word, sb.tail)
      case _ => solve(acc, word, sb.tail)
    }
    solve(0, word, text)
  }

}

I assume there must be a clean one liner which takes advantage of Scala's higher order functions instead of recursion and match/case clause.

GA1
  • 1,568
  • 2
  • 19
  • 30
  • 2
    Well... I will advise you to atay away from the temptation of finding one liners. First try to focus on the `time-complexity` of your solutions. Only after taking care of that should you look for one-liners or elegance. Try to think of a solution which avoids `n` time usage of substring (which is actually O(n)) making your solution very poor in performence. – sarveshseri Apr 10 '17 at 12:50
  • Similarly `.tail` for String is also O(n) and should be avoided. – sarveshseri Apr 10 '17 at 12:55
  • Good point, I missed it somehow. – GA1 Apr 10 '17 at 13:09

3 Answers3

14

If you are looking for an idiomatic Scala solution then you can use the sliding to create a sliding window iterator and count the wondows which are equal to your target String.

This solution while being functional also offers you an acceptable performence.

def countOccurrences(src: String, tgt: String): Int =
  src.sliding(tgt.length).count(window => window == tgt)
sarveshseri
  • 13,738
  • 28
  • 47
  • Looks great. What is the time complexity for this solution? – GA1 Apr 10 '17 at 13:12
  • Well... lets say that your `src` is of lenght `m` and `tgt` is of length `n`. Then step 1 - creating the window iterator is `O(m)`. this iterator will have `m-n` window strings. step 2 - count involves a comparision of window string of length `n` with `tgt` string for every window string. so it is `O(n*m)`. So the over all complextity is `O(m*n)` – sarveshseri Apr 10 '17 at 13:21
  • Great, this is exactly the solution I wanted to find. I am still curious though. Are there algorithms which could achieve the same thing (not necessarily one liners) in `O(n)` following your notation? That is algorithms for which the length of `src` is irrelevant, speaking in `O` terms. – GA1 Apr 10 '17 at 13:29
  • SubString search has been a very worked upon field and hence there are many wonderful algorithms, some of which are condition dependent where as some are generally applicable. You can read about them in detail here - http://www-igm.univ-mlv.fr/~lecroq/string/ . But be-warned this written with a mathematics centric perspective. But you can just search for simpler explanations of these algorithms on Google. – sarveshseri Apr 10 '17 at 13:37
  • But there will not be any O(n) algorithms as that will be impossible. Even O(m) algorithms should be impossible. – sarveshseri Apr 10 '17 at 13:40
2

You can probably use this java function:

StringUtils.countMatches(stringToFindMatchesIn, keyWordToFind );

This will return the number of occurences of the keyword in the string

iPhantomGuy
  • 240
  • 1
  • 11
  • I do not think OP is allowed to use that. This looks like a homework question/excercise. – sarveshseri Apr 10 '17 at 12:56
  • StringUtils is from Apache Commons, not from a Core Java library! And it's not a Scala-only approach anyway. – Viacheslav Rodionov May 09 '19 at 14:47
  • 1
    @ViacheslavRodionov it is work at Scala and it's seems correct (does what is need). Also it is probably more efficient than answers upper – Mikhail Ionkin Jul 23 '19 at 09:07
  • @MikhailIonkin please read the original question again and also the comment above mine. The sole fact that this works is not enough to propose this code as a solution. – Viacheslav Rodionov Jul 23 '19 at 19:25
  • 1
    @ViacheslavRodionov At google i found this question, and this answer helps me. It's elegant, tested (production quality) and "one-line" solution. Yes, end of this question about clear Scala solution, but I'm not sure that it is best for all. – Mikhail Ionkin Jul 23 '19 at 20:38
0

If somebody else is looking for a very efficient and semi-idomatic solution, use

extension (s: String) def countSubstr(sub: String, overlap: Boolean = false): Int = {
  if (sub.isEmpty) throw IllegalArgumentException("substring must not be empty")
  val d = if (overlap) 1 else sub.length
  @tailrec
  def count(pos: Int, acc: Int): Int =
    s.indexOf(sub, pos) match {
      case -1 => acc
      case j => count(j + d, acc + 1)
    }
  count(0, 0)
}

For Scala 2 compatibility or if you do not like extensions, use

def countSubstr(s: String, sub: String, overlap: Boolean = false): Int = {
  ...

Runtime is O(s.length) for overlap = false. There are no object allocations, the @tailrec method gets optimized to a jump and the match to an if.

As your last example suggests, you wanted to allow overlaps, so it is slightly slower (but worst case is O(s.length*sub.length)).


If you are looking for a (slower, but may still be faster than sliding) one-liner this may be for you:

  def countSubstr(s: String, regex: String): Int =
    s.split(regex, -1).length - 1

Note:

  • Second argument is a regex, this may lead to unexpected results and be slower if a real regex is used.
  • It does not count overlapping substrings, so it fails on the last test.
  • The -1 as second argument to split is important. Otherwise, substrings at the end would be ignored.
anselm
  • 76
  • 4