How to generate n-grams in Scala?

Question

I am trying to code the dissociated press algorithm, based on n-grams, in Scala. How can I generate n-grams for a large file? For example, take a file containing "the bee is the bee of the bees".

First, it has to pick a random n-gram, for example "the bee".
Then it has to look for n-grams starting with the last (n-1) words of that n-gram, for example "bee of".
It prints the last word of this n-gram, then repeats.

Can you please give me some hints on how to do it? Sorry for the inconvenience.

I don't know what an n-gram is. Are you just choosing words randomly, or is there some logic? – santiagobasulto, Nov 24 '11 at 15:01
@santiagobasulto Wikipedia is your friend: http://en.wikipedia.org/wiki/N-gram — Matthew Farwell, Nov 24 '11 at 15:02
Is this by any chance related to http://stackoverflow.com/questions/8256830/how-to-make-string-sequence-in-scala? — Matthew Farwell, Nov 24 '11 at 15:51

score 14 · Answer 1 · answered Nov 24 '11 at 15:08

Your question could be a little more specific, but here is my attempt.

val words = "the bee is the bee of the bees"
// join each pair with a space; a bare mkString would print "thebee"
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))

peri4n


Note that this will only give you 2-grams. If n-grams are desired, then n needs to be parameterized. – tuxdna Dec 17 '13 at 12:50
@tuxdna but it can be easily adjusted – Szymon Roziewski May 20 '20 at 18:36
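
As a minimal sketch of that adjustment, the window size can be passed in as a parameter. The object and method names here (`NGrams`, `ngrams`) are hypothetical and not from the answer above; the length filter is an added assumption so that inputs shorter than n produce no output instead of one short window.

```scala
object NGrams {
  // Returns every run of exactly n consecutive words, each joined with a single space.
  def ngrams(n: Int, text: String): List[String] = {
    require(n > 0, "n must be positive")
    // Split on any whitespace and drop the empty token a leading space would produce.
    val words = text.split("\\s+").filter(_.nonEmpty)
    // sliding yields one short window when there are fewer than n words; drop it.
    words.sliding(n).filter(_.length == n).map(_.mkString(" ")).toList
  }
}
```

For example, `NGrams.ngrams(3, "the bee is the bee of the bees").foreach(println)` would print the 3-grams one per line.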

score 5 · Answer 2 · answered May 24 '13 at 09:58

You can try this with n as a parameter. It yields every k-gram for k from 1 to n:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
val ngrams = (1 to n).flatMap(i => w.sliding(i).map(_.toList))
ngrams foreach println

List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)

score 4 · Answer 3 · answered Dec 17 '13 at 12:48

Here is a stream-based approach. It does not require much memory while computing the n-grams.

object ngramstream extends App {

  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

OUTPUT:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
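
On Scala 2.13 and later, `Stream` is deprecated in favour of `LazyList`. A similar lazy pipeline can also be sketched with a plain `Iterator`, which avoids both types. The object name `LazyNGrams` is hypothetical, and the length filter is an added assumption to skip the short window that `sliding` returns when there are fewer than k words.

```scala
object LazyNGrams {
  // Lazily yields all k-grams for k = 2..n (1-grams excluded, as in the answer above).
  // Nothing is computed until the returned iterator is consumed.
  def ngrams(n: Int, words: IndexedSeq[String]): Iterator[IndexedSeq[String]] =
    (2 to n).iterator.flatMap { k =>
      words.sliding(k).filter(_.length == k)
    }
}
```

A possible call, under the same assumptions, is `LazyNGrams.ngrams(4, words.split(" ").toIndexedSeq).foreach(g => println(g.toList))`. Note that the word sequence itself is still held in memory; only the n-grams are produced on demand.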

I like it, not sure of the usefulness of `process`. Why not just do `ngrams(...).foreach(x=>println(x.toList))`? — Mortimer, Mar 18 '14 at 13:51
@Mortimer: Interesting question. `process` is just an additional function. We can definitely use `ngrams2 foreach { x => println(x.toList)}`. Thanks :-) — tuxdna, Mar 19 '14 at 11:57
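
None of the answers covers the generation loop from the question, so here is a rough sketch of how the three steps could be chained once the n-grams exist. Everything named here (`DissociatedPress`, `generate`, `maxWords`, the `rng` parameter) is hypothetical, and the prefix map is an assumed design choice that keeps all n-grams in memory, which may not suit very large files.

```scala
import scala.util.Random

object DissociatedPress {
  // Generates at most maxWords words by chaining n-grams that overlap on n-1 words.
  def generate(words: Vector[String], n: Int, maxWords: Int, rng: Random): Vector[String] = {
    require(n >= 2, "n must be at least 2")
    val grams = words.sliding(n).filter(_.length == n).toVector
    if (grams.isEmpty) Vector.empty
    else {
      // Index the n-grams by their first n-1 words to look up continuations quickly.
      val byPrefix = grams.groupBy(_.take(n - 1))
      // Step 1: start from a random n-gram.
      var out = grams(rng.nextInt(grams.length))
      var stuck = false
      while (out.length < maxWords && !stuck) {
        // Step 2: find the n-grams that start with the last n-1 words emitted so far.
        byPrefix.get(out.takeRight(n - 1)) match {
          // Step 3: append the last word of a randomly chosen match, then repeat.
          case Some(candidates) =>
            out = out :+ candidates(rng.nextInt(candidates.length)).last
          // No continuation exists, e.g. after the final words of the text.
          case None => stuck = true
        }
      }
      out.take(maxWords)
    }
  }
}
```

It could be tried with `DissociatedPress.generate(words.split(" ").toVector, 2, 20, new Random()).mkString(" ")`. The output varies from run to run because the starting n-gram and each continuation are chosen at random.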
