8

I am trying to code the Dissociated Press algorithm, based on n-grams, in Scala. How do I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees":

  1. First it has to pick a random n-gram. For example, "the bee".
  2. Then it has to look for an n-gram starting with the last (n-1) words of the current one. For example, "bee of".
  3. It prints the last word of this n-gram, then repeats.

Could you please give me some hints on how to do this? Sorry for the inconvenience.

Mihai Iorga
user1002579

3 Answers

14

Your question could be a little more specific, but here is my attempt.

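val words = "the bee is the bee of the bees"
// sliding(2) yields every window of two consecutive words (the bigrams);
// mkString(" ") keeps a space between the words when printing
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))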
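
The same `sliding` call can also feed the three steps described in the question. What follows is only a minimal sketch, not a definitive implementation: the names `DissociatedPressSketch`, `successors`, `generate`, `maxWords` and `rng` are made up for illustration, and it assumes n is at least 2 and that the word list fits in memory.

```scala
import scala.util.Random

// Hypothetical helper object; the names are illustrative only.
object DissociatedPressSketch {

  // Map every (n-1)-word prefix to the words that follow it in the text.
  def successors(words: Seq[String], n: Int): Map[Seq[String], Seq[String]] =
    words.toVector.sliding(n).filter(_.size == n).toVector
      .groupBy(_.init)
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  // Step 1: pick a random n-gram. Steps 2 and 3: look up the last (n-1)
  // words, append one randomly chosen follower, and repeat until a dead
  // end is reached or maxWords words have been produced.
  def generate(words: Seq[String], n: Int, maxWords: Int, rng: Random): Seq[String] = {
    val table = successors(words, n)
    val grams = words.toVector.sliding(n).filter(_.size == n).toVector
    if (grams.isEmpty) Vector.empty
    else {
      var out = grams(rng.nextInt(grams.size))
      var alive = true
      while (alive && out.size < maxWords) {
        table.get(out.takeRight(n - 1)) match {
          case Some(next) => out = out :+ next(rng.nextInt(next.size))
          case None       => alive = false // no n-gram starts with this prefix
        }
      }
      out
    }
  }
}
```

For example, `DissociatedPressSketch.generate(words.split(' ').toSeq, 2, 20, new Random()).mkString(" ")` would produce up to 20 words of babble from the sample sentence. Repeated followers are kept in the table on purpose, so more frequent continuations are picked more often.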
peri4n
5

You can try this, with n as a parameter. It produces every k-gram for k from 1 to n:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
// for each size i from 1 to n, collect all windows of i consecutive words
val ngrams = (1 to n).flatMap(i => w.sliding(i).map(_.toList))
ngrams.foreach(println)

OUTPUT:

List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
tuxdna
4

Here is a stream-based approach. It should not require much memory while computing the n-grams, because the windows are produced lazily.

object ngramstream extends App {

  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

OUTPUT:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
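
For a genuinely large file, the word array itself may be too big to hold in memory. One possible variation, sketched here under the assumption that the input arrives as an iterator of lines, is to slide over an `Iterator` of words instead of an `Array`. The names `LazyNgrams` and `ngrams` are hypothetical, not part of the answer above.

```scala
// Hypothetical helper; only a sketch of the lazy variant.
object LazyNgrams {

  // Turn an iterator of lines into an iterator of n-word windows.
  // Words are pulled on demand, and windows may span line breaks.
  def ngrams(lines: Iterator[String], n: Int): Iterator[Seq[String]] =
    lines
      .flatMap(_.split("\\s+").iterator)
      .filter(_.nonEmpty)   // drop empty tokens from leading whitespace
      .sliding(n)
      .withPartial(false)   // skip a final window shorter than n
}
```

With a real file, the lines could come from `scala.io.Source.fromFile("corpus.txt").getLines()`, where `corpus.txt` is a placeholder file name. Since an iterator can be traversed only once, each n-gram size needs its own pass over the input.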
tuxdna
  • I like it, but I'm not sure how useful `process` is. Why not just do `ngrams(...).foreach(x => println(x.toList))`? – Mortimer Mar 18 '14 at 13:51
  • @Mortimer: Interesting question. `process` is just an additional helper function. We can definitely use `ngrams2 foreach { x => println(x.toList) }` instead. Thanks :-) – tuxdna Mar 19 '14 at 11:57