Text mining with Scala

Question

I have a .txt file with the following data:

L666371 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Lord Chelmsford seems to want me to stay back with my Basutos.
L666370 +++$+++ u9034 +++$+++ m616 +++$+++ VEREKER +++$+++ I'm to take the Sikali with the main column to the river
L666369 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Your orders, Mr Vereker?
L666257 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot
L666256 +++$+++ u9034 +++$+++ m616 +++$+++ VEREKER +++$+++ Colonel Durnford... William Vereker. I hear you 've been seeking Officers?

I want to import the text file into Scala (which I've done) and then extract all the text. After that I want to tokenise it, lowercase it, ignore word forms, and separate punctuation. Finally, I want to count the unigrams, bigrams and trigrams, sorting the results with the highest count at the top.

Can anybody tell me how I’d implement this? I have the following attempt, but it doesn’t seem to be working:

import io.Source
val s = Source.fromFile("movie_lines.txt")("ISO-8859-1")
val lines = s.getLines
val str = s.mkString

val Pattern = "([A-Z]+.!)".r

Pattern.findAllIn(str).foreach { x => println(x) }

println ("\n This is the result\n")

score 0 · Answer 1 · answered Mar 02 '15 at 07:56

You can use the Epic library from the ScalaNLP suite to preprocess the text (tokenizing), and then to parse, tag and extract entities.
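If you would rather not pull in a library for the counting part, here is a rough sketch using only the standard library. It is not Epic's API. The object name `NgramSketch` and the helpers `utterance`, `tokenise` and `ngramCounts` are made up for illustration. It assumes the spoken text is always the fifth ` +++$+++ `-separated field, treats each punctuation character as its own token, and does not let n-grams cross line boundaries. It does nothing about "ignore word forms" (stemming or lemmatisation), which would need a separate step.

```scala
import java.util.regex.Pattern
import scala.io.Source

// Hypothetical helper object; all names here are illustrative only.
object NgramSketch {
  // Literal field separator seen in the sample data, quoted for use as a regex.
  private val FieldSeparator = Pattern.quote(" +++$+++ ")

  // A token is either a run of letters/digits/apostrophes or one punctuation character.
  private val Token = """[\p{L}\p{N}']+|[^\s\p{L}\p{N}']""".r

  // Assumption: the utterance is the fifth field; malformed lines yield "".
  def utterance(line: String): String = {
    val fields = line.split(FieldSeparator, 5)
    if (fields.length == 5) fields(4) else ""
  }

  // Lowercase the text and split it into word and punctuation tokens.
  def tokenise(text: String): Vector[String] =
    Token.findAllIn(text.toLowerCase).toVector

  // Count n-grams within each line, sorted by descending count, then alphabetically.
  def ngramCounts(lines: Seq[Vector[String]], n: Int): Seq[(String, Int)] =
    lines
      .flatMap(_.sliding(n).filter(_.size == n).map(_.mkString(" ")))
      .groupBy(identity)
      .map { case (gram, occurrences) => (gram, occurrences.size) }
      .toSeq
      .sortBy { case (gram, count) => (-count, gram) }

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("movie_lines.txt", "ISO-8859-1")
    val tokenised =
      try source.getLines().map(line => tokenise(utterance(line))).toVector
      finally source.close()

    for (n <- 1 to 3) {
      println(s"Top $n-grams:")
      ngramCounts(tokenised, n).take(10).foreach { case (gram, count) =>
        println(s"$count\t$gram")
      }
    }
  }
}
```

Running `main` next to `movie_lines.txt` should print the ten most frequent unigrams, bigrams and trigrams. If punctuation tokens should not take part in the n-grams, filter them out of the result of `tokenise` before counting.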

answered Mar 02 '15 at 07:56

jepemo
