
I am tinkering with PredictionIO to build a custom classification engine. I have done this before without issues, but for the current dataset `pio train` is giving me the error "tokens must not be empty". I have edited DataSource.scala to map the fields in my dataset to the engine. A line from my dataset is shown below:

{"event": "ticket", "eventTime": "2015-02-16T05:22:13.477+0000", "entityType": "content","entityId": 365,"properties":{"text": "Request to reset svn credentials","label": "Linux/Admin Task" }}

I can import the data and build the engine without any issues, and I am getting a set of observations too. The error is pasted below:

[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.61.44:50713]
[INFO] [Engine$] EngineWorkflow.train
[INFO] [Engine$] DataSource: org.template.textclassification.DataSource@4fb64e14
[INFO] [Engine$] Preparator: org.template.textclassification.Preparator@5c4cc644
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.NBAlgorithm@62b6c045)
[INFO] [Engine$] Data sanity check is off.
[ERROR] [Executor] Exception in task 0.0 in stage 2.0 (TID 2)
[WARN] [TaskSetManager] Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalArgumentException: tokens must not be empty
at opennlp.tools.util.StringList.<init>(StringList.java:61)
at org.template.textclassification.PreparedData.org$template$textclassification$PreparedData$$hash(Preparator.scala:71)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

[ERROR] [TaskSetManager] Task 0 in stage 2.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalArgumentException: tokens must not be empty
at opennlp.tools.util.StringList.<init>(StringList.java:61)
at org.template.textclassification.PreparedData.org$template$textclassification$PreparedData$$hash(Preparator.scala:71)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

The problem is with the dataset. I split the dataset into parts and trained again; training completed for that data and no errors were reported. How can I find out which line in the dataset produces the error? It would be very helpful if PredictionIO had this feature.
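
In the meantime, a crude way I can narrow it down is to scan the exported file for records whose "text" property is empty or whitespace-only. This is only a sketch: it assumes one JSON record per line with a simple unescaped "text" value, and it will not catch records whose text becomes empty only after tokenization and stop-word removal.

import scala.io.Source

// Sketch: print the line numbers of records whose "text" field is missing,
// empty, or whitespace-only in a one-record-per-line JSON export.
object FindEmptyText {
  private val TextField = """"text"\s*:\s*"([^"]*)"""".r

  def main(args: Array[String]): Unit = {
    Source.fromFile(args(0)).getLines().zipWithIndex.foreach {
      case (line, idx) =>
        val hasText = TextField.findFirstMatchIn(line)
          .exists(_.group(1).trim.nonEmpty)
        if (!hasText) println(s"line ${idx + 1}: $line")
    }
  }
}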

cutteeth

1 Answer


So this is something that happens when you feed an empty Array[String] to OpenNLP's StringList constructor. Try modifying the function hash in PreparedData as follows:

// Required imports at the top of Preparator.scala:
//   import opennlp.tools.ngram.NGramModel
//   import opennlp.tools.util.StringList
//   import scala.collection.JavaConversions._  // for model.iterator
//   import scala.collection.immutable.HashMap

// nMin and nMax are fields of the enclosing PreparedData class.
private def hash(tokenList: Array[String]): HashMap[String, Double] = {
  // Initialize an NGramModel from the OpenNLP tools library,
  // and add the list of allowable tokens to the n-gram model.
  try {
    val model: NGramModel = new NGramModel()
    model.add(new StringList(tokenList: _*), nMin, nMax)

    val map: HashMap[String, Double] = HashMap(
      model.iterator.map(
        x => (x.toString, model.getCount(x).toDouble)
      ).toSeq: _*
    )

    val mapSum = map.values.sum

    // Divide by the total number of n-grams in the document
    // to obtain n-gram frequency.
    map.map(e => (e._1, e._2 / mapSum))
  } catch {
    // An empty tokenList makes the StringList constructor throw;
    // fall back to a dummy map instead of failing the Spark task.
    case e: IllegalArgumentException => HashMap("" -> 0.0)
  }
}

I've only encountered this issue in the prediction stage, which is why you can see this guard is actually implemented in the models' predict methods. I'll update this right now and put it in a new version release. Thank you for the catch and the feedback!
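
Alternatively, you can keep hash as-is and drop empty documents in the Preparator before they ever reach it. The snippet below is just a sketch, not the template's code: `docs` stands in for the Preparator's RDD of document texts, and the whitespace tokenizer stands in for whatever tokenizer is actually used.

// Sketch: filter out documents that tokenize to nothing, so hash never
// receives an empty array.
def tokenize(text: String): Array[String] =
  text.split("""\s+""").filter(_.nonEmpty)

val hashedDocs = docs
  .map(tokenize)
  .filter(_.nonEmpty)
  .map(hash)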

Marco Vivero
  • I pulled the template from the git repository using `pio template get PredictionIO/template-scala-parallel-textclassification < Your new engine directory >` and tried training on the same dataset. It completed without any issues :). You guys are so fast. – cutteeth Jun 24 '15 at 05:34