
I am trying to predict a text field based on other text fields with PredictionIO. I used this guide for reference. I created a new app using

pio app new MyTextApp

and followed the guide up to the evaluation step, using the DataSource provided in the template. Everything worked until evaluation. When I run the evaluation, I get the error pasted below.

[INFO] [CoreWorkflow$] runEvaluation started
[WARN] [Utils] Your hostname, my-ThinkCentre-Edge72 resolves to a  loopback address: 127.0.0.1; using 192.168.65.27 instead (on interface eth0)
[WARN] [Utils] Set SPARK_LOCAL_IP if you need to bind to another address
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses  :[akka.tcp://sparkDriver@192.168.65.27:59649]
[INFO] [CoreWorkflow$] Starting evaluation instance ID: AU29p8j3Fkwdnkfum_ke
[INFO] [Engine$] DataSource: org.template.textclassification.DataSource@faea4da
[INFO] [Engine$] Preparator: org.template.textclassification.Preparator@69f2cb04
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.NBAlgorithm@45292ec1)
[INFO] [Engine$] Serving: org.template.textclassification.Serving@1ad9b8d3
Exception in thread "main" java.lang.UnsupportedOperationException: empty.maxBy
at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:223)
at scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
at org.template.textclassification.PreparedData.<init>(Preparator.scala:152)
at org.template.textclassification.Preparator.prepare(Preparator.scala:38)
at org.template.textclassification.Preparator.prepare(Preparator.scala:34)

Do I have to edit any config files to make this work? I have successfully run tests on the MovieLens data.

cutteeth

1 Answer


So this particular error message occurs when your data isn't getting read properly through the DataSource class. If you're using a different text data set, then make sure that you are correctly reflecting any changes to the eventNames, entityType, and respective property field names in the readEventData method.
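To make the mismatch concrete, here is a minimal sketch in plain Scala with no PredictionIO dependency. The `Event` case class and the `select` helper are hypothetical stand-ins for the event store lookup, not the template's API; they only show how a wrong entity type or event name silently yields an empty result.

```scala
object EventFilterDemo {
  // Minimal stand-in for a stored event; NOT the PredictionIO Event class.
  final case class Event(entityType: String, event: String, properties: Map[String, String])

  // Hypothetical helper: keeps only events whose entity type and event name
  // match the configured values, similar in spirit to what the lookup does.
  def select(events: Seq[Event], entityType: String, eventNames: Set[String]): Seq[Event] =
    events.filter(e => e.entityType == entityType && eventNames.contains(e.event))

  def main(args: Array[String]): Unit = {
    val stored = Seq(Event("content", "e-mail", Map("label" -> "spam", "text" -> "content")))

    // Template defaults do not match the imported data: nothing is selected.
    println(select(stored, "source", Set("documents")).size)

    // Names that match the imported data: the event is selected.
    println(select(stored, "content", Set("e-mail")).size)
  }
}
```

No error is raised at the point where the names fail to match; the failure only shows up later, when something downstream assumes the data is non-empty.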

The maxBy method is used to pull the class with the highest number of observations. If the category-to-label Map is empty, no classes were recorded, which means no data is being fed in.
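The behaviour can be reproduced outside the engine. The sketch below uses a plain `Map[String, Int]` as an assumed stand-in for the template's category counts; `majorityCategory` is a hypothetical helper, not part of the template, showing one way to guard the call.

```scala
object EmptyMaxByDemo {
  // Hypothetical helper: returns the category with the most observations,
  // or None when no data was read at all.
  def majorityCategory(counts: Map[String, Int]): Option[String] =
    if (counts.isEmpty) None else Some(counts.maxBy(_._2)._1)

  def main(args: Array[String]): Unit = {
    val empty = Map.empty[String, Int]

    // Calling maxBy directly on an empty collection throws
    // java.lang.UnsupportedOperationException, as in the stack trace.
    val threw =
      try { empty.maxBy(_._2); false }
      catch { case _: UnsupportedOperationException => true }
    println(s"empty.maxBy threw: $threw")

    // The guarded version reports the missing data instead of throwing.
    println(majorityCategory(empty))
    println(majorityCategory(Map("spam" -> 3, "ham" -> 5)))
  }
}
```

A guard like this does not fix the underlying problem; it only turns the cryptic exception into a clear "no training data" signal.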

For example, I just did a spam detector using this engine. My e-mail data is of the form:

{"entityType": "content", "eventTime": "2015-06-04T00:22:39.064+0000", "entityId": 1, "event": "e-mail", "properties": {"label": "spam", "text": "content"}}

To use the engine for this data I made the following changes in the DataSource class:

entityType = Some("source"), // specify data entity type
eventNames = Some(List("documents")) // specify data event name

changes to

entityType = Some("content"), // specify data entity type
eventNames = Some(List("e-mail")) // specify data event name

and

)(sc).map(e => Observation(
  e.properties.get[Double]("label"),
  e.properties.get[String]("text"),
  e.properties.get[String]("category")
)).cache

changes to:

)(sc).map(e => {
  val label = e.properties.get[String]("label")

  Observation(
    if (label == "spam") 1.0 else 0.0,
    e.properties.get[String]("text"),
    label
  )
}).cache
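The label mapping can be tried in isolation. This is a sketch under stated assumptions: `Observation` is redefined locally with assumed field names, and a plain `Map[String, String]` stands in for an event's properties, so `toObservation` is a hypothetical helper rather than template code. It also shows one way to skip events that lack a required property instead of failing on them.

```scala
object LabelMappingDemo {
  // Local stand-in for the template's Observation; field names are assumed.
  final case class Observation(label: Double, text: String, category: String)

  // props is a plain Map standing in for an event's properties (simplification).
  // Returns None when a required property is missing.
  def toObservation(props: Map[String, String]): Option[Observation] =
    for {
      label <- props.get("label")
      text  <- props.get("text")
    } yield Observation(if (label == "spam") 1.0 else 0.0, text, label)

  def main(args: Array[String]): Unit = {
    val complete = Map("label" -> "spam", "text" -> "content")
    val missingLabel = Map("text" -> "no label here")

    println(toObservation(complete))
    println(toObservation(missingLabel))
  }
}
```

In the real DataSource the properties are read with typed getters, so the requested type has to match how the property was imported: a string label such as "spam" cannot be read as a Double.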

After this, I'm able to go through building, training, and deployment, as well as an evaluation.

Marco Vivero
  • Thanks for the info. I was using the same app for different data sets. I deleted the existing app and its data, created a new app, then ran pio build, train and deploy. Now it's working fine. :) – cutteeth Jun 08 '15 at 09:21
  • Awesome, I'm glad the response helped! I just released a new version of the engine that contains a sanity check to make sure that the training data is actually being fed in. The PreparedClass was also modified, so that the text vectorization processing is done quicker. – Marco Vivero Jun 08 '15 at 19:25
  • I have downloaded the latest text classification template (2.0) and the same issue occurs in the recent update too. Evaluation fails with the error `java.lang.UnsupportedOperationException: empty.maxBy` and train fails with `io.prediction.data.storage.DataMapException: The field label is required.` pio says that the Spark address is bound to loopback. Do I have to change it to a public IP? Also, could you please explain text vectorization? – cutteeth Jun 10 '15 at 06:05
  • Hey, I just responded in your other question: http://stackoverflow.com/questions/30771784/predictionio-evaluation-fails-with-empty-maxby-exception-and-training-with-java – Marco Vivero Jun 11 '15 at 16:16