Please keep in mind that I'm new to Scala.

This is the example I am trying to follow: https://spark.apache.org/docs/1.5.1/ml-ann.html

It uses this dataset: https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt

I have prepared my .csv using the code below to get a data frame for classification in Scala.

//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}

//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")

//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");

scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])

//define a String-to-Double UDF
val toDouble = udf[Double, String]( _.toDouble)

//Convert all to double
val featureDf = DF2
  .withColumn("gst_id_matched", toDouble(DF2("gst_id_matched")))
  .withColumn("ip_crowding", toDouble(DF2("ip_crowding")))
  .withColumn("lat_long_dist", toDouble(DF2("lat_long_dist")))
  .select("gst_id_matched", "ip_crowding", "lat_long_dist")


//Assemble the two feature columns into a dense vector
val toVec4 = udf[Vector, Double, Double] { (v1, v2) => Vectors.dense(v1, v2) }

//Encode the label column (gst_id_matched) as a Double
//NOTE: any value other than "0.0" or "1.0" will throw a MatchError
val encodeLabel = udf[Double, String] {
  case "0.0" => 0.0
  case "1.0" => 1.0
}

//Transformed dataset
val df = featureDf
  .withColumn("features", toVec4(featureDf("ip_crowding"), featureDf("lat_long_dist")))
  .withColumn("label", encodeLabel(featureDf("gst_id_matched")))
  .select("label", "features")

val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network: 
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(12)
  .setSeed(1234L)
  .setMaxIter(10)
// train the model
val model = trainer.fit(train)

The last line generates this error:

15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0

My suspicions:

When I examine the dataset, it looks fine for classification:

scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])

But the Apache example dataset is different, and my transformation does not give me what I need. Can someone please help me with the dataset transformation, or help me understand the root cause of the problem?

This is what the Apache dataset looks like:

scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])
steven

1 Answer

The source of your problem is an incorrect definition of `layers`. When you use

val layers = Array[Int](0, 0, 0, 0)

it means you want a network with zero nodes in each layer, which simply doesn't make sense. Generally speaking, the number of neurons in the input layer should be equal to the number of features, the number of neurons in the output layer should be equal to the number of classes, and each hidden layer should contain at least one neuron.
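
As a rough sketch of how the array maps onto the network (the sizes and names below are illustrative, not taken from your data):

// layers(0)    = number of input features
// layers(1..n) = hidden layer sizes, each at least 1
// layers.last  = number of output classes
val numFeatures = 2
val numClasses = 2
val exampleLayers = Array[Int](numFeatures, 5, 4, numClasses)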

Let's recreate your data, simplifying your code along the way:

import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq(
  ("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")

Convert all columns to doubles:

val numeric = df
  .select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
  .withColumnRenamed("gst_id_matched", "label")

Assemble features:

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("ip_crowding","lat_long_dist"))
  .setOutputCol("features")

val data = assembler.transform(numeric)
data.show

// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist|         features|
// +-----+-----------+-------------+-----------------+
// |  0.0|        0.0|          0.0|        (2,[],[])|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
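
An aside on the notation: `(2,[],[])` above is the string form of a sparse vector, printed as `(size,[indices],[values])`. It is the same notation as the `(4,[0,1,2,3],[...])` row in the linked example, so the difference is only in how the vector is stored, not in what it means. A minimal sketch, using the values from that example row:

import org.apache.spark.mllib.linalg.Vectors

// sparse form: size, indices of the non-zero entries, and their values
val sparse = Vectors.sparse(4, Array(0, 1, 2, 3),
  Array(-0.222222, 0.5, -0.762712, -0.833333))

// equivalent dense form: all four values stored explicitly
val dense = Vectors.dense(-0.222222, 0.5, -0.762712, -0.833333)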

Train and test network:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(data)
model.transform(data).show

// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist|         features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// |  0.0|        0.0|          0.0|        (2,[],[])|       0.0|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|       0.0|
// +-----+-----------+-------------+-----------------+----------+
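
If you want to evaluate the model the way the linked example does, you can reuse the `MulticlassClassificationEvaluator` import from the question. A minimal sketch, assuming you keep the train/test split from the question instead of fitting on the full data:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// compute overall precision on the held-out split, as in the linked example
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("precision")
println("Precision: " + evaluator.evaluate(predictionAndLabels))
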
zero323
  • How do you decide the number of input neurons in the case of text classification (while using a tokenizer and the hashing trick)? – raxith Nov 26 '15 at 18:01
  • It should be equal to `numFeatures` in `HashingTF`. – zero323 Nov 26 '15 at 18:04
  • @zero323 the output layer has 3 neurons, but why is the output just a `Double` ("single neuron")? In my mind it should be a `Vector` or `VectorUDT` but having a label column of `VectorUDT` throws `java.lang.IllegalArgumentException: requirement failed: Column replylabels must be of type DoubleType but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.` – Hendy Irawan Apr 01 '16 at 18:41
  • @HendyIrawan Because it is not a neuron but label (class) prediction. As far as I remember there is no way to get raw output in ML MP. – zero323 Apr 01 '16 at 18:46
  • Is there a way to train & predict based on output neurons? So I can train 1 model having 1000 output neurons, instead of training 1000 binary classifier models... – Hendy Irawan Apr 01 '16 at 22:47