I have the following issue when training a Naive Bayes classifier. I'm getting this error:

  File "/home/juande/Desktop/spark-1.3.0-bin-hadoop2.4/python/pyspark/mllib/classification.py", line 372, in train
    return NaiveBayesModel(labels.toArray(), pi.toArray(), numpy.array(theta))
ValueError: invalid __array_struct__

It occurs when I train the model with these lines:

dataframe = dataframe.map(lambda x: LabeledPoint(sections_to_number[x[4]], tf.transform([x[0], x[1], x[2], x[3]])))
model = NaiveBayes.train(dataframe, 1.0)

Here sections_to_number is a dictionary that maps section names (strings) to float numbers, for example sports -> 0, weather -> 1, and so on.

However, if I pass a constant number as the label instead of looking it up in sections_to_number, I do not get any error:

dataframe = dataframe.map(lambda x: LabeledPoint(10.0, tf.transform([x[0], x[1], x[2], x[3]])))
model = NaiveBayes.train(dataframe, 1.0)

Am I missing something? Thanks

zero323
user3276768

1 Answer

NaiveBayes in the spark.ml package expects a DataFrame with two columns, label and features: the label column holds the target class, and the features column holds an org.apache.spark.ml.linalg.Vector. For a numeric (continuous) dataset, the features column can be built directly as a Vector. A categorical dataset must first be converted to numeric form, using StringIndexer, OneHotEncoder, or one of the other feature extraction techniques described at http://spark.apache.org/docs/latest/ml-features.html#stringindexer.

For example, StringIndexer maps foo -> 0 and bar -> 1, OneHotEncoder can then turn those indices into a Vector of doubles, and the resulting DataFrame of label and features columns is passed to the algorithm.
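To make the indexing and one-hot steps concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation and does not use the Spark API: the helper names fit_string_index and one_hot and the sample section names are hypothetical, the tie-breaking rule is an assumption of this sketch, and unlike Spark's OneHotEncoder defaults it keeps every category and returns a dense list.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption made for this sketch).
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: float(i) for i, s in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense list of floats with 1.0 at position `index`."""
    vec = [0.0] * size
    vec[int(index)] = 1.0
    return vec


# Hypothetical sample data standing in for the section column.
sections = ["sports", "weather", "sports", "politics"]

section_index = fit_string_index(sections)          # string -> float index
labels = [section_index[s] for s in sections]       # numeric labels
encoded = [one_hot(i, len(section_index)) for i in labels]

print(section_index)
print(labels)
print(encoded)
```

In a real job the same two steps would be done by the Spark transformers named above, so that the indices and vectors are produced as DataFrame columns rather than Python lists.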