
I am running a random forest model on Spark, following the example at https://spark.apache.org/docs/2.0.0/ml-classification-regression.html#random-forest-classifier

My issue is that if I load two different DataFrames for train and test, e.g.:

val Array(trainingData, testData) = Array(convertedVecDF, convertedVecDF_test)

I get "java.util.NoSuchElementException: key not found: -1.0" as the cause of the error, but when I do the following instead, I get no error:

val data = convertedVecDF.union(convertedVecDF_test)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

and then run the code from the linked example, it works fine.
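A plausible explanation for the difference (a sketch, not the actual Spark internals): the `VectorIndexer` used in the linked example builds, for each categorical feature, a map from the raw values it saw during `fit()` to category indices. Fitting on the union (or on a `randomSplit` of the union) means every value in the test set is in that map; fitting only on a separate training DataFrame can leave a test-set value, such as `-1.0`, unmapped. The exception message matches a plain Scala `Map` lookup failure, which the hypothetical values below illustrate:

```scala
// Plain-Scala analogy (no Spark needed) for the "key not found" error.
// Hypothetical map learned from the training DataFrame only:
val categoryToIndex: Map[Double, Int] = Map(0.0 -> 0, 1.0 -> 1)

// A row from a separately loaded test DataFrame carries a value
// that fit() never saw:
val unseen = -1.0

// Looking it up the way an indexer model would throws
// java.util.NoSuchElementException: key not found: -1.0
val result =
  try Right(categoryToIndex(unseen))
  catch { case e: NoSuchElementException => Left(e.getMessage) }

println(result)  // Left(key not found: -1.0)
```

This would also explain why a very small separate test set can work: with few points, it is likely that every category value in it also appears in the training data.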

The class of all the variables (data, convertedVecDF, convertedVecDF_test, trainingData, testData) is Class[_ <: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = class org.apache.spark.sql.Dataset, i.e. they are all DataFrames.

When I separate out the variables as in the first case but use a very small test set (say 10 points), it works fine.

Why is that? It seems like a resource-access issue, but I can't work out what Spark is doing. What can I do to make the first case run, i.e. run with separate train/test data?

EDIT

This problem was due to the issue described in Error when passing data from a Dataframe into an existing ML VectorIndexerModel. Upgrading to Spark 2.3.1 solved it.
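For anyone stuck on an older Spark version, one workaround consistent with the behaviour above is to fit the `VectorIndexer` on the union of both DataFrames (so its category map covers every value) but transform and train/evaluate on the separate sets. This is a hedged sketch, not tested on the question's data; it assumes an active SparkSession and the `convertedVecDF` / `convertedVecDF_test` DataFrames from the question with a `features` column, so it is a fragment rather than a standalone script:

```scala
import org.apache.spark.ml.feature.VectorIndexer

// Fit the indexer on the union so every category value in either
// DataFrame ends up in the model's value-to-index map.
val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)  // same setting as the linked example
  .fit(convertedVecDF.union(convertedVecDF_test))

// Transform each set separately; train on one, evaluate on the other.
val trainingData = indexerModel.transform(convertedVecDF)
val testData = indexerModel.transform(convertedVecDF_test)
```

Note this leaks the test set's category values into the indexer (not the label or feature statistics), which is usually acceptable for index bookkeeping but worth being aware of.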
