I am running a random forest (RF) model on Spark, following https://spark.apache.org/docs/2.0.0/ml-classification-regression.html#random-forest-classifier
My issue is that if I load two different DataFrames for train and test, e.g.:
val Array(trainingData, testData) = Array(convertedVecDF, convertedVecDF_test)
I get `java.util.NoSuchElementException: key not found: -1.0` as the cause of the error, but when I do the following instead, there is no error:
val data = convertedVecDF.union(convertedVecDF_test)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
and then run the code from the link, it works fine.
All of the variables (data, convertedVecDF, convertedVecDF_test, trainingData, testData) are of class org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], i.e. DataFrame.
When I separate out the variables as in the first case but use a very small test set (say 10 points), it works fine.
Why is that? It seems like a resource-access issue, but I can't work out what Spark is doing here. What can I do to make the first case run, i.e. with separate train/test DataFrames?
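For intuition, the failure mode can be simulated in plain Scala without Spark (this is an assumption about what the fitted indexer does internally, not the actual Spark implementation): a fitted indexer model memorises a value-to-index map from the data it was fit on, and looking up a value it never saw throws exactly this exception.

```scala
// Plain-Scala sketch (no Spark) of what a fitted indexer does:
// it builds a value -> index map from the data it was fit on, and a
// lookup on an unseen value throws NoSuchElementException: key not found.
object IndexerSketch {
  // "fit": memorise an index over the distinct values seen in training
  def fit(trainValues: Seq[Double]): Map[Double, Int] =
    trainValues.distinct.sorted.zipWithIndex.toMap

  // "transform": look a value up; Map.apply throws on a missing key
  def transform(index: Map[Double, Int], v: Double): Int =
    index(v)

  def main(args: Array[String]): Unit = {
    val index = fit(Seq(0.0, 1.0))   // fit on the training frame only
    println(transform(index, 1.0))   // fine: 1.0 was seen during fit
    try transform(index, -1.0)       // -1.0 occurs only in the test frame
    catch {
      case e: NoSuchElementException =>
        println(e.getMessage)        // "key not found: -1.0"
    }
  }
}
```

This also explains the other observations: randomSplit on the union makes it very likely the training split covers every value, and a tiny hand-picked test set happens to contain only values the training frame already has.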
EDIT
This problem was due to the issue described in "Error when passing data from a Dataframe into an existing ML VectorIndexerModel". Moving to Spark 2.3.1 solved it.
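For older Spark versions, one workaround is to fit the indexer on the union of both frames (e.g. on convertedVecDF.union(convertedVecDF_test)) while still training and evaluating on the separate DataFrames; Spark 2.3 also added a handleInvalid parameter to VectorIndexer for unseen values. Continuing the same plain-Scala simulation rather than real Spark API calls, the fit-on-union idea looks like this:

```scala
// Plain-Scala sketch (no Spark) of the fit-on-union workaround:
// build the value -> index map from train AND test values, so a transform
// on the separate test frame can no longer hit an unseen key.
object UnionFitSketch {
  def fit(values: Seq[Double]): Map[Double, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val trainValues = Seq(0.0, 1.0)
    val testValues  = Seq(1.0, -1.0)           // -1.0 appears only in test
    val index = fit(trainValues ++ testValues) // analogous to fitting on the union
    println(index(-1.0))                       // resolves instead of throwing
  }
}
```

Note this only makes the indexing step succeed; the model itself is still fit on the training frame alone.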