I'm using Spark to run LinearRegression. Since my data cannot be fitted well by a linear model, I added some higher-order polynomial features to get a better result. This works fine!
Instead of modifying the data myself, I wanted to use the PolynomialExpansion class from the Spark ML library. To find the best solution I looped over different degrees. After 10 iterations (degree 10) I ran into the following error:
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 77 indices and values, which exceeds the specified vector size -30.
My trainingData has 2 features. It sounds like I end up with too many features after the polynomial expansion at degree 10, but the vector size of -30 confuses me. To narrow this down, I started experimenting with different example data and degrees. For testing I used the following lines of code with different testData (each containing only one entry line) in libsvm format:
import org.apache.spark.ml.feature.PolynomialExpansion

// load the test data in libsvm format
val data = spark.read.format("libsvm").load("data/testData2.txt")

val polynomialExpansion = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(10)

val polyDF2 = polynomialExpansion.transform(data)
polyDF2.select("polyFeatures").take(3).foreach(println)
ExampleData: 0 1:1 2:2 3:3
polynomialExpansion.setDegree(11)
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 333 indices and values, which exceeds the specified vector size 40.
ExampleData: 0 1:1 2:2 3:3 4:4
polynomialExpansion.setDegree(10)
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 1000 indices and values, which exceeds the specified vector size -183.
ExampleData: 0 1:1 2:2 3:3 4:4 5:5
polynomialExpansion.setDegree(10)
Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 2819 indices and values, which exceeds the specified vector size -548.
It looks like the number of features in the data has an effect on the highest possible degree, but the number of features after the polynomial expansion doesn't seem to be the cause of the error, since it varies a lot. The code also doesn't crash at the expansion call itself, but only when I try to print the new features in the last line of code.
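To get a feel for how fast the expansion grows, I wrote a small helper that computes the expanded dimension I would expect. This is based on my reading of the PolynomialExpansion docs (the (x, y) example at degree 2 expands to 5 features), so the formula C(n + d, d) - 1 is my assumption, not something I verified in the Spark source:

// my assumption: a vector with n features expanded to degree d should end up
// with C(n + d, d) - 1 output features
def expectedPolySize(numFeatures: Int, degree: Int): Long = {
  // binomial coefficient C(numFeatures + degree, degree), computed with Longs
  // so the intermediate products don't overflow Int
  val n = numFeatures + degree
  var result = 1L
  for (i <- 1 to degree) {
    result = result * (n - degree + i) / i
  }
  result - 1
}

println(expectedPolySize(3, 11)) // 3 features, degree 11
println(expectedPolySize(5, 10)) // 5 features, degree 10

Either way, the size clearly grows very quickly with both the number of features and the degree.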
I was thinking that maybe my memory was full at that point, but I checked the system monitor and there was still some free memory available.
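One more thing that might matter: as far as I know, DataFrame transformations are lazy, so transform() only builds the plan and the expansion itself only runs when an action like take() is called. That would explain why the exception only shows up at the print line. A sketch of how I'd force the evaluation earlier (untested, just to illustrate the idea):

val polyDF2 = polynomialExpansion.transform(data) // nothing is computed yet

// count() is an action, so it should force the expansion to actually run;
// if the failure is in the expansion itself, I'd expect the same exception here
polyDF2.count()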
I'm using:
- Eclipse IDE
- Maven project
- Scala 2.11.7
- Spark 2.0.0
- Spark-mllib 2.0.0
- Ubuntu 16.04
I'd be glad for any ideas regarding this problem.