SparkException: Chi-square test expect factors

Question

I have a dataset containing 42 features and 1 label. I want to apply the ChiSqSelector feature selection method from the Spark ML library before running a decision tree for anomaly detection, but I get this error when applying the selector:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 11.

Here is my source code:

from pyspark.ml.feature import ChiSqSelector
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features", outputCol="features2", labelCol="label")
result = selector.fit(dfa1).transform(dfa1)
result.show()

Please add the output of dfa1.show(1) and dfa1.printSchema() so the problem is easier to understand. Are you sure that your features column is of array/vector type? – chlebek, Oct 29 '19 at 13:27
@chlebek show(1):
+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.121478,0.0,0.0...|    0|
+--------------------+-----+
– Med Othman, Oct 29 '19 at 13:53
@chlebek printSchema():
 |-- features: vector (nullable = true)
 |-- label: integer (nullable = true)
Thank you. – Med Othman, Oct 29 '19 at 13:54

chlebek · Answer 1 · 2019-10-29T14:03:31.243

0

As the error message says, one feature in your features vector (column 11) contains more than 10,000 distinct values, so the values look continuous rather than categorical. ChiSq can handle at most 10,000 categories, and you can't increase this value. The limit is hard-coded in the Spark source:

  /**
   * Max number of categories when indexing labels and features
   */
  private[spark] val maxCategories: Int = 10000
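
To make the limit concrete, here is a minimal pure-Python sketch of the kind of check that produces this error. It is only an illustration, not Spark's actual implementation: the function name `columns_exceeding_limit` is hypothetical, and the rows are assumed to be plain Python lists rather than Spark vectors.

```python
# Illustration only: mimics the distinct-value check, not Spark's real code.
MAX_CATEGORIES = 10000  # same value as the Spark constant quoted above


def columns_exceeding_limit(rows, max_categories=MAX_CATEGORIES):
    """Return the indices of feature columns that have more than
    max_categories distinct values (hypothetical helper)."""
    distinct = {}
    for row in rows:
        for index, value in enumerate(row):
            # Collect the set of distinct values seen in each column.
            distinct.setdefault(index, set()).add(value)
    return sorted(
        index for index, values in distinct.items()
        if len(values) > max_categories
    )


if __name__ == "__main__":
    # Column 0 is continuous-looking (every value differs),
    # column 1 is categorical (only three values).
    sample = [[i * 0.001, i % 3] for i in range(20000)]
    print(columns_exceeding_limit(sample))
```

Running a check like this on a sample of the data shows which features look continuous and would trip the limit.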

In this case you can use VectorIndexer with setMaxCategories() set below 10,000 to prepare your data. You can try other methods to prepare the data, but the selector will not work as long as any feature in the vector has more than 10,000 distinct values.
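
One of the "other methods" could be discretizing continuous features into a bounded number of buckets, so that every feature has far fewer than 10,000 distinct values. Note that VectorIndexer only indexes features whose distinct-value count is at or below maxCategories and leaves the rest as continuous, so it may not be enough on its own. The sketch below shows the idea with equal-width binning in plain Python. The helper name `equal_width_bins` and the bin count are assumptions for illustration; in Spark, pyspark.ml.feature offers QuantileDiscretizer and Bucketizer for this kind of step, which are not shown here.

```python
def equal_width_bins(values, num_bins):
    """Map continuous values to integer bucket ids in [0, num_bins - 1]
    using equal-width bins (hypothetical helper, not a Spark API)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls into a single bucket.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bucket num_bins, so clamp it.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


if __name__ == "__main__":
    # 20,000 distinct continuous values collapse to at most 100 categories.
    continuous = [i / 1000 for i in range(20000)]
    binned = equal_width_bins(continuous, num_bins=100)
    print(len(set(continuous)), len(set(binned)))
```

The binned feature can then be treated as categorical. Whether binning is appropriate, and how many bins to use, depends on the dataset.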

edited Oct 29 '19 at 14:03

answered Oct 29 '19 at 13:56

chlebek


When I test VectorIndexer, I get the same error. – Med Othman Oct 30 '19 at 08:04
Another question: my goal is to get the best accuracy with the minimum number of features, and I want to test chi-square to compare it with other selection methods. Does it make sense to argue that other methods are more reliable and better suited to my dataset than chi-square, which does not support features with more than 10K distinct values? Thank you in advance. – Med Othman Oct 30 '19 at 08:08
