
I currently have text data and am trying to predict a class; in my case there are 60 classes to choose from. When I build a random forest model with scikit-learn, I get an F1 score of 78%.

However, when I set up the model in PySpark I only get 30%. WAY TOO LOW! What is going on? Maybe I am not setting it up right. Also, with PySpark, the random forest only ever predicts 12 labels, whereas my data has 60.

scikit-learn code:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier

# FeatureExtractor, ItemSelector and convert2dict are custom transformers defined elsewhere
rf_model = Pipeline([
    ('featextract', FeatureExtractor()),
    ('union', FeatureUnion(
        transformer_list=[
            # pipeline for tf-idf on the free-text column
            ('text', Pipeline([
                ('selector', ItemSelector(key='TEXT')),
                ('count_vec', TfidfVectorizer(max_features=5000)),
                ('tfidf', TfidfTransformer())])),
            # pipeline for the ATA system number column
            ('ata', Pipeline([
                ('selector', ItemSelector(key="ATA_SYS_NO")),
                ('atas', convert2dict()),
                ('vect', DictVectorizer())]))
        ])),
    ('model', OneVsRestClassifier(RandomForestClassifier(n_estimators=200, n_jobs=5))),
])
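
For reference, here is a minimal sketch of how this pipeline might be fit and scored with a weighted F1; the names X_train/X_test/y_train/y_test are hypothetical and assumed to carry the TEXT and ATA_SYS_NO columns and the 60-class labels:

from sklearn.metrics import f1_score

# X_train/X_test: pandas DataFrames with TEXT and ATA_SYS_NO columns (assumed names)
# y_train/y_test: the 60-class labels
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print(f1_score(y_test, y_pred, average='weighted'))  # weighted F1, as discussed in the comments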

PySpark code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import RandomForestClassifier as RF

Tokenizer1 = Tokenizer(inputCol="TEXT", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=4000)
idf = IDF(inputCol="rawFeatures", outputCol="tfidffeatures")
# labelIndexer maps the string label column to "componentIndex" (input column name assumed; not shown originally)
labelIndexer = StringIndexer(inputCol="COMPONENT", outputCol="componentIndex")
rf = RF(labelCol="componentIndex", featuresCol="tfidffeatures", numTrees=500)
pipeline = Pipeline(stages=[Tokenizer1, hashingTF, idf, labelIndexer, rf])
(trainingData, testData) = df.randomSplit([0.8, 0.2])
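
For completeness, a minimal sketch of how the Spark pipeline might be fit and scored; MulticlassClassificationEvaluator's "f1" metric is the weighted F1, so it is comparable to the scikit-learn number above (variable names follow the code block above):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

model = pipeline.fit(trainingData)
predictions = model.transform(testData)
evaluator = MulticlassClassificationEvaluator(labelCol="componentIndex",
                                              predictionCol="prediction",
                                              metricName="f1")  # weighted F1
print(evaluator.evaluate(predictions))
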
  • On the surface of it, it doesn't look like you are doing equivalent things. Firstly, you are using `OneVsRest` on a `RandomForestClassifier` in sklearn, which seems strange since decision trees naturally support multiclass situations. Furthermore, you are using the default arguments to the `sklearn` version, which don't seem to correspond to the default parameters of the Spark version: sklearn defaults to `10` estimators, i.e. `10` trees, whereas you are setting `numTrees=500`. To get a fair comparison, you need to make sure all these parameters are similar – juanpa.arrivillaga Sep 21 '17 at 19:36
  • Furthermore, your sklearn TfIdf transform seems to be using `max_features=5000`, whereas you have `numFeatures=4000` for your Spark approach. And even *more importantly*, why are you using decision trees with a TfIdf-transformed text corpus? – juanpa.arrivillaga Sep 21 '17 at 19:44
  • OK, I agree that for this example I should've kept the setups consistent between scikit-learn and PySpark. But even if I set my max features to 5000 and my number of trees to 500, I still get a terrible F1 score with Spark. I did try OneVsRest with RF in PySpark and it didn't help at all. With or without OneVsRest it predicts the same class for all of my test samples. – dixie.0312 Sep 21 '17 at 19:52
  • I am using random forest because I am planning to combine continuous features with text. However, if the text isn't even working (which is the heart of the data), then my model is going to be pretty bad. – dixie.0312 Sep 21 '17 at 19:54
  • How many observations are you dealing with? If your feature set is around 5000, then for trees not to overfit, you want your number of observations to be much larger. But again, saying you want to combine continuous features with text doesn't explain why you are using trees. Continuous features work well with *lots* of different models, and trees are nice for working with *non-continuous* features, so that doesn't explain why you are using trees. Also, your *text has been converted to a continuous feature space - TfIdf*, but that still – juanpa.arrivillaga Sep 21 '17 at 19:57
  • Although, 60 classes is going to be tough any way you cut it. But regardless, you should start by making sure all the parameters of your trees, and the transformations you are doing, are as close to equivalent as possible (see the parameter-parity sketch after these comments). Again, **why** OneVsRest? – juanpa.arrivillaga Sep 21 '17 at 19:58
  • My number of observations is 450,000. Doing this in scikit-learn takes 8 hours. Doing it in Spark takes less than 15 minutes, but gives me terrible results. – dixie.0312 Sep 21 '17 at 19:59
  • Agreed. I did try Naive Bayes, but the results are still pretty weak (50-60% F1 score). However, it is at least able to predict across all 60 classes, whereas random forest only used 12. What puzzles me is that the performance of Spark vs. scikit-learn is so different even though the model parameters are only slightly different from each other. – dixie.0312 Sep 21 '17 at 20:01
  • The reason I chose OneVsRest is that it gives me better performance (at least when using scikit-learn) – dixie.0312 Sep 21 '17 at 20:02
  • Also, how are you calculating your metric? Usually, it's going to be different for OneVsRest vs other schemes, and there are different ways of doing it in the multiclass setting, so.... – juanpa.arrivillaga Sep 21 '17 at 20:03
  • It's a weighted f1 score for both cases – dixie.0312 Sep 21 '17 at 20:07
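
Picking up on the parameter-parity point in the comments, here is a hedged sketch (not a verified fix) of making the two forests more comparable. scikit-learn grows trees to unlimited depth by default, while Spark defaults to maxDepth=5 and caps it at 30, which could plausibly explain much of the gap; the specific values below (maxDepth=30, 'sqrt' feature subsampling) are illustrative assumptions, not recommendations from the thread. The PySpark OneVsRest wrapper is also shown since one-vs-rest was used on the scikit-learn side:

from sklearn.ensemble import RandomForestClassifier
from pyspark.ml.classification import RandomForestClassifier as RF, OneVsRest

# scikit-learn side: make the forest settings explicit instead of relying on defaults
skl_rf = RandomForestClassifier(n_estimators=500, max_depth=30,
                                max_features='sqrt', n_jobs=5)

# Spark side: match tree count, depth and feature subsampling as closely as possible
# (Spark requires maxDepth <= 30)
spark_rf = RF(labelCol="componentIndex", featuresCol="tfidffeatures",
              numTrees=500, maxDepth=30, featureSubsetStrategy="sqrt")

# PySpark analogue of sklearn's OneVsRestClassifier, if the one-vs-rest scheme is kept
ovr = OneVsRest(classifier=spark_rf, labelCol="componentIndex",
                featuresCol="tfidffeatures")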

0 Answers