I have text data and I am trying to predict one of 60 classes. When I train a random forest on it with scikit-learn, I get an F1 score of 78%.
However, when I set up what I believe is the same model in PySpark, I only get 30%, which is way too low. What is going on? Maybe I am not setting it up correctly. Also, in PySpark the random forest only ever predicts about 12 distinct labels, even though I have 60.
Scikit-learn code:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
# FeatureExtractor, ItemSelector and convert2dict are my own custom transformers, defined elsewhere

rf_model = Pipeline([
    ('featextract', FeatureExtractor()),
    ('union', FeatureUnion(
        transformer_list=[
            # pipeline for TF-IDF on the free text
            ('text', Pipeline([
                ('selector', ItemSelector(key='TEXT')),
                ('count_vec', TfidfVectorizer(max_features=5000)),
                ('tfidf', TfidfTransformer())])),
            # pipeline for the ATA system number
            ('ata', Pipeline([
                ('selector', ItemSelector(key="ATA_SYS_NO")),
                ('atas', convert2dict()),
                ('vect', DictVectorizer())]))
        ])),
    ('model', OneVsRestClassifier(RandomForestClassifier(n_estimators=200, n_jobs=5))),
])
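The 78% figure comes from an evaluation step along these lines (a minimal sketch; the weighted averaging and the split variable names are assumptions, since that code is not shown here):

from sklearn.metrics import f1_score

# X_train/X_test/y_train/y_test stand in for my actual train/test split
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print(f1_score(y_test, y_pred, average='weighted'))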
PySpark code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import RandomForestClassifier as RF
# labelIndexer is a StringIndexer (defined elsewhere) that produces the "componentIndex" label column

Tokenizer1 = Tokenizer(inputCol="TEXT", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=4000)
idf = IDF(inputCol="rawFeatures", outputCol="tfidffeatures")
rf = RF(labelCol="componentIndex", featuresCol="tfidffeatures", numTrees=500)

pipeline = Pipeline(stages=[Tokenizer1, hashingTF, idf, labelIndexer, rf])
(trainingData, testData) = df.randomSplit([0.8, 0.2])
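The 30% comes from fitting the pipeline and scoring it roughly like this (a minimal sketch; the evaluator settings are assumptions, since the evaluation code is not shown here):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

model = pipeline.fit(trainingData)
predictions = model.transform(testData)
evaluator = MulticlassClassificationEvaluator(
    labelCol="componentIndex", predictionCol="prediction", metricName="f1")
print(evaluator.evaluate(predictions))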