
Using ml, Spark 2.0 (Python) and a 1.2 million row dataset, I am trying to build a model that predicts purchase tendency with a Random Forest Classifier. However, when applying the model to the split test dataset, the prediction is always 0.

The dataset looks like:

[Row(tier_buyer=u'0', N1=u'1', N2=u'0.72', N3=u'35.0', N4=u'65.81', N5=u'30.67', N6=u'0.0'....

tier_buyer is the field used as a label indexer. The rest of the fields contain numeric data.

Steps

1.- Load the parquet file, and fill possible null values:

parquet = spark.read.parquet('path_to_parquet')
parquet.createOrReplaceTempView("parquet")
dfraw = spark.sql("SELECT * FROM parquet").dropDuplicates()
df = dfraw.na.fill(0)

2.- Create features vector:

features = VectorAssembler(
                inputCols = ['N1','N2'...],
                outputCol = 'features')

3.- Create string indexer:

label_indexer = StringIndexer(inputCol = 'tier_buyer', outputCol = 'label')

4.- Split the train and test datasets:

(train, test) = df.randomSplit([0.7, 0.3])

Resulting train dataset

[screenshot: train dataset]

Resulting Test dataset

[screenshot: test dataset]

5.- Define the classifier:

classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')

6.- Pipeline the stages and fit the train model:

pipeline = Pipeline(stages=[features, label_indexer, classifier])
model = pipeline.fit(train)

7.- Transform the test dataset:

predictions = model.transform(test)

8.- Output the test result, grouped by prediction:

predictions.select("prediction", "label", "features").groupBy("prediction").count().show()

[screenshot: grouped prediction counts]

As you can see, the outcome is always 0. I have tried multiple feature combinations in hopes of reducing the noise, and also loading from different sources and inferring the schema, with no luck and the same results.

Questions

  • Is the current setup, as described above, correct?
  • Could the null-value filling on the original DataFrame be a source of failure to effectively perform the prediction?
  • In the screenshots shown above, it looks like some features are in the form of a tuple and others of a list. Why? I'm guessing this could be a possible source of error. (They are representations of Dense and Sparse Vectors.)
  • I have answered a similar question on my personal gist a while ago. You can take a look at it: https://gist.github.com/eliasah/8709e6391784be0feb7fe9dd31ae0c0a – eliasah Nov 30 '16 at 15:48
  • 1
    Thank you @eliasah I wil take a look on stratified sampling. Do you know why some of the feature results appear in the form of a tuple and other as a list? – TMichel Nov 30 '16 at 16:19
  • Those are just representations of Dense and Sparse vectors. – eliasah Nov 30 '16 at 16:46
  • What about predictions? Would you mind grouping by prediction after training? I also believe I have answered one of your questions about the vector representation; it's not a source of error. One more question: does your data have duplicates? – eliasah Nov 30 '16 at 18:09

3 Answers

It seems your features [N1, N2, ...] are strings. You may want to cast all of them to FloatType() or something along those lines. It may also be prudent to fillna() after the type casting.


Your training dataset is highly imbalanced.

Training samples with label 0 = 896389
Training samples with label 1 = 11066
Total training samples        = 907455

By predicting label 0 for every record, your model achieves an accuracy of 98.78% (896389/907455), which, as you have noted, is useless: it completely fails to identify samples with label = 1.

To create a better training dataset with a more balanced distribution of samples across the two output labels, you can use two approaches:

  1. Undersample the majority class, e.g. take ~11,000 random samples with label = 0 to match the 11,066 samples with label = 1.
  2. Oversample the minority class, i.e. create copies of the rows with label = 1.

In this case, your minority class is only about 1.2% the size of the majority class, so simply duplicating the data will lead to its own overfitting issues. You can consider SMOTE for synthetic minority oversampling.


Highly imbalanced dataset.

Upsample minority class, downsample majority class, or both.

SMOTE can help here: it generates synthetic minority-class samples instead of exact duplicates.