
Using ml, Spark 2.0 (Python) and a 1.2 million row dataset, I am trying to build a model that predicts purchase tendency with a Random Forest Classifier. However, when applying the model to the split test dataset, the prediction is always 0.

The dataset looks like:

[Row(tier_buyer=u'0', N1=u'1', N2=u'0.72', N3=u'35.0', N4=u'65.81', N5=u'30.67', N6=u'0.0'....

tier_buyer is the field used as a label indexer. The rest of the fields contain numeric data.

Steps

1.- Load the parquet file, and fill possible null values:

parquet = spark.read.parquet('path_to_parquet')
parquet.createOrReplaceTempView("parquet")
dfraw = spark.sql("SELECT * FROM parquet").dropDuplicates()
df = dfraw.na.fill(0)

2.- Create features vector:

features = VectorAssembler(
                inputCols = ['N1','N2'...],
                outputCol = 'features')

3.- Create string indexer:

label_indexer = StringIndexer(inputCol = 'tier_buyer', outputCol = 'label')

4.- Split the train and test datasets:

(train, test) = df.randomSplit([0.7, 0.3])

Resulting train dataset

[screenshot: train dataset]

Resulting Test dataset

[screenshot: test dataset]

5.- Define the classifier:

classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')

6.- Pipeline the stages and fit the train model:

pipeline = Pipeline(stages=[features, label_indexer, classifier])
model = pipeline.fit(train)

7.- Transform the test dataset:

predictions = model.transform(test)

8.- Output the test result, grouped by prediction:

predictions.select("prediction", "label", "features").groupBy("prediction").count().show()

[screenshot: grouped prediction counts]

As you can see, the outcome is always 0. I have tried multiple feature combinations in hopes of reducing the noise, and also loading from different sources and inferring the schema, with no luck and the same results.

Questions

  • Is the current setup, as described above, correct?
  • Could the null-value filling on the original DataFrame be a source of failure to effectively perform the prediction?
  • In the screenshots shown above, it looks like some features are in the form of a tuple and others of a list. Why? I'm guessing this could be a possible source of error. (They are representations of Dense and Sparse Vectors.)
  • I have answered a similar question on my personal gist a while ago. You can take a look at it: https://gist.github.com/eliasah/8709e6391784be0feb7fe9dd31ae0c0a – eliasah Nov 30 '16 at 15:48
  • 1
    Thank you @eliasah I wil take a look on stratified sampling. Do you know why some of the feature results appear in the form of a tuple and other as a list? – TMichel Nov 30 '16 at 16:19
  • Those are just representations of Dense and Sparse vectors. – eliasah Nov 30 '16 at 16:46
  • What about predictions? Would you mind grouping by prediction after training? I also believe I have answered one of your questions about the vector representation; it's not a source of error. One more question: does your data have duplicates? – eliasah Nov 30 '16 at 18:09

3 Answers

It seems your features [N1, N2, ...] are strings. You may want to cast all of them to FloatType() or something along those lines. It may also be prudent to fillna() after the type casting.


Your training dataset is highly imbalanced.

Training samples with label 0 = 896389
Training samples with label 1 = 11066
Total training samples        = 907455

By predicting label 0 for every record, your model achieves an accuracy of 98.78% (896389/907455), which, as you have noted, is useless: it completely fails to identify samples with label = 1.

To create a better training dataset with a more balanced distribution of samples across the two output labels, you can use two approaches:

  1. Undersample the majority class, e.g. take ~11,000 random samples with label = 0 to match the 11,066 samples with label = 1.
  2. Oversample the minority class, i.e. create copies of the rows with label = 1.

In this case, your minority class is only about 1.2% the size of the majority class, so simply duplicating the data will lead to its own overfitting issues. You can consider SMOTE for synthetic minority oversampling.


Highly imbalanced dataset.

Upsample minority class, downsample majority class, or both.

SMOTE can help here: it generates synthetic minority-class samples instead of exact duplicates.