A small question about prediction/forecasting using Spark ML and Naive Bayes, please.

I have a very simple dataset: a timestamp representing a day, and the number of pancakes sold that day:

dataSetPancakes.show();

+----------+-----+
|      time|label|
+----------+-----+
|1622505600|    1|
|1622592000|    0|
|1622678400|    3|
|1622764800|    1|
|1622851200|    1|
|1622937600|    1|
|1623024000|    1|
|1623110400|    2|
|1623196800|    2|
|1623283200|    0|
+----------+-----+
only showing top 10 rows

Very simple: I just want to predict how many pancakes will be sold tomorrow, the day after, and so on.

I therefore tried the Naive Bayes model, following the tutorial at https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes, and wrote:

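    import org.apache.spark.ml.classification.NaiveBayes;
    import org.apache.spark.ml.classification.NaiveBayesModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.linalg.DenseVector;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time"}).setOutputCol("features");
    Dataset<Row> vectorData = vectorAssembler.transform(dataSetPancakes);
    NaiveBayes naiveBayes = new NaiveBayes();
    NaiveBayesModel model = naiveBayes.fit(vectorData);
    Dataset<Row> predictions = model.transform(vectorData);
    predictions.show();
    model.predict(new DenseVector(new double[]{getTomorrowTimestamp()}));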

I do see results such as:

-RECORD 0--------------------------------------------------------------------------------------------------------------
 time          | 1622505600                                                                                            
 label         | 1                                                                                                     
 features      | [1.6225056E9]                                                                                         
 rawPrediction | [-0.9400072584914714,-1.0831081021321447,-1.702147310538368,-2.5494451709255714,-4.564348191467836]   
 probability   | [0.39062499999999994,0.33854166666666663,0.18229166666666666,0.07812500000000001,0.01041666666666667] 
 prediction    | 0.0                                                                                                   
-RECORD 1--------------------------------------------------------------------------------------------------------------
 time          | 1622592000                                                                                            
 label         | 0                                                                                                     
 features      | [1.622592E9]                                                                                          
 rawPrediction | [-0.9400072584914714,-1.0831081021321447,-1.702147310538368,-2.5494451709255714,-4.564348191467836]   
 probability   | [0.39062499999999994,0.33854166666666663,0.18229166666666666,0.07812500000000001,0.01041666666666667] 
 prediction    | 0.0                                                                                                   

But the prediction itself is always 0.0, including the one for tomorrow.

May I ask what the root cause of this issue is, please?

Thank you

PatPanda

1 Answer

You should not evaluate the model on the same dataset you trained it on. Otherwise you are only checking how well it reproduces data it has already seen, not how well it predicts. Split the data first:

Dataset<Row>[] splits = vectorData.randomSplit(new double[]{0.6, 0.4}, 1234L);
Dataset<Row> train = splits[0];
Dataset<Row> test = splits[1];

Also, it is entirely possible that the algorithm learns that the most probable outcome for any day is 0. There is no real relation between a raw date and the sales count: the dates never recur, and therefore no meaningful prediction can be made from them. Naive Bayes also does not grasp that these entries are a series of events. It only estimates how probable each value of "label" is when the feature has a given value, for example "1622505600". In your output the probability vector is identical for both records, which suggests the timestamp is not influencing the result at all.
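
To illustrate why the timestamp cannot change the outcome, here is a minimal sketch of the textbook multinomial Naive Bayes score with additive smoothing, written in plain Java rather than Spark. The class `SingleFeatureNaiveBayes`, its method names, and the smoothing value `lambda = 1.0` are assumptions made for this illustration, and only the ten rows shown in the question are used. With a single feature, the smoothed feature likelihood is 1 for every class, so its log is 0 and the score reduces to the log prior of each class.

```java
import java.util.Arrays;

// Illustrative only: not Spark's implementation, just the textbook
// multinomial Naive Bayes score for a model with exactly one feature.
class SingleFeatureNaiveBayes {

    // Returns one log score per class for the given query value.
    static double[] logScores(double[] feature, int[] label,
                              int numClasses, double lambda, double query) {
        double[] count = new double[numClasses]; // rows per class
        double[] sum = new double[numClasses];   // feature total per class
        for (int i = 0; i < label.length; i++) {
            count[label[i]] += 1.0;
            sum[label[i]] += feature[i];
        }
        int numFeatures = 1;
        double[] scores = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            double logPrior = Math.log(count[c] + lambda)
                    - Math.log(label.length + numClasses * lambda);
            // With one feature, numerator and denominator are the same
            // number, so logTheta is always 0.
            double logTheta = Math.log(sum[c] + lambda)
                    - Math.log(sum[c] + numFeatures * lambda);
            scores[c] = logPrior + query * logTheta;
        }
        return scores;
    }

    // Index of the largest score, i.e. the predicted class.
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] label = {1, 0, 3, 1, 1, 1, 1, 2, 2, 0};
        double[] time = new double[label.length];
        for (int i = 0; i < time.length; i++) {
            time[i] = 1622505600.0 + i * 86400.0; // one row per day
        }
        double[] lastDay = logScores(time, label, 4, 1.0, 1623283200.0);
        double[] nextDay = logScores(time, label, 4, 1.0, 1623369600.0);
        System.out.println(Arrays.toString(lastDay));
        System.out.println(Arrays.toString(nextDay));
        System.out.println("Same scores: " + Arrays.equals(lastDay, nextDay));
    }
}
```

Whatever timestamp is passed as `query`, the scores stay the same, so the predicted class is simply the most frequent label in the training data.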

I suggest using something like the day of the week as the feature. It recurs, so the model can learn on which weekdays sales are especially high.
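
As a rough sketch of the weekday idea, the snippet below derives the day of the week from an epoch-seconds timestamp using `java.time` and turns it into a one-hot vector. The class name `DayOfWeekFeature` and its methods are hypothetical, and it assumes the timestamps mark UTC days; a different time zone would need a different `ZoneId`. One-hot encoding is my suggestion here, so that the weekday is not treated as a magnitude.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Arrays;

// Hypothetical helper: turns an epoch-seconds timestamp into a weekday feature.
class DayOfWeekFeature {

    // Assumes the timestamp is in epoch seconds and days are UTC days.
    static DayOfWeek dayOfWeek(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds)
                .atZone(ZoneOffset.UTC)
                .getDayOfWeek();
    }

    // One-hot vector of length 7: index 0 is Monday, index 6 is Sunday.
    static double[] oneHot(DayOfWeek day) {
        double[] vector = new double[7];
        vector[day.getValue() - 1] = 1.0;
        return vector;
    }

    public static void main(String[] args) {
        long firstRow = 1622505600L; // first timestamp from the question
        DayOfWeek day = dayOfWeek(firstRow);
        System.out.println(day + " -> " + Arrays.toString(oneHot(day)));
    }
}
```

The resulting seven values could then be assembled into the "features" column in place of the raw timestamp.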

Alternatively, you could give it a second feature such as yesterday's sales. This would let the algorithm base its prediction on the day before.
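
A minimal sketch of the "yesterday's sales" feature in plain Java follows. The class `LagFeature` and the method `withPreviousDay` are hypothetical names, and the code assumes the sales are ordered by day with no missing days. In Spark itself, a window function such as `lag` would likely be the natural way to build the same column.

```java
import java.util.Arrays;

// Hypothetical helper: pairs each day's sales with the previous day's sales.
class LagFeature {

    // Returns rows of {yesterdaySales, todaySales}. The first day is dropped
    // because it has no previous day to use as a feature.
    static int[][] withPreviousDay(int[] sales) {
        if (sales.length < 2) {
            return new int[0][2];
        }
        int[][] rows = new int[sales.length - 1][2];
        for (int i = 1; i < sales.length; i++) {
            rows[i - 1][0] = sales[i - 1]; // feature: sales the day before
            rows[i - 1][1] = sales[i];     // label: sales on this day
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] sales = {1, 0, 3, 1, 1, 1, 1, 2, 2, 0}; // labels from the question
        for (int[] row : withPreviousDay(sales)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

To predict tomorrow, the feature value would be today's sales, which is known at prediction time.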

3Fish