
I'm new to supervised machine learning. I've been reading books and articles about it, but I'm stuck on a problem (not stuck exactly; I just don't understand the logic behind classification algorithms). I am trying to classify records as wrong or not based on historical data. This is the original (training) data:

Name Office Age  isWrong
F1     1    32      0
F2     2    61      1
F3     1    35      0
F4     0    25      0
F5     1    36      0
F6     2    52      0
F7     2    48      0
F8     1    17      1
F9     2    51      0
F10    0    24      0
F11    4    34      1
F12    0    21      0
F13    2    51      0
F14    0    27      0
F15    3    37      1

(showing only the top 15 of 200 records)

A wrong record is any record that reports an age LOWER than 18 or HIGHER than 60, or an office location that is NOT in {0, 1, 2}. I have more records that show a 1 whenever any of these conditions are met. I trained my model with this dataset and created a test dataset to check the results. However, I end up getting 0 in the prediction column of every record. I used a Naïve Bayes approach because it assumes independence between the feature variables, which is my case (there is no relationship between the office number and age). I know there are other methods like Logistic Regression and SVC (SVM), but I assume those require some degree of relationship between the feature variables. Despite that, I still tried those two approaches and got the same results. Am I doing something wrong? Do I need to specify something before training my model?
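For reference, the labeling rule described above can be written out as a plain Java check. This is only an illustration of the rule itself (the class and method names are made up for this sketch, not part of the Spark code below):

```java
public class RecordValidator {
    // A record is "wrong" if age < 18, age > 60,
    // or the office location is not one of {0, 1, 2}.
    static boolean isWrong(int office, int age) {
        boolean badAge = age < 18 || age > 60;
        boolean badOffice = office < 0 || office > 2;
        return badAge || badOffice;
    }

    public static void main(String[] args) {
        System.out.println(isWrong(1, 32)); // F1  -> false
        System.out.println(isWrong(2, 61)); // F2  -> true (age > 60)
        System.out.println(isWrong(4, 34)); // F11 -> true (office is 4)
    }
}
```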

Here is what I did (very simple):

NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).show();

Here is dataset2 (top 15):

Name   Office Age
F1       9    36  //wrong, office is 9
F2       2    20
F3       1    17
F4       2    43
F5       2    90  // wrong, age is >60
F6       1    36
F7       1    40
F8       2    52
F9       2    49
F10      1    38
F11      0    28
F12      0    18
F13      1    40
F14      1    31
F15      2    45

But like I said, the prediction column displays 0 every time. Any idea why?

1 Answer


Don't look only at the hard prediction column that transform() produces; inspect the class probabilities as well, so you can see how confident the model actually is.

In scikit-learn, the function for this is:

predict_proba(X): Return probability estimates for the test vector X.

In Spark ML, NaiveBayesModel.transform() already appends rawPrediction and probability columns alongside prediction, so the following should work in your scenario:

NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel model = nb.fit(dataset);
model.transform(dataset2).select("probability", "prediction").show();
– Danyal Imran
  • You just said it: "Return probability estimates for the test vector X", yet you are passing my dataset2 as a parameter. But I get your point. I just tried that and got the same results. I iterate through every record of dataset2, but I still end up getting a 0 prediction result – Guillermo Herrera Jul 19 '17 at 15:15
  • 1
    I also followed what's on Apache's Spark DOCS: `// train the model NaiveBayesModel model = nb.fit(train); // Select example rows to display. Dataset predictions = model.transform(test);` – Guillermo Herrera Jul 19 '17 at 15:20
  • It's because you train the classifier with dataset and you test its performance using dataset2. I am glad your doubts are cleared @GuillermoHerrera – Danyal Imran Jul 19 '17 at 15:26
  • Forgot to add link to the docs: http://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes – Guillermo Herrera Jul 19 '17 at 15:28
  • So you are saying that I can only test my model using the same dataset...? I don't think that's true. You train a model and then test it with different data to observe the results. – Guillermo Herrera Jul 19 '17 at 15:32
  • @GuillermoHerrera that's what I said myself. But you can try testing it with the same data and observe that getting 100% is still hard even though the model has seen that data. By the way, there are a lot of ways to test your classifier; try looking up k-fold cross-validation, leave-one-out, and random-permutation cross-validation. – Danyal Imran Jul 19 '17 at 15:34
  • Why would I want to test my classifier? I'm just curious because I don't know about it. Why isn't it enough to just go for a simple approach as the one I just did? – Guillermo Herrera Jul 19 '17 at 20:08
  • suppose you make a cure for cancer, but you didn't actually test it out on a number of patients, when that cure is released, how sure are you that it will actually cure cancer when you have never tested it? @GuillermoHerrera – Danyal Imran Jul 20 '17 at 04:43
  • But that's not what I am doing here. I train my model with a set of records cataloged as normal or not: `NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong"); nb.fit(dataset)` then I want to predict based on this. I am training and testing my model. However, I don't get the results that I want. I don't know what you mean by "test my classifier" if I'm already testing it. – Guillermo Herrera Jul 20 '17 at 15:38
  • I've noticed now that the prediction value for each row in my dataset returns one when the office is higher than three (which is correct). However, it's like if it only takes into account that column. When the age is lower than 18 or higher than 60, I still get a zero (which is wrong). Why is this happening? Is there something I need to tune or add or modify? – Guillermo Herrera Jul 21 '17 at 17:49
  • The problem in your solution is that columns with higher values dominate columns with lower values (in this case, age dominates office). To solve this we use something known as ```feature scaling```, which normalizes the values across all columns so that the classifier has a better idea of the problem at hand. And by testing I mean the literal meaning of test, not your ```test``` dataset; it's like building a car and then selling it without actually testing it (drive, brakes, handling, etc.), so you won't know how it will perform in the actual environment (unseen data). @GuillermoHerrera – Danyal Imran Jul 21 '17 at 19:37
  • try giving this a read https://stackoverflow.com/questions/26225344/why-feature-scaling @GuillermoHerrera – Danyal Imran Jul 21 '17 at 19:37
  • Thanks. Also, this is an example with dummy data. I will be getting the real data and specifications next week. Someone mentioned I'm not getting the results wanted because this is a very simple example, meaning that the criteria is not complex enough. Could this also be a reason? – Guillermo Herrera Jul 21 '17 at 20:15
  • You won't always get the appropriate answer; you need to study the data properly, clean it, transform it, and do the math. After analyzing the data carefully, you'd know which classifier to try for the classification/regression task (though you won't always get good results, because the data may be very sparse, very varied, etc.). You used Naive Bayes in this case, but there are hundreds of approaches to any particular problem, and the hunger for better results is what really drives this community @GuillermoHerrera – Danyal Imran Jul 21 '17 at 21:58
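To illustrate the feature-scaling idea raised in the comments, here is a minimal min-max scaling sketch in plain Java (the class name is made up; Spark ML provides its own scaler transformers). It rescales each column to [0, 1] so a large-valued column like Age cannot dominate a small-valued one like Office:

```java
public class MinMaxScaler {
    // Rescale every column of the feature matrix to the [0, 1] range.
    static double[][] scale(double[][] data) {
        int rows = data.length, cols = data[0].length;
        double[][] out = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            // Find the column's min and max.
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (int i = 0; i < rows; i++) {
                // Constant columns map to 0 to avoid division by zero.
                out[i][j] = range == 0 ? 0 : (data[i][j] - min) / range;
            }
        }
        return out;
    }
}
```

After scaling, an Office of 2 and an Age of 61 both land on the same [0, 1] scale, so the classifier weighs them more evenly.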