Data type for Gaussian Naive Bayes classification using sklearn: how to clean data

Question

I'm trying to classify mobile phones according to their features, but when I apply Gaussian Naive Bayes through sklearn I get the error shown below. The code:

clf = GaussianNB()
clf.fit(X_train,y_train)
GaussianNB()
accuracy = clf.score(X_test,y_test)
print(accuracy)

error:

ValueError                                Traceback (most recent call last)
<ipython-input-18-e9515ccc2439> in <module>()
      2 clf.fit(X_train,y_train)
      3 GaussianNB()
----> 4 accuracy = clf.score(X_test,y_test)
      5 print(accuracy)

/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    347         """
    348         from .metrics import accuracy_score
--> 349         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    350 
    351 

/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict(self, X)
     63             Predicted target values for X
     64         """
---> 65         jll = self._joint_log_likelihood(X)
     66         return self.classes_[np.argmax(jll, axis=1)]
     67 

/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
    422         check_is_fitted(self, "classes_")
    423 
--> 424         X = check_array(X)
    425         joint_log_likelihood = []
    426         for i in range(np.size(self.classes_)):

/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float:

My dataset was scraped, so it contains string values as well as floats. It would be helpful if someone could suggest how I can clean the data and avoid the error.

Vizag · Answer 1 · 2018-05-27T06:52:54.577

1

ValueError: could not convert string to float

I think this says it all. Every value in your dataset needs to be a float (or convertible to one), so that the data has one consistent numeric type.

To convert a string to a float in Python:

>>> a = "123.345"
>>> float(a)
123.345
>>> int(float(a))
123
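
Scraped values are often not all parseable, so a bare float() call can raise the same ValueError. The following is a minimal sketch, assuming that unparseable entries should become NaN so they can be dealt with later; the helper name to_float and the sample values are made up for illustration and are not from the question's dataset.

```python
import math

def to_float(value):
    """Return value as a float, or NaN if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Empty strings, text such as "n/a", and None all end up here.
        return math.nan

# Hypothetical scraped column mixing numeric strings, numbers and junk.
raw = ["123.345", " 4 ", "", "n/a", 16, None]
cleaned = [to_float(v) for v in raw]
print(cleaned)
```

The NaN entries still have to be dropped or filled before fitting, because GaussianNB does not accept missing values.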

edited May 27 '18 at 06:52

answered May 27 '18 at 06:46

Vizag


Also, it is easier to answer a question if all the relevant information is provided in the question. You should post a snippet (a small subset) of the dataset with the question too. – Vizag May 27 '18 at 06:48
Is there a way to do that for datasets that contain strings? – Kiran Pun May 27 '18 at 06:50
I will do that! I will edit the question and add the dataset. – Kiran Pun May 27 '18 at 06:50
I have edited my answer to include how to convert string to float. Do check. – Vizag May 27 '18 at 06:53
So is your problem fixed now? – Vizag May 27 '18 at 07:00
Yes, the previous error isn't present anymore, but I'm getting the following error now: NameError Traceback (most recent call last) in () ----> 1 clf = GaussianNB() 2 clf.fit(X_train,y_train) 3 GaussianNB() 4 ccuracy = clf.score(X_test.astype('float'),y_test.astype('float')) 5 print(accuracy) NameError: name 'GaussianNB' is not defined. I have imported GaussianNB, though. – Kiran Pun May 27 '18 at 07:03
Have you installed sklearn on your computer? – Vizag May 27 '18 at 07:11
It doesn't come pre-installed. You have to install it separately. – Vizag May 27 '18 at 07:12
I installed it with conda and no longer have that error. Thank you. – Kiran Pun May 27 '18 at 07:30

score 1 · Accepted Answer · answered May 27 '18 at 06:52

Try casting the test data to float before scoring:

accuracy = clf.score(X_test.astype('float'),y_test.astype('float'))
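
As a rough illustration of what that cast does, here is a small sketch assuming X_test is a NumPy object array like the one printed in the comments (only three of its rows are reproduced). Numeric strings are parsed, but missing entries stay NaN, and a non-numeric string would still raise the original ValueError.

```python
import numpy as np

# Three rows shaped like the printed X_test: object dtype, with the second
# column holding numeric strings and one missing value.
X_test = np.array(
    [[32, "4", 12, 1],
     [8, "1", 8, 0],
     [32, np.nan, 16, 1]],
    dtype=object,
)

X_float = X_test.astype(float)         # "4" becomes 4.0; NaN stays NaN
has_missing = np.isnan(X_float).any()  # True when any entry is still missing
print(X_float.dtype, has_missing)
```

If has_missing is true, the cast alone is not enough: the NaN entries have to be handled before calling score. The same cast would need to be applied to X_train before fit.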

answered May 27 '18 at 06:52

Louis Ng


It still shows the same error. – Kiran Pun May 27 '18 at 07:30
Can you do print(X_test) and print(y_test) to show us what's inside? – Louis Ng May 27 '18 at 07:51
print(X_test): [[32 '4' 12 1] [8 '1' 8 0] [32 '3' 13 0] [64 '6' 16 1] [16 '2' 12 0] [16 '3' 13 0] [16 '2' 8 0] [64 '4' 16 1] [128 '6' 16 1] [8 '1' 8 0] [128 '6' 12 1] [64 '4' 12 1] [64 '4' 16 0] [128 '4' 12 1] [32 '4' 16 0] [32 nan 16 1] [16 nan 8 0] [16 nan 8 0] [16 '3' 13 0] [64 '6' 16 1]] print(y_test): [1 3 2 2 3 3 3 2 2 3 3 1 2 1 2 2 3 3 3 2] I did change the NaN values to -99999, but it still shows nan. – Kiran Pun May 27 '18 at 12:07
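
The replacement code isn't shown, so why the NaN values survive is a guess. Two common causes are comparing against NaN with == (which never matches) and not assigning the result of a replace call back to the variable. Below is a minimal sketch of two ways to handle the missing entries, assuming the features have already been cast to float; the small array and labels are made up, and whether dropping rows or filling with the column mean is appropriate depends on the data.

```python
import numpy as np

# Hypothetical float feature matrix with missing values in the second column.
X = np.array([[32.0, 4.0, 16.0, 1.0],
              [32.0, np.nan, 16.0, 1.0],
              [16.0, np.nan, 8.0, 0.0],
              [64.0, 6.0, 16.0, 1.0]])
y = np.array([2, 2, 3, 2])

# NaN never compares equal to anything, so this mask is all False.
naive_matches = (X == np.nan).sum()

# np.isnan is the reliable way to locate missing entries.
missing = np.isnan(X)

# Option 1: drop every row that has a missing value, and its label too.
keep = ~missing.any(axis=1)
X_dropped, y_dropped = X[keep], y[keep]

# Option 2: fill each missing entry with the mean of its column.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(missing, col_means, X)
```

Recent versions of scikit-learn also provide sklearn.impute.SimpleImputer for the second option. Whichever approach is used, it should be applied to both the training and the test features.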
