How to get predictions from string data in sklearn

Question

When I move data from a pandas DataFrame into sklearn to make predictions, string columns become a problem. I used LabelEncoder, but it seems to limit me to passing the encoded values instead of the original strings.

In sklearn's predict method, I want to predict on this input:

learn_to_machine=dtc.fit(X,Y)
test=[
    [128, 6 ,50, 'mobile_phone', 'Samsung', 6000],
    [512, 8, 65, 'mobile_phone', 'Huawei',5000]
        ]
answer=learn_to_machine.predict(test)
print(answer[0])
print(answer[1])
# 11399000
# 15304000

rather than this one:

learn_to_machine=dtc.fit(X,Y)
test=[
    [128, 6 ,50, 0, 2, 6000],
    [512, 8, 65, 0, 3,5000]
        ]
answer=learn_to_machine.predict(test)
print(answer[0])
print(answer[1])
# 11399000
# 15304000

If it helps, here's all my code:


import sqlalchemy
import pandas as pd
read_engine=sqlalchemy.create_engine('mysql+mysqlconnector://root:@localhost/six')
conn = read_engine.connect()
df_new=pd.read_sql_table('mobile1' ,con= conn )
df_new['price']=df_new['price'].astype(int)
df_new['ram']=df_new['ram'].astype(int)
df_new['battery']=df_new['battery'].astype(int)
df_new['size']=df_new['size'].astype(float)
df_new['camera']=df_new['camera'].mask(df_new['camera'] == '')
df_new['camera']=df_new['camera'].mask(df_new['camera'] == ' ')
df_new['camera']=df_new['camera'].mask(df_new['camera'] == '  ')
df_new['camera']=df_new['camera'].fillna(0)
df_new['camera']=df_new['camera'].astype(float)


X=df_new[['ram','size','camera','product','Brand','battery']].values
Y=df_new[['price']].values


from sklearn import preprocessing
product_enc=preprocessing.LabelEncoder()
product_enc.fit([char for char in X[:,4]])
X[:,4]=product_enc.transform(X[:,4])
product_enc.fit([ char for char in X[:,3]])
X[:,3]=product_enc.transform(X[:,3])
from sklearn import tree
dtc=tree.DecisionTreeClassifier()
learn_to_machine=dtc.fit(X,Y)

# when I run it with this input, it works
test=[
    [128, 6 ,50, 0, 2, 6000],
    [512, 8, 65, 0, 3,5000]
        ]

answer=learn_to_machine.predict(test)
print(answer[0])
print(answer[1])
# 11399000
# 15304000

When I tried to run it with this input:

test=[
    [128, 6 ,50, 'mobile_phone', 'Samsung', 6000],
    [512, 8, 65, 'mobile_phone', 'Huawei',5000]
        ]

this error was raised: ValueError: could not convert string to float: 'mobile_phone'

ciaran haines · Answer 1 · 2023-03-28T09:19:07.167

0

Edit: you have used a list comprehension for your fit method, which is unnecessary. Here are two versions. First, you should give your two label encoders different names: at the moment the second fit overwrites the first, so the Brand encoding is lost. Then you can transform new raw data with the same encoders before predicting.

With list comprehension:

product_enc=preprocessing.LabelEncoder()
product_enc.fit([char for char in X[:,3]])
X[:,3]=product_enc.transform(X[:,3])

company_enc=preprocessing.LabelEncoder()
company_enc.fit([ char for char in X[:,4]])
X[:,4]=company_enc.transform(X[:,4])

import numpy as np

# dtype=object keeps the numbers numeric and makes [:,3] indexing possible
test=np.array([
    [128, 6 ,50, 'mobile_phone', 'Samsung', 6000],
    [512, 8, 65, 'mobile_phone', 'Huawei',5000]
        ], dtype=object)
test_transform = test.copy()
test_transform[:,3] = product_enc.transform([char for char in test[:,3]])
test_transform[:,4] = company_enc.transform([char for char in test[:,4]])

With array function:

product_enc=preprocessing.LabelEncoder()
product_enc.fit(X[:,3])
X[:,3]=product_enc.transform(X[:,3])

company_enc=preprocessing.LabelEncoder()
company_enc.fit(X[:,4])
X[:,4]=company_enc.transform(X[:,4])

import numpy as np

# dtype=object keeps the numbers numeric and makes [:,3] indexing possible
test=np.array([
    [128, 6 ,50, 'mobile_phone', 'Samsung', 6000],
    [512, 8, 65, 'mobile_phone', 'Huawei',5000]
        ], dtype=object)
test_transform = test.copy()
test_transform[:,3] = product_enc.transform(test[:,3])
test_transform[:,4] = company_enc.transform(test[:,4])

answer=learn_to_machine.predict(test_transform)

That should work, but I haven't run it on a full code sample.
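For reference, the approach in this answer can be sketched end to end on a small made-up dataset. The training rows, the prices for the last two rows, the `encode` helper and `random_state=0` are assumptions added for illustration; only the column order and the two test rows come from the question. It is a minimal sketch, not a tested replacement for the original database-backed code.

```python
import numpy as np
from sklearn import preprocessing, tree

# Toy training data (made up): ram, size, camera, product, Brand, battery
X = np.array([
    [128, 6.0, 50, 'mobile_phone', 'Samsung', 6000],
    [512, 8.0, 65, 'mobile_phone', 'Huawei', 5000],
    [64, 5.5, 12, 'mobile_phone', 'Nokia', 3000],
    [256, 10.1, 8, 'tablet', 'Samsung', 7000],
], dtype=object)
Y = np.array([11399000, 15304000, 4000000, 9000000])

# One encoder per string column, so neither overwrites the other
product_enc = preprocessing.LabelEncoder().fit(X[:, 3])
company_enc = preprocessing.LabelEncoder().fit(X[:, 4])

def encode(rows):
    """Hypothetical helper: copy rows and replace both string columns with integer codes."""
    out = np.array(rows, dtype=object)
    out[:, 3] = product_enc.transform(out[:, 3])
    out[:, 4] = company_enc.transform(out[:, 4])
    return out.astype(float)

dtc = tree.DecisionTreeClassifier(random_state=0)
learn_to_machine = dtc.fit(encode(X), Y)

# Raw string input, as the question asks for
test = [
    [128, 6, 50, 'mobile_phone', 'Samsung', 6000],
    [512, 8, 65, 'mobile_phone', 'Huawei', 5000],
]
answer = learn_to_machine.predict(encode(test))
print(answer[0])
print(answer[1])
```

Keeping the encoding in one helper means training data and new inputs go through exactly the same fitted encoders. A label that was not present during fitting will still make `transform` raise a ValueError.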

edited Mar 28 '23 at 09:19

answered Mar 27 '23 at 13:47

ciaran haines


Thank you, but the line `test_transform[:,3] = product_enc.transform(test[:,3])` raises an error: `TypeError: list indices must be integers or slices, not tuple` – M.Namjoo Mar 27 '23 at 18:00
doesn't it work with `inverse_transform` ? – M.Namjoo Mar 28 '23 at 03:51
Inverse transform turns the number codes into the original labels (for outputting results). transform turns labels into numbers (for changing inputs). In my answer I've mixed up a list comprehension with an array function. I'll edit the answer now. – ciaran haines Mar 28 '23 at 08:50
Unfortunately it still gives the error `TypeError: list indices must be integers or slices, not tuple` on `test_transform[:,3] = product_enc.transform([char for char in test[:,3]])` in both of your methods. But, dear `@ciaranhaines`, I appreciate your effort. – M.Namjoo Mar 28 '23 at 11:57
I think I have further help for you. `test=[ [128, 6 ,50, 'mobile_phone', 'Samsung', 6000], [512, 8, 65, 'mobile_phone', 'Huawei',5000] ]` is a list of lists, but the other data you have is a numpy array. If I'm right, this means you should be able to convert your list of lists to an array, `test_transform = np.array(test, dtype=object)`, and then use `test_transform[:,3] = product_enc.transform(test_transform[:,3])`, or make a bigger change to use list index notation instead of, for example, `test[:,3]` – ciaran haines Mar 28 '23 at 16:42
This question is similar, but I still don't know how to apply its answer: https://stackoverflow.com/questions/46919816/how-to-get-original-data-from-normalized-array – M.Namjoo Apr 10 '23 at 20:37
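The two points raised in these comments, `transform` versus `inverse_transform` and the list indexing error, can be reproduced in a short sketch. The `brands` array and the variable names are assumptions made up for illustration; the `test` rows are the ones from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up brand column to fit the encoder on
brands = np.array(['Samsung', 'Huawei', 'Nokia', 'Samsung'], dtype=object)
enc = LabelEncoder().fit(brands)

codes = enc.transform(['Samsung', 'Huawei'])   # labels -> integer codes (for inputs)
labels = enc.inverse_transform(codes)          # integer codes -> labels (for outputs)
print(enc.classes_, codes, labels)

test = [
    [128, 6, 50, 'mobile_phone', 'Samsung', 6000],
    [512, 8, 65, 'mobile_phone', 'Huawei', 5000],
]

# A plain list of lists does not support [row, column] indexing
try:
    test[:, 4]
    message = None
except TypeError as err:
    message = str(err)
    print(message)

# Converting to an object array makes column slicing possible
test_arr = np.array(test, dtype=object)
brand_codes = enc.transform(test_arr[:, 4])
print(brand_codes)
```

So `inverse_transform` does not help when preparing inputs for `predict`. It is only useful for turning encoded values back into readable labels afterwards.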
