I have a data set containing both categorical and numerical columns, and my target column is also categorical. I am using scikit-learn with Python 3.4. I know that scikit-learn needs all categorical values to be converted to numerical values before any machine learning method can be applied.
How should I transform my categorical columns to numerical values? I tried a lot of things, but I keep getting different errors such as "str" object has no 'numpy.ndarray' object has no attribute 'items'.
Here is an example of my data:
UserID LocationID AmountPaid ServiceID Target
29876 IS345 23.9876 FRDG JFD
29877 IS712 135.98 WERS KOI
My dataset is saved in a CSV file. Here is the short code I wrote to give you an idea of what I want to do:
import pandas as pd
# Read the CSV file
data_dir = 'C:/Users/davtalab/Desktop/data/'
train_file = data_dir + 'train.csv'
train = pd.read_csv(train_file)
# Numeric columns:
x_numeric_cols = train['AmountPaid']
# Categorical columns (a list of three names, separated by commas rather than joined with +):
categorical_cols = ['UserID', 'LocationID', 'ServiceID']
# .values returns the underlying NumPy array (as_matrix() was removed in later pandas versions)
x_cat_cols = train[categorical_cols].values
y_target = train['Target'].values
I need x_cat_cols to be converted to numeric values and then combined with x_numeric_cols, so that I have my complete input (x) values.
Then I need to convert my target column to numeric values as well and use that as my final target (y) column.
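To make the goal concrete, here is a minimal sketch of the kind of transformation I have in mind. It is only one possible approach, not necessarily the right one: it assumes one-hot encoding of the categorical columns with pandas.get_dummies and integer encoding of the target with scikit-learn's LabelEncoder. The small DataFrame is built from the two sample rows above so that the snippet runs on its own, and the names numeric_cols, x, y and label_encoder are mine.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for pd.read_csv(train_file): the two sample rows shown above
train = pd.DataFrame({
    'UserID': [29876, 29877],
    'LocationID': ['IS345', 'IS712'],
    'AmountPaid': [23.9876, 135.98],
    'ServiceID': ['FRDG', 'WERS'],
    'Target': ['JFD', 'KOI'],
})

numeric_cols = ['AmountPaid']
categorical_cols = ['UserID', 'LocationID', 'ServiceID']

# One-hot encode the listed columns (even the integer-valued UserID);
# AmountPaid is left untouched and stays in the same frame
x = pd.get_dummies(train[numeric_cols + categorical_cols], columns=categorical_cols)

# Map each target label to an integer; the encoder remembers the mapping
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(train['Target'])

print(x.columns.tolist())
print(y, label_encoder.classes_)
```

With this approach label_encoder.inverse_transform(...) would turn predicted integers back into the original target labels. One-hot encoding a column with many distinct values, such as UserID, produces one column per value, so this may not scale well to the full data set.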
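Then I want to train a Random Forest using these two complete sets:
# RF is assumed here to be scikit-learn's RandomForestClassifier, since the target is categorical
from sklearn.ensemble import RandomForestClassifier as RF
rf = RF(n_estimators=n_trees, max_features=max_features, verbose=verbose, n_jobs=n_jobs)
rf.fit(x_train, y_train)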
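For reference, the whole flow I am after could perhaps be sketched as a single pipeline. This is an assumption on my side, not something I have working: it relies on ColumnTransformer and OneHotEncoder, which need a more recent scikit-learn (0.20 or later) than the one I currently have, and the sample rows, n_estimators=10 and random_state=0 are placeholder values. The target is passed as strings here, since scikit-learn classifiers accept string class labels.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the CSV: the two sample rows from the question
train = pd.DataFrame({
    'UserID': [29876, 29877],
    'LocationID': ['IS345', 'IS712'],
    'AmountPaid': [23.9876, 135.98],
    'ServiceID': ['FRDG', 'WERS'],
    'Target': ['JFD', 'KOI'],
})

categorical_cols = ['UserID', 'LocationID', 'ServiceID']
numeric_cols = ['AmountPaid']
feature_cols = categorical_cols + numeric_cols

# One-hot encode the categorical columns and pass AmountPaid through unchanged;
# handle_unknown='ignore' avoids an error on categories unseen during fit
preprocess = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough',
)

model = Pipeline([
    ('preprocess', preprocess),
    ('rf', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Fit on the raw columns; the pipeline applies the encoding internally
model.fit(train[feature_cols], train['Target'])
predictions = model.predict(train[feature_cols])
print(predictions)
```

Keeping the encoder inside the pipeline means the same encoding learned on the training data is applied automatically at prediction time.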
Thanks for your help!