I have the following code so far:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)

Here is the output from the code above

print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Number of unique categories and number of cases in each category
del df_train['count']

df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())

Here is the output of the code above

print(df_test.dtypes)

df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
           'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(df_train['size_womenswear']) # Encode the 8 sizes as integers 0-7
print(df_train.head())
print(df_train.dtypes)

Here is the output for the code above

x = df_train.iloc[:, :-1].values # Independent variables: every column except the last
y = df_train.iloc[:, -1].values  # Dependent variable: the encoded size in the last column
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Train on 75% of the data, test on the remaining 25%
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)

#print(yPredicted)
print('Accuracy: ', accuracy_score(yTest, yPredicted))


I am not sure how to include the data I am using, but I am trying to predict 'size_womenswear'. There are 8 different sizes, which I have label-encoded, and I have moved this column to the end of the dataframe, so y is the dependent variable and x holds the independent variables (all the other columns).

I am using a Gaussian Naive Bayes classifier to try and classify the 8 different sizes and then test on 25% of the data. The results are not very good.

I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?

MRT

1 Answer

I can't comment yet, so I'm just throwing out some ideas:

You may need to deal with class imbalance, and you could try another model that fits the data better. Try the xgboost or lightgbm packages; given good data they usually perform well, but it really depends on the data.
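A minimal sketch of one way to address the imbalance while staying inside scikit-learn. Note that it swaps in `RandomForestClassifier` with `class_weight="balanced"` rather than xgboost or lightgbm, and that the data is a synthetic stand-in from `make_classification` (8 classes, one rare), not the real `uc_data_train.csv`. The class proportions and feature counts are assumptions for illustration only, so the printed scores say nothing about the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data (assumption): 8 classes, class 0 is rare.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=6, n_redundant=0,
    n_classes=8, n_clusters_per_class=1,
    weights=[0.02, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14],
    random_state=0,
)

# Stratify so the rare class appears in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "GaussianNB": GaussianNB(),
    # class_weight="balanced" reweights samples inversely to class frequency.
    "RandomForest (balanced)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Balanced accuracy averages recall over classes, so the rare class counts
    # as much as the common ones.
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: balanced accuracy = {scores[name]:.3f}")
```

Comparing both models on balanced accuracy rather than plain accuracy makes it harder for a model to look good just by ignoring the rare size.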

Also, check how you split train and test: do the resulting train and test sets have a similar distribution of your Y? That is very important.
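One way to check this, sketched under assumptions: the labels below are randomly generated stand-ins for the encoded sizes (the real data is not available here), and `class_shares` is a hypothetical helper. Passing `stratify=y` to `train_test_split` asks scikit-learn to keep the class proportions roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical labels (assumption): 8 encoded sizes, class 0 deliberately rare.
y = rng.choice(8, size=8000, p=[0.01, 0.09, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15])
X = rng.normal(size=(8000, 5))  # placeholder features

# stratify=y keeps each class's share the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

def class_shares(labels, n_classes=8):
    """Return the fraction of samples belonging to each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

train_shares = class_shares(y_train)
test_shares = class_shares(y_test)

for cls, (tr, te) in enumerate(zip(train_shares, test_shares)):
    print(f"class {cls}: train {tr:.3f}  test {te:.3f}")
```

Without `stratify`, the split is purely random, so a class with only a few hundred cases can end up noticeably over- or under-represented in the test set.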

Last, measuring performance for classification models can be tricky, so try some other metrics. Compute F1 scores, or draw a confusion matrix to see how your predictions compare with Y. Perhaps your model is assigning everything to one or just a few classes.
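A rough sketch of those checks using scikit-learn's metrics. The data is again a synthetic 8-class stand-in from `make_classification` (an assumption, not the real file), so the numbers it prints are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data (assumption): 8 classes.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=6, n_redundant=0,
    n_classes=8, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

labels = np.arange(8)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

# Column sums show how often each class is predicted; a column of zeros means
# the model never predicts that class at all.
print("Predictions per class:", cm.sum(axis=0))

# Macro F1 averages the per-class F1 scores with equal weight per class.
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
print("Macro F1:", round(macro_f1, 3))

# Per-class precision, recall and F1 in one table.
print(classification_report(y_test, y_pred, labels=labels, zero_division=0))
```

On the real data, a confusion matrix with most of its mass in one or two columns would confirm that the model is collapsing onto a few sizes.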

  • I don't know how to deal with class imbalance, but yes, for one class there are only a few hundred cases and the rest have tens of thousands. Are these other packages in sklearn? I will get back to you on the other things you've mentioned. – MRT Jun 21 '19 at 21:51
  • xgboost and lightgbm aren't part of sklearn; they need to be installed separately. Probably start with xgboost, as it gives slightly better results; lightgbm trains much faster, but its model performance is slightly weaker. – Aiden Zhao Jun 21 '19 at 22:06
  • If possible I would very much prefer to stay within sklearn or anaconda. That is what I'm most familiar with and what I want to gain experience using – MRT Jun 21 '19 at 22:10
  • I would really suggest trying packages outside of sklearn. It is a good package for learning data science and machine learning algorithms, models, and techniques, but it is not the best for every case. If you go to kaggle.com and look at other people's code, you will rarely find sklearn being used to fit a model, though its other utilities are good to have. That means you can combine sklearn's utility tools with any other machine learning package. – Aiden Zhao Jun 21 '19 at 22:25