I am trying to teach myself some machine learning techniques and I'm having trouble figuring out the cause of this error.
An unhandled exception is raised and the program stops when the model attempts to fit the data.
TypeError was unhandled by user code
Message: float() argument must be a string or a number
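As far as I can tell, the classifier tries to convert its input to a float array internally, and NumPy alone reproduces the same TypeError when given dictionaries (a toy sketch with made-up words, not my actual data):

```python
import numpy as np

# Toy "word count" dictionaries standing in for my real features.
features = [{'good': 2, 'bed': 1}, {'bad': 1}]

err = None
try:
    # Roughly what happens to the features internally, as far as I can tell.
    np.asarray(features, dtype=np.float64)
except TypeError as exc:
    err = exc
print(type(err).__name__)  # TypeError
```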
I know it's related to the training features being passed into the classifier, but I don't understand why the feature isn't valid after the split. I've reduced the data size to only 10 elements while I figure this out.
Querying the third element of the features before the split yields the dictionary below. I know the split function randomises the split of the data, so trainFeatures[2] could correspond to a different element of dataFeatures, but each of these elements appears to be valid.
> dataFeatures[2]
{'Very': 1, 'and': 2, 'anyone': 1, 'bed': 1, 'comfortable': 1, 'for': 1, 'full': 1, 'it': 1, 'looking': 1, 'looks...fit': 1, 'of': 1, 'perfectly...would': 1, 'quilt': 1, 'recommend': 1, ...}
Querying the same element after the split yields
> trainFeatures[2]
error: 2L
The error doesn't appear until I attempt to fit the training data to the model.
I also get the same error when I split the data manually.
The data consists of a product name, a review, and a rating out of 5. The data can be found here: https://d396qusza40orc.cloudfront.net/phoenixassets/amazon_baby.csv
A smaller sample of the data: https://drive.google.com/file/d/0B_fZt96_g2ZGYWtXWjhaVmhEZDQ/view?usp=sharing
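For reference, the relevant columns look roughly like this (hand-typed rows mirroring the CSV layout, not a verbatim excerpt):

```python
import pandas as pd

# Illustrative rows only; the real file has thousands of reviews.
sample = pd.DataFrame({
    'name':   ['Planetwise Wipe Pouch', 'Annas Dream Full Quilt'],
    'review': ['it came early and was not disappointed.',
               'Very soft and comfortable and warmer than it looks.'],
    'rating': [5, 5],
})
print(list(sample.columns))  # ['name', 'review', 'rating']
```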
Here is my code:
def addWordCountDictionary(data):
    import string
    from collections import Counter
    import pandas as pd
    table = string.maketrans("", "")
    # Add an empty dictionary to each product.
    data['wordCount'] = [dict() for x in range(len(data))]
    for index in range(len(data)):
        if not pd.isnull(data['review'][index]):
            review = data['review'][index]
            # str.translate returns a new string, so assign the result back.
            review = review.translate(table, string.punctuation)
            data.at[index, 'wordCount'] = dict(Counter(review.split()))
        else:
            data.at[index, 'wordCount'] = {'': 0}
    return data
# Main function
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Read the data into a data frame
data = pd.read_csv("amazon_baby.csv")
data = data[0:10]
# Generate data set with word count dictionary
data = addWordCountDictionary(data)
# Assign positive and negative sentiment based on the number of stars in the rating
data['sentiment'] = 0
data.loc[data['rating'] >= 4, 'sentiment'] = 1
dataFeatures = data['wordCount']
Labels = data['sentiment']
trainFeatures, testFeatures, trainLabels, testLabels = train_test_split(dataFeatures, Labels, test_size=0.2, random_state=0)
# Manual split that produces the same error:
#trainFeatures = dataFeatures[:8]
#testFeatures = dataFeatures[8:]
#trainLabels = Labels[:8]
#testLabels = Labels[8:]
sentimentModel = LogisticRegression(penalty='l2', C=1)
sentimentModel.fit(trainFeatures, trainLabels)
indices = np.argsort(sentimentModel.coef_)
print(indices)
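I also experimented with converting the dictionaries into a numeric matrix first using DictVectorizer, which does let the model fit, but I'm not sure whether that's the intended approach here (toy data below, not the real reviews):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy count dictionaries shaped like data['wordCount'].
features = [{'good': 2, 'bed': 1}, {'bad': 1}, {'good': 1}]
labels = [1, 0, 1]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(features)  # one column per distinct word
model = LogisticRegression(penalty='l2', C=1)
model.fit(X, labels)
print(X.shape)  # (3, 3): three reviews, three distinct words
```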
Can anyone help explain/resolve this issue?