
I am trying to teach myself some machine learning techniques, and I'm having trouble figuring out the cause of this error.

When the model attempts to fit the data, an unhandled exception is raised and the program stops:

TypeError was unhandled by user code
Message: float() argument must be a string or a number

I know it's related to the training features being passed into the classifier, but I don't understand why the features aren't valid after the split. I've reduced the data to only 10 elements while I figure this out.

Querying index 2 (the third element) of the features before the split yields the dictionary below. I know the split function randomises the split of the data, so trainFeatures[2] could relate to a different element of dataFeatures, but each of these elements appears to be valid.

> dataFeatures[2]
{'Very': 1, 'and': 2, 'anyone': 1, 'bed': 1, 'comfortable': 1, 'for': 1, 'full': 1, 'it': 1, 'looking': 1, 'looks...fit': 1, 'of': 1, 'perfectly...would': 1, 'quilt': 1, 'recommend': 1, ...}

Querying the same element after the split yields

> trainFeatures[2]
error: 2L
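
One possible reading of `error: 2L`, offered as an assumption rather than a confirmed diagnosis: the training subset is a pandas Series that keeps the original row labels, so `trainFeatures[2]` looks up the label 2 rather than the third position, and raises a `KeyError` if that row ended up in the test set. The sketch below uses a made-up Series and hand-picked labels in place of the real split; `.iloc` selects by position instead.

```python
import pandas as pd

# Hypothetical stand-in for dataFeatures: five rows labelled 0..4.
features = pd.Series([{'a': 1}, {'b': 2}, {'c': 3}, {'d': 4}, {'e': 5}])

# Hand-picked rows standing in for a shuffled training split;
# the original labels (4, 0, 3) are kept, and label 2 is absent.
train = features.loc[[4, 0, 3]]

try:
    train[2]                       # label-based lookup: label 2 is not in this subset
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

third_by_position = train.iloc[2]  # position-based lookup: third element (label 3)
print(label_lookup_failed, third_by_position)
```

If this is what is happening, the lookup failure is separate from the `TypeError` raised during fitting.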

The TypeError doesn't appear until I attempt to fit the training data to the model.

I also get the same error when I split the data manually.

The data consists of a product name, a review and a rating out of 5. The full data set can be found here: https://d396qusza40orc.cloudfront.net/phoenixassets/amazon_baby.csv

A smaller sample of the data: https://drive.google.com/file/d/0B_fZt96_g2ZGYWtXWjhaVmhEZDQ/view?usp=sharing

Here is my code (Python 2.7):

def addWordCountDictionary(data):
    import string
    from collections import Counter
    import pandas as pd

    table = string.maketrans("","")

    # Add an empty dictionary to each product.
    data['wordCount'] = [dict() for x in range(len(data))]

    for index in range(len(data)):
        if not pd.isnull(data['review'][index]):
            review = data['review'][index]
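            # NOTE: str.translate returns a new string; the result is not kept here,
            # so punctuation is still present when the words are counted.
            review.translate(table, string.punctuation)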
            data['wordCount'][index] = dict(Counter(review.split()))
        else:
            data['wordCount'][index] = {'':0}

    return data

# Main function
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Read the data into a data frame
data = pd.read_csv("amazon_baby.csv")
data = data[0:10].copy()  # keep only the first 10 rows while debugging

# Generate data set with word count dictionary
data = addWordCountDictionary(data)
# Assign positive and negative sentiment based on the number of stars in the rating
data['sentiment'] = 0
data.loc[data['rating'] >= 4, 'sentiment'] = 1


dataFeatures = data['wordCount'] #[['name','review','rating','wordCount']]
Labels = data['sentiment']
trainFeatures, testFeatures, trainLabels, testLabels = train_test_split(dataFeatures,Labels,test_size=0.2, random_state=0)
#trainFeatures = dataFeatures[:8]
#testFeatures = dataFeatures[8:]
#trainLabels = Labels[:8]
#testLabels = Labels[8:]

sentimentModel = LogisticRegression(penalty='l2',C=1)
sentimentModel.fit(trainFeatures, trainLabels)
indices = np.argsort(sentimentModel.coef_)
print(indices)

Can anyone help explain/resolve this issue?
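
For context, a commonly used way to feed word-count dictionaries to scikit-learn is to convert them into a numeric matrix first, for example with `DictVectorizer`. `LogisticRegression.fit` expects numeric input, which would be consistent with the `float()` message above. The following is a minimal sketch on made-up reviews and labels, not a verified fix for the full data set.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up word-count dictionaries and labels, for illustration only.
train_dicts = [
    {'great': 1, 'love': 2},
    {'broke': 1, 'bad': 1},
    {'love': 1, 'comfortable': 1},
    {'bad': 2, 'returned': 1},
]
train_labels = [1, 0, 1, 0]

# Map each distinct word to a column; words absent from a review become 0.
vectorizer = DictVectorizer(sparse=True)
X_train = vectorizer.fit_transform(train_dicts)

model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, train_labels)

# Reuse the fitted vocabulary for unseen data; unknown words are ignored.
X_test = vectorizer.transform([{'love': 1, 'unseen': 3}])
prediction = model.predict(X_test)
print(X_train.shape, prediction)
```

With the variables from the question, the equivalent calls would presumably be `vectorizer.fit_transform(trainFeatures)` and `vectorizer.transform(testFeatures)`.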

theotheraussie
  • I can't reproduce this error because your code raises other errors; for example, `trainFeaturesStandard` is not defined, and `review.translate()` is apparently supposed to take one fewer argument than you've given it. It would help if you edit your code to be self-contained and ensure that it actually illustrates the error you're asking about. (It would also help if you could provide a smaller CSV file to illustrate the error, ideally with just a few lines.) – David Z Aug 12 '17 at 05:02
  • @DavidZ I've updated the code snippet to correct the undefined variable. The translate function works fine for me; it may be related to my version of Python (2.7). I've included an image of the first few lines of the file to assist – theotheraussie Aug 12 '17 at 07:07
  • OK, I assumed it was Python 3 because you used parentheses with `print()` and there was no other indication of the version. I'd suggest adding the [tag:python-2.7] tag and mentioning the version in the question, because that does seem to be important. (I can reproduce with 2.7) The image is not helpful; I'm saying you should offer a smaller download _instead_ of the full file. Extract just the first 5 lines or so and link to that excerpt, or even trim down each entry to just a few words and include that in the question directly. Also, the full stack trace would be quite helpful to include. – David Z Aug 12 '17 at 07:23
  • @DavidZ. I've tried to make it clearer in the question now. There is no stack trace because the program just stops when the exception is raised. I've added a link to a cut down version of the data. – theotheraussie Aug 12 '17 at 09:03
  • If you're not getting a stack trace, you should definitely mention that in the question because something rather strange is going on. I get a stack trace when I run your code. – David Z Aug 12 '17 at 17:40
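
The `translate` difference raised in the comments can be sketched as follows. In Python 2, `str.translate(table, deletechars)` takes the characters to delete as a second argument; in Python 3, `str.translate` takes a single table, and the deletions are encoded in that table through the three-argument form of `str.maketrans`. The example below is Python 3 only and uses a made-up review string.

```python
import string
from collections import Counter

review = "Very comfortable, looks great... would recommend!"  # made-up sample

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', string.punctuation)
cleaned = review.translate(table)  # translate returns a new string

word_counts = dict(Counter(cleaned.split()))
print(cleaned)
print(word_counts)
```

In either version the cleaned text is the return value of `translate`, so it has to be assigned before the words are counted.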
