
I have two numpy arrays, X_train and Y_train. The first has shape (700, 1000) and contains the values 0, 1, 2, 3, 4, and 10. The second has shape (700,) and contains the strings 'fresh' or 'rotten', since I'm working with the Rotten Tomatoes API. For some reason, when I execute:

nb = MultinomialNB()
nb.fit(X_train, Y_train)

I get:

ValueError: Unknown label type

I tried building a smaller pair of arrays:

print xs, '\n', ys

gives

[[0 0 0 0 1]
 [1 0 0 2 5]
 [3 2 5 5 0]
 [3 2 0 0 1]
 [1 5 1 0 0]]

['rotten' 'fresh' 'fresh' 'rotten' 'fresh']

and the multinomial NB fit gives no Unknown Label error. Any ideas on why this is happening?

I also checked the unique values in X_train, Y_train with numpy.unique and it doesn't seem like there are any weird or mistyped labels -- it's all 'fresh' or 'rotten'.

My code for generating X_train and Y_train:

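import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def make_xy(critics, vectorizer=None):
    stext = critics['quote'].tolist() # need to have a list
    if vectorizer is None: # compare to None by identity, not equality
        vectorizer = CountVectorizer(min_df=0)
    vectorizer.fit(stext)
    X = vectorizer.transform(stext).toarray() # dense word-count matrix
    Y = np.asarray(critics['fresh']) # labels: 'fresh' or 'rotten'
    return X[0:1000,0:1000], Y[0:1000] # this is X_train, Y_train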

where 'critics' is a pandas dataframe imported from a CSV file (https://www.dropbox.com/s/0lu5oujfm483wtr/critics.csv), and cleaned of any missing data:

critics = pd.read_csv('critics.csv')
critics = critics[~critics.quote.isnull()]
critics = critics[critics.fresh != 'none']
critics = critics[critics.quote.str.len() > 0]
covariance

2 Answers


The problem seems to be the dtype of Y. It looks like numpy didn't manage to figure out that it was a string, so it was set to a generic object dtype. If you change:
Y = np.asarray(critics['fresh']) to Y = np.asarray(critics['fresh'], dtype="|S6"), I think it should work.
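
To make the dtype difference concrete, here is a minimal sketch. It does not load critics.csv; the small object array is a hypothetical stand-in for np.asarray(critics['fresh']), and it assumes Python 3, where casting with str gives a unicode dtype (there, "|S6" would produce bytes instead of text).

```python
import numpy as np

# Hypothetical stand-in for np.asarray(critics['fresh']): string labels
# held in a generic object array.
y_object = np.array(['rotten', 'fresh', 'fresh', 'rotten', 'fresh'],
                    dtype=object)
print(y_object.dtype)  # object

# Cast to a fixed-width string dtype. The width is taken from the
# longest label ('rotten', 6 characters).
y_fixed = y_object.astype(str)
print(y_fixed.dtype)

# The label values themselves are unchanged by the cast.
print(np.unique(y_fixed))
```

Printing Y_train.dtype before calling nb.fit is a quick way to check whether this is what is happening in your case.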

M4rtini

I also faced the same problem. Numpy sometimes fails to detect the datatype of an array, so we specify it explicitly. See the numpy documentation for the full list of data types, choose the one that fits your requirement, and pass it via the "dtype=" argument.