I have two NumPy arrays, X_train and Y_train. The first has shape (700, 1000) and contains only the values 0, 1, 2, 3, 4, and 10. The second has shape (700,) and contains only the strings 'fresh' or 'rotten', since I'm working with the Rotten Tomatoes API. For some reason, when I execute:
nb = MultinomialNB()
nb.fit(X_train, Y_train)
I get:
ValueError: Unknown label type
I tried building a smaller pair of arrays:
print xs, '\n', ys
gives
[[0 0 0 0 1]
 [1 0 0 2 5]
 [3 2 5 5 0]
 [3 2 0 0 1]
 [1 5 1 0 0]]
['rotten' 'fresh' 'fresh' 'rotten' 'fresh']
and the multinomial NB fit gives no Unknown Label error. Any ideas on why this is happening?
I also checked the unique values in X_train, Y_train with numpy.unique and it doesn't seem like there are any weird or mistyped labels -- it's all 'fresh' or 'rotten'.
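One thing numpy.unique does not reveal is the dtype of the label array. The sketch below is only a diagnostic idea, not a confirmed cause: it compares a plain string array like the small ys above with an object-dtype array, which is how a pandas column of strings typically comes back from np.asarray. The arrays and the helper name describe_labels are hypothetical stand-ins, not the real data.

```python
import numpy as np

# Hypothetical stand-ins for the label arrays (not the real data).
ys_small = np.array(['rotten', 'fresh', 'fresh', 'rotten', 'fresh'])
ys_object = np.array(['rotten', 'fresh', 'fresh'], dtype=object)

def describe_labels(y):
    """Return the dtype kind ('U' = unicode string, 'O' = object) and the unique values."""
    return y.dtype.kind, np.unique(y).tolist()

# Same unique values, but the dtype kinds differ.
print(describe_labels(ys_small))
print(describe_labels(ys_object))
```

Running the same check on Y_train would show whether it differs from the small ys in dtype even though the unique values look identical.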
My code for generating X_train and Y_train:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def make_xy(critics, vectorizer=None):
    stext = critics['quote'].tolist()  # need to have a list
    if vectorizer is None:
        vectorizer = CountVectorizer(min_df=0)
    vectorizer.fit(stext)
    X = vectorizer.transform(stext).toarray()  # this is X
    Y = np.asarray(critics['fresh'])  # labels: 'fresh' or 'rotten'
    return X[0:1000, 0:1000], Y[0:1000]  # this is X_train, Y_train
where 'critics' is a pandas dataframe imported from a CSV file (https://www.dropbox.com/s/0lu5oujfm483wtr/critics.csv), and cleaned of any missing data:
import pandas as pd
critics = pd.read_csv('critics.csv')
critics = critics[~critics.quote.isnull()]
critics = critics[critics.fresh != 'none']
critics = critics[critics.quote.str.len() > 0]
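If the labels built from the dataframe column do turn out to have object dtype, one thing that could be tried is casting them to a fixed-width string dtype before fitting. This is a minimal sketch under that assumption. The helper name as_string_labels and the sample array are hypothetical, and whether the cast removes the ValueError would need to be checked against the installed scikit-learn version.

```python
import numpy as np

# Hypothetical labels, stored as object dtype the way a pandas string column usually arrives.
y_object = np.array(['fresh', 'rotten', 'fresh'], dtype=object)

def as_string_labels(y):
    """Return a copy of y with a fixed-width string dtype instead of object."""
    return np.asarray(y).astype(str)

y_str = as_string_labels(y_object)
print(y_object.dtype, '->', y_str.dtype)
```

Inside make_xy this would amount to applying the same cast to the array built from critics['fresh'].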