I am trying to learn some classification in scikit-learn, but I can't figure out what this error means.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data_frame = pd.read_csv('data.csv', header=0) 
data_in_numpy = data_frame.values 

c = CountVectorizer()
c.fit_transform(data_in_numpy.data)

This throws an error:

NotImplementedError: multi-dimensional sub-views are not implemented

How can I get around this issue? One record from my CSV file looks like:

Time   Directors    Actors   Rating   Label
123    Abc, Def     A, B,c    7.2      1

I suppose this error occurs because there is more than one value under the Directors or Actors column. Any help would be appreciated. Thanks,

Jhooma

1 Answer

According to the docstring, sklearn.feature_extraction.text.CountVectorizer will:

Convert a collection of text documents to a matrix of token counts

So then why, I wonder, are you inputting numerical values?

Try transforming only the strings (directors and actors):

data_frame['X'] = data_frame[['Directors', 'Actors']].apply(lambda x: ' '.join(x), axis=1)
data_in_numpy = data_frame['X'].values

First though, you might want to clean the data up by removing the commas.

data_frame['Directors'] = data_frame['Directors'].str.replace(',', ' ')
data_frame['Actors'] = data_frame['Actors'].str.replace(',', ' ')
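
Putting those steps together, here is a minimal sketch. It assumes a small made-up DataFrame in place of data.csv: the second row and its values are invented for illustration, and only the column names come from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for data.csv; only the column names come from the question.
data_frame = pd.DataFrame({
    "Time": [123, 95],
    "Directors": ["Abc, Def", "Ghi"],
    "Actors": ["A, B,c", "Abc, Xyz"],
    "Rating": [7.2, 6.5],
    "Label": [1, 0],
})

# Replace commas with spaces so the names are whitespace-separated.
for column in ["Directors", "Actors"]:
    data_frame[column] = data_frame[column].str.replace(",", " ", regex=False)

# Build one string per row: a 1-D collection of documents for the vectorizer.
data_frame["X"] = data_frame["Directors"] + " " + data_frame["Actors"]

c = CountVectorizer()
counts = c.fit_transform(data_frame["X"])  # sparse matrix, one row per record

# The learned tokens, in alphabetical order.
vocabulary = sorted(c.vocabulary_)
```

Note that with the default settings CountVectorizer lowercases the text and drops tokens shorter than two characters, so entries like A or B from the sample row will not appear in the vocabulary unless you change token_pattern.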
Alex
  • Now it throws AttributeError: 'numpy.ndarray' object has no attribute 'lower'. I can fit_transform one feature, but not more than one: c.fit_transform(d['Writer'].values) works, whereas c.fit_transform(d[['Actors', 'Directors']].values) raises AttributeError: 'numpy.ndarray' object has no attribute 'lower'. – Jhooma Dec 10 '16 at 15:02
  • So then the count vectorizer is expecting only one column of data. You should either do each separately or create a new df column to transform. Please see the changes I've made to my answer. – Alex Dec 11 '16 at 23:08
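
For the "do each separately" option mentioned in the last comment, one possible sketch is below. It fits one vectorizer per column and joins the results side by side. The sample rows are made up, and using scipy.sparse.hstack to combine the matrices is an assumption on my part, not something stated in the thread.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data; column names follow the question, values are invented.
d = pd.DataFrame({
    "Directors": ["Abc, Def", "Ghi"],
    "Actors": ["Tom, Ann", "Ann, Bob"],
})

# One vectorizer per column, each fed a 1-D column of strings.
actor_vectorizer = CountVectorizer()
director_vectorizer = CountVectorizer()

actor_counts = actor_vectorizer.fit_transform(d["Actors"])
director_counts = director_vectorizer.fit_transform(d["Directors"])

# Stack the two sparse matrices column-wise into a single feature matrix.
features = hstack([actor_counts, director_counts]).tocsr()
```

Keeping the vectorizers separate means an actor and a director with the same name get different feature columns, which the single combined column does not give you.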