I'm trying to use soundex to convert every word of a line to its hashed form, and then use scikit-learn to perform some machine learning on the result.
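
The `soundex` function itself is not shown in this question. As a rough illustration only, here is a minimal sketch of the standard American Soundex encoding applied word by word. This implementation and the sample `text` list are assumptions, not the code that produced the error below.

```python
# Hypothetical stand-in for the `soundex` helper used in this question.
# Digit codes for consonants under the standard American Soundex rules.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of `word` ('' if it has no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]  # pad with zeros, truncate to 4 characters


# Sample input (assumed): one string per line of text.
text = ["Robert Rupert", "Ashcraft Tymczak"]
train = [" ".join(soundex(w) for w in line.split()) for line in text]
print(train)  # ['R163 R163', 'A261 T522']
```

Each element of `train` is then a single space-separated string of codes, which is the kind of input `CountVectorizer.fit_transform` expects.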

The code goes:

from sklearn.feature_extraction.text import CountVectorizer

train = []
for line in text:
    # Replace every word with its soundex code, then rejoin into one string
    sound = [soundex(word) for word in line.split()]
    train.append(' '.join(sound))

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(real_train)

But when I run this, I get the following error:

X_train_counts = count_vect.fit_transform(real_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 710, in _count_vocab
analyze = self.build_analyzer()
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 226, in build_analyzer
tokenize = self.build_tokenizer()
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 203, in build_tokenizer
token_pattern = re.compile(self.token_pattern)
File "/usr/lib/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: unexpected end of pattern
user3666471
  • Your code does not show how `real_train` is created or what it contains. Can you post more code and data? – EdChum May 28 '14 at 07:10
  • Strange, it looks like the tokenization RE in `CountVectorizer` is rejected. Did you modify your scikit-learn installation? Which version is it? – Fred Foo May 28 '14 at 08:21
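
One way to follow up on the last comment is to check, with the standard library alone, whether a tokenization pattern compiles at all. The sketch below assumes the default `token_pattern` is `(?u)\b\w\w+\b`, as documented for `CountVectorizer`; the value stored in the installation from the traceback may differ. The helper name `pattern_compiles` is hypothetical.

```python
import re

# Assumed default token_pattern of CountVectorizer: tokens of 2+ word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def pattern_compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The assumed default pattern is valid and splits soundex-style codes as expected.
tokens = re.compile(DEFAULT_TOKEN_PATTERN).findall("R163 A261")
print(pattern_compiles(DEFAULT_TOKEN_PATTERN), tokens)
```

If `pattern_compiles(count_vect.token_pattern)` returns `False` on the affected machine, the pattern held by that `CountVectorizer` instance is not the assumed default, which would point to a modified or damaged installation rather than to the soundex preprocessing.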
