I was going to try using fuzzywuzzy with a tuned minimum-acceptable-score parameter. Essentially, it would check whether a word is in the vocabulary as-is; if not, it would ask fuzzywuzzy to pick the best fuzzy match, and accept that match into the list of tokens only if it scored at least a certain threshold.
If this isn't the best approach for dealing with a fair number of typos and slightly differently spelled but similar words, I'm open to suggestions.
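Roughly, the token repair I have in mind looks like the sketch below (the fuzzy_lookup helper and the stand-alone function form are just placeholders for illustration, not the final design; 80 is the same default score I use in the class further down):

from fuzzywuzzy import process

def fuzzy_lookup(token, vocabulary, min_score=80):
    # If the token is already in the vocabulary, keep it unchanged.
    if token in vocabulary:
        return token
    # Otherwise ask fuzzywuzzy for the closest vocabulary entry and
    # accept it only if its score clears the minimum threshold.
    match = process.extractOne(token, vocabulary, score_cutoff=min_score)
    return match[0] if match is not None else None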
The problem is that the subclass keeps complaining about an empty vocabulary, which doesn't make any sense, because a regular CountVectorizer works fine on the same data in the same part of my code.
It raises errors like this:
ValueError: empty vocabulary; perhaps the documents only contain stop words
What am I missing? I don't have it doing anything special yet, so it ought to behave just like a normal CountVectorizer:
import numpy
from typing import List

from sklearn.feature_extraction.text import CountVectorizer


class FuzzyCountVectorizer(CountVectorizer):
    def __init__(self, input='content', encoding='utf-8', decode_error='strict',
                 strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None,
                 token_pattern="(?u)\b\w\w+\b", ngram_range=(1, 1), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
                 dtype=numpy.int64, min_fuzzy_score=80):
        super().__init__(
            input=input, encoding=encoding, decode_error=decode_error, strip_accents=strip_accents,
            lowercase=lowercase, preprocessor=preprocessor, tokenizer=tokenizer, stop_words=stop_words,
            token_pattern=token_pattern, ngram_range=ngram_range, analyzer=analyzer, max_df=max_df,
            min_df=min_df, max_features=max_features, vocabulary=vocabulary, binary=binary, dtype=dtype)
        # self._trained = False
        self.min_fuzzy_score = min_fuzzy_score

    @staticmethod
    def remove_non_alphanumeric_chars(s: str) -> str:
        pass

    @staticmethod
    def tokenize_text(s: str) -> List[str]:
        pass

    def fuzzy_repair(self, sl: List[str]) -> List[str]:
        pass

    def fit(self, raw_documents, y=None):
        print('Running FuzzyTokenizer Fit')
        # TODO clean up input
        super().fit(raw_documents=raw_documents, y=y)
        self._trained = True
        return self

    def transform(self, raw_documents):
        print('Running Transform')
        # TODO clean up input
        # TODO fuzzy repair
        return super().transform(raw_documents=raw_documents)
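For reference, this is roughly how I'm calling it (the documents here are just stand-in data, not my real corpus); the plain CountVectorizer works on the same input, while the subclass fails with the error above:

docs = ['the quick brown fox', 'teh quick brown foxx']  # stand-in data

CountVectorizer().fit(docs)       # works fine
FuzzyCountVectorizer().fit(docs)  # ValueError: empty vocabulary; ...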