14

Is there a function to add new documents to an existing corpus? I've already generated my matrix, and I'm looking to periodically add to the table without re-crunching the whole shebang.

e.g.:

articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']
tfidf_vectorizer = TfidfVectorizer(
                        max_df=.8,
                        max_features=2000,
                        min_df=.05,
                        preprocessor=prep_text,
                        use_idf=True,
                        tokenizer=tokenize_text
                    )
tfidf_matrix = tfidf_vectorizer.fit_transform(articleList)

#### ADDING A NEW ARTICLE TO EXISTING SET?
bigger_tfidf_matrix = tfidf_vectorizer.fit_transform(['the last article I wanted to add'])
– Howard Zoopaloopa

2 Answers

16

You can access the `vocabulary_` attribute of your vectoriser directly, and you can access the `idf_` vector via `_tfidf._idf_diag`, so it would be possible to monkey-patch something like this:

import re 
import numpy as np
from scipy.sparse import dia_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

def partial_fit(self, X):
    max_idx = max(self.vocabulary_.values())
    for a in X:
        #update vocabulary_
        if self.lowercase: a = a.lower()
        tokens = re.findall(self.token_pattern, a)
        for w in tokens:
            if w not in self.vocabulary_:
                max_idx += 1
                self.vocabulary_[w] = max_idx

        # update idf_: invert the idf formula to recover document frequencies
        df = (self.n_docs + self.smooth_idf)/np.exp(self.idf_ - 1) - self.smooth_idf
        self.n_docs += 1
        # grow df to cover any newly added vocabulary (new entries start at 0)
        df.resize(len(self.vocabulary_), refcheck=False)
        for w in tokens:
            df[self.vocabulary_[w]] += 1
        idf = np.log((self.n_docs + self.smooth_idf)/(df + self.smooth_idf)) + 1
        self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))

TfidfVectorizer.partial_fit = partial_fit
articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']
vec = TfidfVectorizer()
vec.fit(articleList)
vec.n_docs = len(articleList)  # the patch relies on this manually tracked document count
vec.partial_fit(['the last text I wanted to add'])
vec.transform(['the last text I wanted to add']).toarray()

# array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
#          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
#          0.        ,  0.        ,  0.27448674,  0.        ,  0.43003652,
#          0.43003652,  0.43003652,  0.43003652,  0.43003652]])
– maxymoo
  • Thank you for taking the time to answer. I'm trying to use this as a search index, using cosine_similarity to generate a list of results by relevance. It would be nice to not have to re-crunch my entire corpus every time I wish to add a new document (a sketch of that workflow is included after this comment thread). – Howard Zoopaloopa Aug 24 '16 at 05:08
  • Hey Howard, I worked out how to update the `idf_`, check out my edited answer – maxymoo Aug 24 '16 at 23:46
  • Awesome! Thank you for the great response! – Howard Zoopaloopa Aug 31 '16 at 16:41
  • I know this was a while back, but any reason for the last 3 lines of partial_fit, i.e. `self._tfidf._df_diag`, `print((len(idf), len(idf)))` and `print(vec._tfidf_idf_diag.shape`? – Dawid Laszuk Dec 30 '17 at 00:05
  • good catch @DawidLaszuk i must have left these in from debugging, i've removed them now – maxymoo Jan 02 '18 at 03:31
  • Great answer @maxymoo, thanks! How can I build on this if I want to add sentences in several batches? For example, I have 1000 sentences: at the first round I fit 250, then add 250 more using `partial_fit`, then at the next round the next 250, and so on. How can one build on this great code to do it? I tried putting the code from the answer into a for loop without much success. Thanks – ayalaall Jan 15 '20 at 12:48
  • Sorry to unearth this old post, but I have encountered the same requirement (being able to partial_fit a TfidfVectorizer to account for unseen tokens). Just a small remark about the (very useful) accepted answer: it does not extend the vocabulary with n-grams if you want it to account for them. I will do my best to improve it, and will post a new answer if I succeed. – Pierre Massé May 21 '20 at 10:41
  • Hi @maxymoo, I've encountered a similar problem, but in which I need to calculate the similarity matrix of the documents without re-doing the whole calculation all over again. I don't use idf, so it's possible. Can you possibly help with that too? [re-calculate-similarity-matrix-given-new-documents](https://stackoverflow.com/questions/64442720/re-calculate-similarity-matrix-given-new-documents) – Avihay Oct 20 '20 at 19:44
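
For the search-index use case mentioned in the comments, here is a minimal sketch (not part of the original answer) of one way to grow a stored matrix and rank documents with cosine_similarity. It relies on the same sklearn internals as the answer above; `tfidf_matrix`, `new_doc` and `query` are illustrative names, and rows stored before an update keep the idf weighting they were computed with, which is the trade-off for not re-fitting the whole corpus.

from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.metrics.pairwise import cosine_similarity

# matrix for the existing corpus (one row per document)
tfidf_matrix = vec.transform(articleList)

# add one new document without re-fitting
new_doc = 'the last text I wanted to add'
vec.partial_fit([new_doc])

# pad the stored matrix with zero columns for any new vocabulary terms,
# then stack the new document's row onto it
n_new_cols = len(vec.vocabulary_) - tfidf_matrix.shape[1]
if n_new_cols > 0:
    padding = csr_matrix((tfidf_matrix.shape[0], n_new_cols))
    tfidf_matrix = hstack([tfidf_matrix, padding])
tfidf_matrix = vstack([tfidf_matrix, vec.transform([new_doc])])

# rank all stored documents against a query by cosine similarity
query = 'text to add'
scores = cosine_similarity(vec.transform([query]), tfidf_matrix).ravel()
ranking = scores.argsort()[::-1]  # document indices, most relevant first
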
0

I believe the given (excellent) answer has a couple of bugs: the document frequency should only be updated once per document, even if a token appears in it multiple times, and new tokens need to be assigned indices consistent with sklearn's internal (alphabetically sorted) vocabulary ordering:

def _partial_fit(self, X):
        X = X.copy()
        for doc in X:
            if self.lowercase:
                doc = doc.lower()
            tokens = re.findall(self.token_pattern, doc)
            # my_stop_words: the same stop-word list the vectoriser was fitted with
            tokens = [token for token in tokens if token not in my_stop_words]
            indices_to_insert = []
            for w in tokens:
                # We now need to update the vocabulary with the new tokens
                if w not in self.vocabulary_:
                    # temporary placeholder in the dict
                    self.vocabulary_[w] = -1
                    # create a list in alphabetical order
                    # each token's value in the dict is equal to its place in the list
                    # this aligns with the internal dict of sklearn's TfidfVectorizer
                    tmp_keys = sorted(list(self.vocabulary_.keys()))
                    # the dictionary must be in order it has seen the tokens
                    tmp_dict = {tmp_keys[i]: i for i in range(len(tmp_keys))}
                    # Include new tokens in vocab
                    self.vocabulary_ = {k: tmp_dict[k] for k in self.vocabulary_}
                    # Update number of features by 1 for data validation
                    self._tfidf.n_features_in_ += 1
                    # We keep a list of all new indices of new tokens
                    indices_to_insert.append(self.vocabulary_[w])

            # update document frequency
            doc_frequency = (self.n_docs + self.smooth_idf) / np.exp(
                self.idf_ - 1
            ) - self.smooth_idf
            # the new token indices must be added
            for index_to_insert in indices_to_insert:
                doc_frequency = np.insert(doc_frequency, index_to_insert, 0)
            self.n_docs += 1
            # document frequency is not dependent on number of times in doc, only if
            # it appears at all
            for w in set(tokens):
                doc_frequency[self.vocabulary_[w]] += 1

            # update internal inverse document frequency
            idf = (
                np.log(
                    (self.n_docs + self.smooth_idf) / (doc_frequency + self.smooth_idf)
                )
                + 1
            )

            # these values are updated to get correct values from the `transform`
            # function
            self._tfidf.idf_ = idf
            self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))
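
A minimal usage sketch for this version, assuming `my_stop_words` is the same stop-word list the vectoriser was built with (sklearn's built-in English list is used here purely as an illustration) and that, as in the accepted answer, `n_docs` is tracked manually:

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

my_stop_words = ENGLISH_STOP_WORDS          # assumption: same stop words used at fit time
TfidfVectorizer.partial_fit = _partial_fit  # monkey-patch, as in the accepted answer

articleList = ['here is some text blah blah', 'another text object',
               'more foo for your bar right now']
vec = TfidfVectorizer(stop_words=list(my_stop_words))
vec.fit(articleList)
vec.n_docs = len(articleList)               # the patch relies on this counter
vec.partial_fit(['the last text I wanted to add'])
vec.transform(['the last text I wanted to add']).toarray()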