
I used TfidfVectorizer to create a tf-idf matrix:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer(smooth_idf=False, stop_words=stopwords.words('english'))
tfidf = transformer.fit_transform(raw_documents=sentences)

Now I want to transform each element of my sentences list into the list of tokens that the TfidfVectorizer actually used. I tried to extract it directly from the tfidf object with this function:

import numpy as np

def get_token(id, transformer_name, tfidf_obj):
    # Non-zero columns of row `id` mark the features present in that sentence
    return np.array(transformer_name.get_feature_names())[
        tfidf_obj[id].toarray().reshape((tfidf_obj.shape[1],)) != 0]

where id is the index of a sentence. In this function I extract the given row from the tf-idf matrix, find its non-zero elements, and pick the corresponding entries of transformer_name.get_feature_names(). It looks too complex =), and this solution is also very slow =/

Is there any way to get the tokens using the TfidfVectorizer's own preprocessing and tokenization functions?
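For reference, scikit-learn does expose the vectorizer's internal pipeline: build_analyzer() returns the callable that applies preprocessing, tokenization, and stop-word filtering exactly as the vectorizer does. A sketch with made-up sentences (no stop_words configured here, so only lowercasing and the default token pattern apply):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["The cat sat on the mat.", "Dogs chase cats!"]

transformer = TfidfVectorizer(smooth_idf=False)
transformer.fit(sentences)

# The analyzer is the vectorizer's own preprocessing + tokenization
# chain; it lowercases, applies the token pattern, and would drop
# stop words if stop_words were set
analyze = transformer.build_analyzer()

tokenized = [analyze(s) for s in sentences]
```

Unlike reading tokens back from the tf-idf matrix, the analyzer preserves sentence order and duplicates, and it also returns tokens that were pruned from the fitted vocabulary (e.g. by min_df/max_df), so intersect with transformer.vocabulary_ if only fitted features are wanted.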

Slavka
