I used TfidfVectorizer to create a tf-idf matrix:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer(smooth_idf=False, stop_words=stopwords.words('english'))
tfidf = transformer.fit_transform(raw_documents=sentences)
Now I want to transform each element of my sentences list into the list of tokens that the TfidfVectorizer actually used. I tried to extract them directly from the tfidf object with this function:
import numpy as np

def get_token(id, transformer_name, tfidf_obj):
    # Keep the feature names whose tf-idf weight in row `id` is non-zero.
    feature_names = np.array(transformer_name.get_feature_names())
    row = tfidf_obj[id].toarray().reshape((tfidf_obj.shape[1],))
    return feature_names[row != 0]
where id is the index of a sentence. In this function I extract the given row from the tf-idf matrix, find its non-zero elements, and pick the corresponding entries from transformer_name.get_feature_names(). It looks too complex =), and it also runs very slowly =/
Is there any way to get the tokens using the TfidfVectorizer's own preprocessing and tokenization functions?
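For context, here is a minimal sketch of the kind of thing I'm hoping for. It assumes the vectorizer's build_analyzer() method (which, as I understand it, returns the combined preprocessing + tokenization + stop-word-filtering callable that the vectorizer applies internally) is the right tool; the sample sentences are made up, and I use the built-in stop_words='english' list here just to keep the snippet self-contained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
]

vectorizer = TfidfVectorizer(smooth_idf=False, stop_words='english')
vectorizer.fit(sentences)

# build_analyzer() returns the callable the vectorizer itself uses:
# lowercasing, token-pattern splitting, stop-word removal, n-gram building.
analyzer = vectorizer.build_analyzer()

# One token list per sentence, without touching the sparse matrix at all.
tokens_per_sentence = [analyzer(s) for s in sentences]
```

Note that, unlike my row-slicing function above, this keeps duplicate tokens and preserves their order in the sentence, which may or may not be what is wanted.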