I used TfidfVectorizer to create a tf-idf matrix:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer(smooth_idf=False, stop_words=stopwords.words('english'))
tfidf = transformer.fit_transform(raw_documents=sentences)
Now I want to transform each element of my sentences list into the list of tokens that the TfidfVectorizer actually used. I tried to extract them directly from the tfidf object with this function:
import numpy as np

def get_token(id, transformer_name, tfidf_obj):
    # Keep the feature names whose tf-idf weight in row `id` is non-zero.
    feature_names = np.array(transformer_name.get_feature_names())
    row = tfidf_obj[id].toarray().reshape((tfidf_obj.shape[1],))
    return feature_names[row != 0]
where id is the index of a sentence. In this function I extract the given row from the tf-idf matrix, find its non-zero elements, and pick the corresponding entries from transformer_name.get_feature_names(). It looks too complex =), and it also runs very slowly =/
Is there any way to get the tokens using the TfidfVectorizer's own preprocessing and tokenization functions?
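For context, here is a minimal sketch of the kind of thing I'm hoping for. It assumes the vectorizer's build_analyzer() method (which, as I understand it, returns the combined preprocessing + tokenization + stop-word-filtering callable that the vectorizer applies internally) is the right tool; the sample sentences are made up, and I use the built-in stop_words='english' list here just to keep the snippet self-contained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
]

vectorizer = TfidfVectorizer(smooth_idf=False, stop_words='english')
vectorizer.fit(sentences)

# build_analyzer() returns the callable the vectorizer itself uses:
# lowercasing, token-pattern splitting, stop-word removal, n-gram building.
analyzer = vectorizer.build_analyzer()

# One token list per sentence, without touching the sparse matrix at all.
tokens_per_sentence = [analyzer(s) for s in sentences]
```

Note that, unlike my row-slicing function above, this keeps duplicate tokens and preserves their order in the sentence, which may or may not be what is wanted.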