I'm trying to apply scikit-learn to natural language processing, and I'm starting by reading some tutorials. I found this one, http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/, which explains how to get TF-IDF scores from a set of documents.

But I have a question: TF-IDF is supposed to depend on the term, the document containing that term, and the collection of all documents being analyzed.

For example, in a collection of two documents, A and B, the term 'horse' should get one TF-IDF score when its term frequency is taken from document A and a different score when it is taken from document B.

How can I compute the TF-IDF of a term with respect to a specific document using scikit-learn?

1 Answer

In the tutorial you mentioned, TF-IDF is calculated as:

tfidf_matrix =  tf.fit_transform(corpus)

Quote: "if we look at tfidf_matrix we’d expect it to be a 208 x 498254 matrix, one row per episode, one column per phrase". So the TF-IDF of each phrase is different for each episode (text) in this matrix, just as you expected.

The matrix element tfidf_matrix[document, phrase] is the TF-IDF value of that particular phrase in that particular document, computed against the whole corpus (all documents).
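To make this concrete, here is a minimal sketch with a made-up two-document corpus. It assumes tf is a scikit-learn TfidfVectorizer with default settings (the tutorial configures its own vectorizer, so the actual numbers there will differ), and it uses the fitted vocabulary_ mapping to find the column for the term 'horse'.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
corpus = [
    "the horse ran and the horse jumped",  # document A (row 0)
    "the horse slept near the barn",       # document B (row 1)
]

# Assumption: default TfidfVectorizer settings, not the tutorial's.
tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(corpus)  # sparse, one row per document

# vocabulary_ maps each term to its column index in the matrix.
col = tf.vocabulary_["horse"]

score_a = tfidf_matrix[0, col]  # TF-IDF of 'horse' in document A
score_b = tfidf_matrix[1, col]  # TF-IDF of 'horse' in document B

print("horse in A:", score_a)
print("horse in B:", score_b)
```

The two scores differ because 'horse' occurs twice in document A and once in document B, and each row is normalized separately. So to get the score of a term in a specific document, pick that document's row and the term's column.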

CrazyElf