
I am using LinearSVC (scikit-learn) to classify text into news categories such as sports, health, world, technology, and lifestyle. For a given piece of text with 5 occurrences of the word "windows" and 3 occurrences of the word "machine", it is not classified as technology. If I take the same text and increase the occurrences of "windows" to 12 and "machine" to 10, it is classified as technology.

So, is there a way to increase the importance of words that are relevant to a class in LinearSVC?

1 Answer


You are basically looking for TF-IDF. Here TF stands for term frequency, i.e. (count of a term in a document) / (total number of terms in the document). This tells you which terms are most frequent in a document. However, some terms that occur less frequently may be more important for classification (that is, they should carry more weight). To account for that, you include the inverse document frequency (IDF), calculated as log(total number of documents / number of documents containing a given term, say 'x').

Finally, you multiply TF by IDF to get the TF-IDF value of the term.
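The two formulas above can be sketched in plain Python as follows. This is only a minimal illustration: the toy corpus and the helper names `tf`, `idf`, and `tfidf` are made up for this example, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    "windows machine update windows".split(),
    "match score team win".split(),
    "team machine learning".split(),
]

def tf(term, doc):
    # Term frequency: count of the term / total number of terms in the document.
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(total documents / documents containing the term).
    # Assumes the term appears in at least one document (otherwise this divides by zero).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    # TF-IDF is the product of the two quantities.
    return tf(term, doc) * idf(term, corpus)

for term in ("windows", "machine"):
    print(term, round(tfidf(term, docs[0], docs), 4))
```

Note that scikit-learn's `TfidfVectorizer` does not use exactly these formulas by default: it applies a smoothed IDF and L2-normalizes each document vector, so its numbers will differ from this sketch.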

Here is a short example at this link.

Here is an example using scikit-learn


Gambit1614
  • I am already using a TF-IDF vectorizer, but I need to increase the weight of some specific terms. Is it possible to explicitly increase the importance of some words? – Shubham Garg Jan 09 '18 at 09:29
  • I think you need to tune the classifier's parameters (using grid search, etc.) instead of modifying the data itself. I can explain it better if you could post your code. – Gambit1614 Jan 09 '18 at 10:22
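
The grid-search suggestion in the last comment might look roughly like the sketch below. It is not the asker's code (which was never posted): the toy texts, labels, and parameter grid are assumptions chosen for illustration. `sublinear_tf` is included because it replaces the raw term count with 1 + log(count), which dampens the effect of how many times a word is repeated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, made up for illustration only.
texts = [
    "windows machine software update", "new laptop processor machine",
    "software bug fix released for windows", "phone chip and machine learning",
    "team wins the football match", "player scores in final game",
    "coach praises team after match", "tennis player wins the final",
]
labels = ["technology"] * 4 + ["sports"] * 4

# Vectorizer and classifier in one pipeline so both are tuned together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC(random_state=0)),
])

# Hypothetical grid: the values here are placeholders, not recommendations.
param_grid = {
    "tfidf__sublinear_tf": [False, True],
    "svc__C": [0.1, 1.0, 10.0],
    "svc__class_weight": [None, "balanced"],
}

# 2-fold cross-validation only because the toy data set is tiny.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["windows machine"]))
```

On real data you would use more folds and a larger grid. This tunes how the model weighs all terms; it does not single out specific words.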