How to include frequency factor in Linear SVC?

Question

I am using LinearSVC (scikit-learn) to classify text into news categories such as sports, health, world, technology, and lifestyle. Suppose a given piece of text has 5 occurrences of the word "windows" and 3 occurrences of the word "machine": it is not classified as technology. But if I take the same text and increase the occurrences of "windows" to 12 and "machine" to 10, it is classified as technology.

So is there a way to increase the importance of a word that is relevant to a class in LinearSVC?

score 0 · Answer 1 · answered Jan 09 '18 at 07:38

You are basically looking for TF-IDF. Here TF stands for term frequency, i.e. (count of a term in a document) / (total number of terms in the document). This identifies the most frequent terms in a document. However, some terms that occur less frequently may be more important for classification (that is, they should carry more weight). To capture that, you include the inverse document frequency (IDF), calculated as log(total number of documents / number of documents containing a given term, say 'x').

Finally, you multiply the TF and IDF values to get the TF-IDF of the term.
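To make the formula concrete, here is a minimal pure-Python sketch that follows the definitions above literally. The `tf_idf` helper and the three toy documents are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each document by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # TF = count / total terms, IDF = log(N / df)
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, already split into tokens.
docs = [
    "windows machine windows update".split(),
    "football match score".split(),
    "machine learning match".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "windows" gets a higher weight than "machine": it occurs more often there and appears in fewer documents overall.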

Here is a short example at this link.

Here is an example using scikit-learn
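A minimal sketch of what such a scikit-learn setup might look like, assuming a `TfidfVectorizer` feeding a `LinearSVC` through a pipeline. The toy texts and labels are invented for illustration and are not the asker's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training set; replace with real articles and categories.
texts = [
    "windows machine update released",
    "new machine runs windows software",
    "team wins football match",
    "football player scores in match",
]
labels = ["technology", "technology", "sports", "sports"]

# TfidfVectorizer turns raw text into TF-IDF features; LinearSVC classifies them.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

pred = model.predict(["windows machine review"])
print(pred[0])
```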


I am already using TfidfVectorizer, but I need to increase the weight of some specific terms. Is it possible to explicitly increase the importance of some words? — Shubham Garg, Jan 09 '18 at 09:29
I think you need to tune the parameters of the classifier (using grid search, etc.) instead of modifying the data itself. I can explain it better if you post your code. — Gambit1614, Jan 09 '18 at 10:22
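As a rough sketch of the grid-search suggestion, assuming a `Pipeline` of `TfidfVectorizer` and `LinearSVC`: the parameter grid below is only an illustrative guess at values worth trying, and the toy corpus is hypothetical. `sublinear_tf=True` replaces the raw term frequency with 1 + log(tf), which changes how strongly repeated words count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; replace with the real news articles and labels.
texts = [
    "windows machine update released",
    "new machine runs windows software",
    "windows laptop machine review",
    "team wins football match",
    "football player scores in match",
    "match ends with late football goal",
]
labels = ["technology"] * 3 + ["sports"] * 3

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Illustrative values only; the useful ranges depend on the data.
param_grid = {
    "tfidf__sublinear_tf": [False, True],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation over every parameter combination.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```

After fitting, `search.predict(...)` uses the best parameter combination found, refit on all of the training data.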
