
Is there any tool available with which I can prune high-frequency and low-frequency terms from my dataset?

Kashif Khan

2 Answers


A commonly used algorithm for this would be Grubbs' test. I don't know of a Java implementation off-hand, but if you are willing to do the preprocessing in a different language, the outliers package in R contains, among other things, Grubbs' test. To eliminate multiple outliers you can simply apply Grubbs' test repeatedly.
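
For illustration, here is a rough sketch of repeated Grubbs' pruning over term frequencies in Java. It assumes Apache Commons Math 3 on the classpath for the t-distribution quantile; the class and method names are my own and just illustrative:

```java
import org.apache.commons.math3.distribution.TDistribution;

import java.util.ArrayList;
import java.util.List;

public class GrubbsPruning {

    // Two-sided Grubbs' statistic: largest absolute deviation from the mean, in standard deviations.
    static double grubbsStatistic(List<Double> xs) {
        double mean = xs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = xs.stream().mapToDouble(x -> (x - mean) * (x - mean)).sum() / (xs.size() - 1);
        double sd = Math.sqrt(var);
        return xs.stream().mapToDouble(x -> Math.abs(x - mean) / sd).max().orElse(0.0);
    }

    // Critical value for sample size n at significance level alpha (two-sided test).
    static double grubbsCritical(int n, double alpha) {
        TDistribution t = new TDistribution(n - 2);
        double tc = t.inverseCumulativeProbability(1.0 - alpha / (2.0 * n));
        return ((n - 1) / Math.sqrt(n)) * Math.sqrt(tc * tc / (n - 2 + tc * tc));
    }

    // Repeatedly drop the most extreme frequency until the test no longer rejects.
    static List<Double> pruneOutliers(List<Double> termFrequencies, double alpha) {
        List<Double> xs = new ArrayList<>(termFrequencies);
        while (xs.size() > 3 && grubbsStatistic(xs) > grubbsCritical(xs.size(), alpha)) {
            double mean = xs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            Double worst = xs.stream()
                    .max((a, b) -> Double.compare(Math.abs(a - mean), Math.abs(b - mean)))
                    .get();
            xs.remove(worst);
        }
        return xs;
    }
}
```

(In R the equivalent would just be a loop around the outliers package's grubbs.test.)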

Edit:

I just saw that I missed the text classification tag. If you just want to keep overly frequent terms from skewing your results, TF-IDF might be interesting to you. It does not, of course, reduce dimensionality.
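
The plain tf·idf weighting is easy to sketch by hand; here is a minimal, illustrative Java version (the class and method names are my own):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {

    // tf-idf weights for one document: raw term frequency times log(N / document frequency).
    static Map<String, Double> weights(List<String> docTokens,
                                       Map<String, Integer> docFreq,
                                       int docCount) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : docTokens) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> tfidf = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);    // number of docs containing the term
            double idf = Math.log((double) docCount / df);   // rare terms get a higher weight
            tfidf.put(e.getKey(), e.getValue() * idf);
        }
        return tfidf;
    }
}
```

A term that occurs in nearly every document gets an idf close to zero, so very frequent terms stop dominating, but every term still keeps its dimension.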

DoHe
  • Thanks, I am working in Weka; it provides all the functionality. However, I was not sure about its pruning, which is why I asked for a tool to only preprocess the documents so I can analyse the pruning results. – Kashif Khan Feb 03 '14 at 13:54
  • As far as I know, Weka is a well-established data mining tool. I think there is little reason to distrust its pruning. Is there something odd in its results that causes your distrust? – DoHe Feb 03 '14 at 17:10
  • Well, Weka is a very good tool, but I don't understand a few things (maybe because I am new to it). For example, with setWordsToKeep() in the StringToWordVector class we choose how many words to keep in the vocabulary, but how would I know how many features or words are in my dataset? That is, how would I know the number of features/words in a dataset of 20K docs in order to fill in the setWordsToKeep() parameter of StringToWordVector? – Kashif Khan Feb 03 '14 at 18:47
  • According to [the documentation](http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html), there is an option (-M) to specify a minimum term frequency. Furthermore, -stopwords supports the use of a custom stopword list, and -S a default list of stopwords (see the sketch below). – DoHe Feb 07 '14 at 16:54
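
To make those options concrete, a rough sketch of configuring the filter from Java code might look like this (the file name and parameter values are placeholders, and I am going from the documented API, so double-check against your Weka version):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PruneWithWeka {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("docs.arff");   // placeholder path to your documents

        StringToWordVector filter = new StringToWordVector();
        filter.setWordsToKeep(1000);      // -W: roughly how many words to keep
        filter.setMinTermFreq(2);         // -M: prune terms occurring fewer than 2 times
        filter.setLowerCaseTokens(true);
        filter.setIDFTransform(true);     // optional tf-idf style weighting
        filter.setTFTransform(true);
        filter.setInputFormat(raw);

        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println("Resulting attributes: " + vectors.numAttributes());
    }
}
```

Printing vectors.numAttributes() after filtering is also one way to answer the earlier question of how many features/words your 20K documents actually produce.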

Stop-word removal is a common technique for eliminating (very) high-frequency words in natural language processing.

Low-frequency words are usually interesting. Do you actually want to eliminate them?
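
As a trivial illustration of the idea (the stop list here is far too small to be realistic):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // tiny illustrative stop list; real lists contain hundreds of words
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "is", "of", "and", "to", "in");

    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "quick", "fox", "is", "in", "the", "garden");
        System.out.println(removeStopWords(tokens));   // [quick, fox, garden]
    }
}
```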

phs