Is there any tool available with which I can prune high-frequency and low-frequency terms from my dataset?
- Multiple tools are available. Which programming language are you using? – Chthonic Project Feb 01 '14 at 15:35
- @ChthonicProject I am working in Java. – Kashif Khan Feb 01 '14 at 16:25
2 Answers
A commonly used algorithm for this would be Grubbs' test. I don't know of a Java implementation off-hand, but if you are willing to do the preprocessing in a different language, the outliers package in R contains, among other things, Grubbs' test. To eliminate multiple outliers you can simply apply Grubbs' test repeatedly.
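As an illustration (not part of the original answer), Grubbs' test is also straightforward to implement directly in Java on top of Apache Commons Math; using that library is my assumption, since only the R outliers package is mentioned above. The test statistic is G = max|x_i - x̄| / s, compared against a critical value derived from the t-distribution. A minimal sketch:

```java
import org.apache.commons.math3.distribution.TDistribution;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class GrubbsPruner {

    // Two-sided Grubbs' test: returns true if the most extreme value in
    // `values` (e.g. per-term document frequencies) is an outlier at
    // significance level `alpha`.
    static boolean hasOutlier(double[] values, double alpha) {
        int n = values.length;
        if (n < 3) return false; // Grubbs' test needs at least 3 observations

        DescriptiveStatistics stats = new DescriptiveStatistics(values);
        double mean = stats.getMean();
        double sd = stats.getStandardDeviation();
        if (sd == 0.0) return false; // all values identical, no outlier

        // Test statistic: largest absolute deviation in units of the sample sd
        double g = 0.0;
        for (double v : values) {
            g = Math.max(g, Math.abs(v - mean) / sd);
        }

        // Critical value based on the t-distribution with n-2 degrees of freedom
        double t = new TDistribution(n - 2)
                .inverseCumulativeProbability(1.0 - alpha / (2.0 * n));
        double gCrit = ((n - 1) / Math.sqrt(n))
                * Math.sqrt((t * t) / (n - 2 + t * t));
        return g > gCrit;
    }
}
```

Repeatedly removing the most extreme value while hasOutlier(...) returns true gives the iterative variant mentioned above.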
Edit:
I just saw that I missed the text-classification tag. If you just want to keep very frequent terms from skewing your results, TF-IDF could be interesting to you. It does not reduce dimensionality, of course.
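As a concrete, hedged illustration of the TF-IDF route within Weka itself, here is a minimal Java sketch; the file name documents.arff and its layout (a string attribute holding the document text) are assumptions made for the example:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TfIdfExample {
    public static void main(String[] args) throws Exception {
        // Load a dataset whose documents are stored in a string attribute
        Instances raw = new DataSource("documents.arff").getDataSet();

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true); // raw term counts instead of 0/1 presence
        filter.setTFTransform(true);      // log(1 + tf) term-frequency weighting
        filter.setIDFTransform(true);     // down-weight terms occurring in many documents
        filter.setLowerCaseTokens(true);
        filter.setInputFormat(raw);

        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println("Attributes after filtering: " + vectors.numAttributes());
    }
}
```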

DoHe
- Thanks. I am working in Weka, which provides all these functionalities; however, I was not sure about its pruning, which is why I asked for a tool to only preprocess the documents so I can analyse the pruning results. – Kashif Khan Feb 03 '14 at 13:54
- As far as I know, Weka is a well-established data mining tool, so I think there is little reason to distrust its pruning. Is there something odd about its results that causes your distrust? – DoHe Feb 03 '14 at 17:10
- Weka is a very good tool, but I don't understand a few things (maybe because I am new to it). For example, in the StringToWordVector class the setWordsToKeep() method selects how many words to keep in the vocabulary, but how would I know how many features or words are in my dataset? That is, how would I know the number of features/words in a dataset of 20K docs in order to fill in setWordsToKeep() of StringToWordVector? – Kashif Khan Feb 03 '14 at 18:47
- Taken from [here](http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html), there should be an option (-M) to specify a minimum term frequency. Furthermore, -stopwords supports the use of a custom stopword list and -S a default list of stopwords (see the sketch below). – DoHe Feb 07 '14 at 16:54
Stop-word removal is a common technique for eliminating (very) high-frequency words in natural language processing.
Low-frequency words are usually interesting. Do you actually want to eliminate them?
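Outside of Weka, stop-word removal is just a set lookup over the tokens. A minimal sketch in plain Java; the stop-word list here is a tiny illustrative stand-in, not a real list such as Weka's default one:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // Tiny illustrative stop-word list; real lists contain hundreds of entries
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "and", "or", "of", "to", "in", "on", "is"));

    // Lower-case each token and drop it if it is a stop word
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .map(String::toLowerCase)
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "sat", "on", "the", "mat");
        System.out.println(removeStopWords(tokens)); // prints [cat, sat, mat]
    }
}
```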

phs
- I have seen that in some papers the authors prune low-frequency words, i.e. words that occur in only 3 documents in the entire dataset, etc. – Kashif Khan Feb 03 '14 at 13:50