
Is there any tool available with which I can prune high-frequency and low-frequency terms from my dataset?

Kashif Khan

2 Answers


A commonly used algorithm for this would be Grubbs' test. I don't know of a Java implementation off-hand, but if you are willing to do the preprocessing in a different language, the outliers package in R contains, among other things, Grubbs' test. To eliminate multiple outliers you can simply apply Grubbs' test repeatedly.
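
For illustration, here is a rough sketch of repeated Grubbs' pruning over term frequencies in Java. It assumes Apache Commons Math 3 on the classpath for the t-distribution quantile; the class and method names are my own and just illustrative:

```java
import org.apache.commons.math3.distribution.TDistribution;

import java.util.ArrayList;
import java.util.List;

public class GrubbsPruning {

    // Two-sided Grubbs' statistic: largest absolute deviation from the mean, in standard deviations.
    static double grubbsStatistic(List<Double> xs) {
        double mean = xs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = xs.stream().mapToDouble(x -> (x - mean) * (x - mean)).sum() / (xs.size() - 1);
        double sd = Math.sqrt(var);
        return xs.stream().mapToDouble(x -> Math.abs(x - mean) / sd).max().orElse(0.0);
    }

    // Critical value for sample size n at significance level alpha (two-sided test).
    static double grubbsCritical(int n, double alpha) {
        TDistribution t = new TDistribution(n - 2);
        double tc = t.inverseCumulativeProbability(1.0 - alpha / (2.0 * n));
        return ((n - 1) / Math.sqrt(n)) * Math.sqrt(tc * tc / (n - 2 + tc * tc));
    }

    // Repeatedly drop the most extreme frequency until the test no longer rejects.
    static List<Double> pruneOutliers(List<Double> termFrequencies, double alpha) {
        List<Double> xs = new ArrayList<>(termFrequencies);
        while (xs.size() > 3 && grubbsStatistic(xs) > grubbsCritical(xs.size(), alpha)) {
            double mean = xs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            Double worst = xs.stream()
                    .max((a, b) -> Double.compare(Math.abs(a - mean), Math.abs(b - mean)))
                    .get();
            xs.remove(worst);
        }
        return xs;
    }
}
```

(In R the equivalent would just be a loop around the outliers package's grubbs.test.)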

Edit:

I just saw that I missed the text classification tag. If you just want to keep overly frequent terms from skewing your results, TF-IDF might be interesting to you. It does not, of course, reduce dimensionality.
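
The plain tf·idf weighting is easy to sketch by hand; here is a minimal, illustrative Java version (the class and method names are my own):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {

    // tf-idf weights for one document: raw term frequency times log(N / document frequency).
    static Map<String, Double> weights(List<String> docTokens,
                                       Map<String, Integer> docFreq,
                                       int docCount) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : docTokens) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> tfidf = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);    // number of docs containing the term
            double idf = Math.log((double) docCount / df);   // rare terms get a higher weight
            tfidf.put(e.getKey(), e.getValue() * idf);
        }
        return tfidf;
    }
}
```

A term that occurs in nearly every document gets an idf close to zero, so very frequent terms stop dominating, but every term still keeps its dimension.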

DoHe
  • Thanks, I am working in Weka; it provides all the functionality. However, I was not sure about its pruning, which is why I asked for a tool to only preprocess the documents so I can analyse the pruning results. – Kashif Khan Feb 03 '14 at 13:54
  • As far as I know, Weka is a well-established data mining tool. I think there is little reason to distrust its pruning. Is there something odd in its results that causes your distrust? – DoHe Feb 03 '14 at 17:10
  • Well, Weka is a very good tool, but I don't understand a few things (maybe because I am new to it). For example, with setWordsToKeep() in the StringToWordVector class we choose how many words to keep in the vocabulary, but how would I know how many features or words are in my dataset? That is, how would I know the number of features/words in a dataset of 20K docs in order to fill in the setWordsToKeep() parameter of StringToWordVector? – Kashif Khan Feb 03 '14 at 18:47
  • According to [the documentation](http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html), there is an option (-M) to specify a minimum term frequency. Furthermore, -stopwords supports the use of a custom stopword list, and -S a default list of stopwords (see the sketch below). – DoHe Feb 07 '14 at 16:54
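
To make those options concrete, a rough sketch of configuring the filter from Java code might look like this (the file name and parameter values are placeholders, and I am going from the documented API, so double-check against your Weka version):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PruneWithWeka {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("docs.arff");   // placeholder path to your documents

        StringToWordVector filter = new StringToWordVector();
        filter.setWordsToKeep(1000);      // -W: roughly how many words to keep
        filter.setMinTermFreq(2);         // -M: prune terms occurring fewer than 2 times
        filter.setLowerCaseTokens(true);
        filter.setIDFTransform(true);     // optional tf-idf style weighting
        filter.setTFTransform(true);
        filter.setInputFormat(raw);

        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println("Resulting attributes: " + vectors.numAttributes());
    }
}
```

Printing vectors.numAttributes() after filtering is also one way to answer the earlier question of how many features/words your 20K documents actually produce.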

Stop-word removal is a common technique for eliminating (very) high-frequency words in natural language processing.

Low-frequency words are usually interesting. Do you actually want to eliminate them?
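
As a trivial illustration of the idea (the stop list here is far too small to be realistic):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // tiny illustrative stop list; real lists contain hundreds of words
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "is", "of", "and", "to", "in");

    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "quick", "fox", "is", "in", "the", "garden");
        System.out.println(removeStopWords(tokens));   // [quick, fox, garden]
    }
}
```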

phs