4

Almost all of the examples are based on numbers. In text documents, I have words instead of numbers.

So can you show me simple examples of how to use these algorithms for text document classification?

I don't need a code example, just the logic.

Pseudocode would help greatly.

Furkan Gözükara
  • 22,964
  • 77
  • 205
  • 342
  • Just a quick question. When you say that most examples are based on numbers, do you mean that the elements (documents in your case) are represented as a vector such as (1, 0.77, 0.4, ...)? – miguelmalvarez May 22 '13 at 14:38

3 Answers

9

The common approach is to use a bag-of-words model (http://en.wikipedia.org/wiki/Bag_of_words_model), where the classifier learns from the presence of words in a text. It is simple, but it works surprisingly well.

Also, there is a similar question here: Prepare data for text classification using Scikit Learn SVM
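
As a minimal sketch of this approach, assuming scikit-learn is available; the documents, labels, and class names below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training data (documents and class labels are made up)
train_docs = ["cheap watches buy now", "meeting agenda for monday",
              "buy cheap pills", "project status report"]
train_labels = ["spam", "work", "spam", "work"]

# Bag of words: each document becomes a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Any classifier can work on these vectors; Naive Bayes is a common baseline
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Classify a new document by mapping it into the same word space
X_new = vectorizer.transform(["buy cheap watches"])
print(clf.predict(X_new))  # -> ['spam']
```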

Pedrom
  • 3,823
  • 23
  • 26
  • I suppose this may be very inefficient since there may be hundreds of thousands of words. Am I incorrect? – Furkan Gözükara May 22 '13 at 19:06
  • @MonsterMMORPG Not necessarily, as not all words have the same relevance: you might want to ignore short words (fewer than three characters), maybe very long ones (> 10 characters), and the least frequent ones. A vector of 400-600 words should be fine and give you decent performance – Pedrom May 22 '13 at 19:11
  • 1
    What @Pedrom has described is called feature selection, where you select the most representative terms. The specific method he explains is feature selection based on document frequency, which is a very simple (although very powerful) way of limiting the information you process in order to increase efficiency and, in some cases, effectiveness (quality); a short sketch follows these comments. However, I disagree regarding the number of features. It depends largely on the collection, but I would say that you will need between 1000 and 3000 features for best performance, and I advise you to try several configurations. – miguelmalvarez May 23 '13 at 08:57
  • [This](http://faculty.cs.byu.edu/~ringger/Winter2007-CS601R-2/papers/yang97comparative.pdf) is a very nice paper comparing and explaining different feature selection metrics for text classification. You can also check [Sebastiani's](http://nmis.isti.cnr.it/sebastiani/Publications/ACMCS02.pdf) survey on text classification for extended information about classification in general, and feature selection in particular. – miguelmalvarez May 23 '13 at 08:59
  • Thanks all. I coded kNN and it works reasonably well. I will try this too. – Furkan Gözükara May 23 '13 at 11:45
  • 1
    @miguelmalvarez Nice comment Miguel, I very much agree with what you said. I just wanted to give a lower bound regarding the number of features; depending on the requirements and the domain of the problem, you might need that many features. – Pedrom May 23 '13 at 13:21
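
A rough sketch of the document-frequency-based feature selection discussed in the comments above; the length limits, the feature count, and the function name are illustrative assumptions, not fixed rules:

```python
from collections import Counter

def select_features(documents, num_features=1000, min_len=3, max_len=10):
    """Keep the num_features terms that occur in the most documents.
    Length limits and feature count follow the rough guidelines in the
    comments above; they are illustrative, not fixed rules."""
    doc_freq = Counter()
    for doc in documents:
        terms = set(doc.lower().split())  # count each term once per document
        doc_freq.update(t for t in terms if min_len <= len(t) <= max_len)
    # Highest document frequency first; keep only the top num_features terms
    return [term for term, _ in doc_freq.most_common(num_features)]
```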
3

You represent the terms that appear in documents as weights in a vector, where each index position holds the "weight" of one term. For instance, if we take the document "hello world", associate position 0 with the importance of "hello" and position 1 with the importance of "world", and measure importance as the number of times the term appears, the document is represented as d = (1, 1).

At the same time, a document containing only "hello" would be (1, 0).

This representation can be based on any measure of the importance of terms in documents, with term frequency (as suggested by @Pedrom) being the simplest option. The most common, yet still simple, technique is to apply TF-IDF, which combines how common a term is in the document with how rare it is in the collection.
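
As a rough sketch of the TF-IDF idea, using the classic tf * log(N/df) formulation (real libraries such as scikit-learn apply slight variants, e.g. smoothing); the function name and toy documents are illustrative:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Weight each term by tf * idf: frequent in the document, rare in the collection."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents does each term appear?
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency within this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["hello world", "hello", "goodbye world"])
```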

I hope this helps,

miguelmalvarez
  • 920
  • 6
  • 15
0

In the bag-of-words model, you can use the term frequencies and assign weights to them according to their occurrence in the new document and the training documents. After that, you can use a similarity function to calculate the similarity between the training and test documents.
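
For example, a minimal sketch of this idea using cosine similarity over term-frequency vectors and a 1-nearest-neighbour decision; the function names and the choice of similarity measure are illustrative assumptions:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(test_doc, train_docs, train_labels):
    # Assign the label of the most similar training document (1-NN)
    scores = [cosine_similarity(test_doc, d) for d in train_docs]
    return train_labels[scores.index(max(scores))]
```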

KHALID
  • 1