Python file format for email classification with svm-light

Question

I am working with email subjects: I have 20 emails I want to classify, and a file with 20 lines, one email subject per line. I have been working on it, but I cannot figure out what the features refer to or what format the svm-light input file should have. Any tips on how to proceed would be helpful. Thanks in advance!

Edit: I have taken the tf-idf of the first 500 subject lines as a trial. However, according to svm-light format, we need:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>

I only have the tf-idf features for 500 lines. Sadly, svm-light does not read this file, because it expects feature:value pairs. Any ideas on what the value should be, or how I can change the file so that it can be read?

A sample of the file I have (features for the first 5 emails):

1 201 1.0
2 280 0.123165672613
2 313 0.343915400191
2 515 0.157569797284
2 588 0.343915400191
2 652 0.343915400191
2 657 0.343915400191
2 774 0.23622904941
2 921 0.283118375032
2 1158 0.254849368195
2 1240 0.343915400191
2 1348 0.343915400191
2 1362 0.222321349873
3 57 0.342220321154
3 185 0.391349077827
3 244 0.391349077827
3 300 0.391349077827
3 693 0.391349077827
3 730 0.342220321154
3 1391 0.391349077827
4 57 0.342220321154
4 185 0.391349077827
4 244 0.391349077827
4 300 0.391349077827
4 693 0.391349077827
4 730 0.342220321154
4 1391 0.391349077827
5 32 0.323558487577
5 102 0.323558487577
5 157 0.364177022553
5 160 0.364177022553
5 718 0.151013895297
5 1171 0.364177022553
5 1277 0.323558487577
5 1308 0.364177022553
5 1336 0.364177022553
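
For reference, rows like these can be regrouped into the svm-light layout quoted above. The following is only a minimal sketch, assuming the three columns are sample number, feature number, and tf-idf weight (the listing does not say so explicitly). The `labels` mapping and the function name `triples_to_svmlight` are hypothetical; the real targets would come from the categorised subjects.

```python
from collections import defaultdict


def triples_to_svmlight(triple_lines, labels):
    """Group 'sample feature value' rows by sample; return one svm-light line per sample."""
    by_sample = defaultdict(dict)
    for row in triple_lines:
        if not row.strip():
            continue  # skip blank rows
        sample, feature, value = row.split()
        # Keep the value text exactly as written in the input file.
        by_sample[int(sample)][int(feature)] = value

    out = []
    for sample in sorted(by_sample):
        # Emit feature:value pairs in increasing feature-number order.
        pairs = " ".join(f"{f}:{v}" for f, v in sorted(by_sample[sample].items()))
        out.append(f"{labels[sample]} {pairs}")
    return out


rows = [
    "1 201 1.0",
    "2 280 0.123165672613",
    "2 313 0.343915400191",
]
labels = {1: "+1", 2: "-1"}  # hypothetical targets, for illustration only

for line in triples_to_svmlight(rows, labels):
    print(line)
```

In this reading, the tf-idf weight itself is the `<value>` of each pair, and the feature number is the `<feature>`.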

Please help!

What are you asking for? Are you trying to automatically generate subject lines for email messages? Are you trying to match emails with lines from those emails? Please give an example input and output and demonstrate that you have some understanding of Python. — dg99, Dec 27 '13 at 21:00
I have used nltk library to create the tf-idf of the subject lines. I have 1000 emails, whose subject I am using and have categorised the subjects. Currently, I want to use these 1000 subject lines to train the classifier, but I am unsure of how to proceed. Thanks for any help! — student001, Dec 28 '13 at 09:07

tripleee · Answer 1 · 2014-01-22T10:30:06.867


If you make a feature out of each word, create a list of all unique words w(1)..w(n). Now feature(i) gets the value 1 if w(i) exists in the sample you are examining. (You could also make the value be equal to the number of occurrences, so that a feature which occurs multiple times gets more weight.)

Assuming the following samples;

1 My hovercraft is full of eels
2 Your account is suspended
3 This is it!

... you could extract the following dictionary;

001 My
002 hovercraft
003 is
 :
 :
009 suspended
010 This
011 it!

(The leading zeros are just to make the features look different than the other numbers in this exposition. Normally there should probably not be any leading zeros.)

The features for sample 1 are 001 through 006; for sample 3 they are 010, 003, and 011. The other features get the value 0. So the full representation of sample 3 would look like

3 001:0 002:0 003:1 004:0 005:0 ...

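(though zero-valued, i.e. absent, features can be skipped in svm-light's sparse format. Note also that in a real input file the first field is the target label, such as +1 or -1, rather than the sample number used here.)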
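
The walk-through above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: tokens are split on whitespace, the targets are made up for the example, and `build_vocabulary` and `to_svmlight_line` are hypothetical helper names, not part of svm-light. Only the non-zero features are written, in increasing feature order.

```python
def build_vocabulary(samples):
    """Map each unique token to a 1-based feature number, in order of first appearance."""
    vocab = {}
    for text in samples:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def to_svmlight_line(target, text, vocab, binary=True):
    """Format one sample as '<target> <feature>:<value> ...', omitting absent features."""
    counts = {}
    for token in text.split():
        feature = vocab[token]
        counts[feature] = counts.get(feature, 0) + 1
    pairs = []
    for feature in sorted(counts):  # increasing feature numbers
        # Either a presence flag (1) or the number of occurrences.
        value = 1 if binary else counts[feature]
        pairs.append(f"{feature}:{value}")
    return f"{target} " + " ".join(pairs)


samples = [
    "My hovercraft is full of eels",
    "Your account is suspended",
    "This is it!",
]
targets = [-1, 1, -1]  # made-up labels, purely for illustration

vocab = build_vocabulary(samples)
lines = [to_svmlight_line(t, s, vocab) for t, s in zip(targets, samples)]
for line in lines:
    print(line)
```

Passing `binary=False` writes the occurrence count instead of 1, which is the weighting variant mentioned at the start of this answer.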

However, given the small sample size (just subjects), it's unlikely that you will get very good results. Perhaps you'd be better off using e.g. character bigram or trigram features (split each word using a sliding window; for example, "trigram" yields tri, rig, igr, gra, ram).
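
A sliding-window split of this kind might look like the sketch below. The function name `char_ngrams` is hypothetical, and how to treat words shorter than the window is an open choice; here they simply produce no features.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams from sliding a window of width n over word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("trigram"))       # ['tri', 'rig', 'igr', 'gra', 'ram']
print(char_ngrams("trigram", n=2))  # ['tr', 'ri', 'ig', 'gr', 'ra', 'am']
```

Each distinct n-gram would then get its own feature number, in the same way as whole words do in the dictionary above.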

I don't think it makes sense to try to mix tf-idf with SVM; they are different approaches to the same fundamental problem.

edited Jan 22 '14 at 10:30

answered Dec 27 '13 at 21:13


Hello. I have increased the sample size, so I currently have 1000 email subjects and their categories. I thought of taking the tf-idf of the words and using it further. Does that sound right? Thanks for your help! – student001 Dec 28 '13 at 09:04
Better, but still small if you are restricting yourself to just the Subject line. Why are you ignoring the rest of the message? How do you plan to cope with an empty Subject? – tripleee Dec 28 '13 at 10:58
I am currently doing Topic Identification, and thus I am only using subject lines for now. The empty subject lines have been put into miscellaneous category. So, just to confirm, the tf-idf of the subject lines will be accepted as the input? I am doing it for the first time, and thus would rather confirm. – student001 Dec 28 '13 at 11:36
Hi, as mentioned, I have taken the tf-idf features of each subject line. However, I do not know what the 'value' in the input format refers to, and as a result svm-light is not reading my file. Any ideas on this one? Any help would be very much appreciated! – student001 Jan 21 '14 at 20:07
Updated the answer slightly. It's unlikely that you will get new people to look at this by commenting on an old answer; maybe try posting a more specific new question if you are still feeling stuck. – tripleee Jan 21 '14 at 20:36
Ok sure. Thanks a lot! I get some idea for now :) – student001 Jan 22 '14 at 04:17
