How to create the train file for svm light when a word occurs many times in a sentence

Question

I am using SVM-Light (http://svmlight.joachims.org/) and have a question about the training file format. I have the sentence "He is smart and he is a good student", which is a positive example. When I build a word list from this sentence, each word gets an index: {1 - he, 2 - is, 3 - smart, 4 - and, 5 - a, 6 - good, 7 - student}. Rewriting the sentence as word indices gives "1 2 3 4 1 2 5 6 7", and the value of each word is "1:0.4 2:0.2 3:0.8 4:0.3 1:0.2 2:0.4 5:0.5 6:0.7 7:0.6". The training file format requires feature indices in increasing order, so I sorted them (the leading 1 is the class label): "1 1:0.4 1:0.2 2:0.2 2:0.4 3:0.8 4:0.3 5:0.5 6:0.7 7:0.6". However, when I run svm_learn I get the error "Features must be in increasing order !!!". I think the error occurs because my sentence contains "he" twice and "is" twice, whereas in the example training file each feature appears only once. How should I solve this? Could you explain? Thank you very much.

score 0 · Answer 1 · answered May 14 '16 at 21:28

You can't have multiple values for the same feature in a single example. From what you wrote in your question, I think one solution is simply to ignore the fact that some words occur twice, since each occurrence has its own value in the sentence.

You can treat each word position as a feature, so a sentence has as many features as it has words. The first feature is the weight of the first word in the sentence, the second feature is the weight of the second word, and so on. For your example the feature vector is [1:0.4 2:0.2 3:0.8 4:0.3 5:0.2 6:0.4 7:0.5 8:0.7 9:0.6]. The problem with this approach is that different sentences have different lengths. In that case SVMlight effectively treats every sentence as having the length of the longest one, with the missing values taken to be zero. (This is the idea behind giving feature indices in the input: with sparse data you only need to list the features that have non-zero values.) So if the second sentence in your data happens to be 'He is not only smart but he is also a good student', the feature vector for the first sentence will be interpreted as [1:0.4 2:0.2 3:0.8 4:0.3 5:0.2 6:0.4 7:0.5 8:0.7 9:0.6 10:0.0 11:0.0 12:0.0].
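As a rough sketch of the positional approach, the snippet below writes one training line using the word position as the feature index. The function name `positional_line` is made up for illustration, and the label is assumed to be an integer such as +1 or -1.

```python
def positional_line(label, weights):
    """Format one SVM-light line where feature i is the i-th word's weight."""
    # Positions start at 1, so the indices are increasing by construction.
    features = " ".join(f"{i}:{w}" for i, w in enumerate(weights, start=1))
    return f"{label:+d} {features}"


# Weights of "He is smart and he is a good student", in sentence order.
weights = [0.4, 0.2, 0.8, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6]
positional = positional_line(1, weights)
print(positional)
```

Shorter sentences simply produce fewer features on their line; there is no need to write the trailing zeros explicitly.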

Another solution is to create a 'dictionary' as you did, and then combine the values for a word that occurs more than once in a sentence. You can combine them by taking the max/min value, the average, the sum, the product and so on. Which combination makes sense depends on the application domain. For example, if you take the sum of all values for a word, your feature vector for the dictionary {1 - he, 2 - is, 3 - smart, 4 - and, 5 - a, 6 - good, 7 - student} will be: 1:0.6 2:0.6 3:0.8 4:0.3 5:0.5 6:0.7 7:0.6
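A minimal sketch of the dictionary approach, assuming the input is a list of (index, value) pairs in sentence order. The helper name `to_svmlight_line` and its `combine` parameter are hypothetical, and rounding to four significant digits is my own choice to hide floating-point noise.

```python
from collections import defaultdict


def to_svmlight_line(label, pairs, combine=sum):
    """Merge repeated feature indices and format one SVM-light line."""
    grouped = defaultdict(list)
    for index, value in pairs:
        grouped[index].append(value)
    # Sorting the merged indices gives the strictly increasing order
    # that svm_learn expects, with each index appearing once.
    features = " ".join(
        f"{index}:{combine(values):.4g}"
        for index, values in sorted(grouped.items())
    )
    return f"{label:+d} {features}"


# "1 2 3 4 1 2 5 6 7" with the values from the question.
pairs = [(1, 0.4), (2, 0.2), (3, 0.8), (4, 0.3), (1, 0.2),
         (2, 0.4), (5, 0.5), (6, 0.7), (7, 0.6)]
summed = to_svmlight_line(1, pairs)
print(summed)
```

Passing `combine=max` or `combine=min` switches the merge rule without changing anything else.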
