What is the correct svmlight input format in Mallet?

Question

I am using Mallet with the SVMLight input format to do classification with the NaiveBayes classifier, but I get a NumberFormatException. I'm wondering how I can use string features with SVMLight. As I read in the guideline, the features can also be strings.

Can anyone tell me what is wrong with my code or input?

Here is my code:

public void trainMalletNaiveBayes() throws Exception {

    ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    pipes.add(new SvmLight2FeatureVectorAndLabel());
    pipes.add(new PrintInputAndTarget());

    SerialPipes pipe = new SerialPipes(pipes);

    // prepare training instances
    InstanceList trainingInstanceList = new InstanceList(pipe);
    trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/featureFiles_svm.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));

    // prepare test instances
    InstanceList testingInstanceList = new InstanceList(pipe);
    testingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/test_set.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));

    ClassifierTrainer trainer = new NaiveBayesTrainer();
    Classifier classifier = trainer.train(trainingInstanceList);
}

And here are the first three lines of my input file:

No f1:NP f2:NN f3:1 f4:1 f5:0 f6:0 f7:0 f8:0.0 f9:1 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:NN f17:NOTHING
No f1:NP f2:NN f3:8 f4:4 f5:0 f6:0 f7:1 f8:4.127134385045092 f9:8 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:DT f17:NOTHING
Yes f1:NP f2:NN f3:4 f4:3 f5:0 f6:0 f7:0 f8:0.0 f9:4 f10:true f11:false f12:false f13:false f14:false f15:NP f16:DT f17:NN

The first column is the label of the instance, and the rest of the line contains the features and their values. For example, NN is the POS tag of the head word of a phrase.

However, I get the exception for the NN (NumberFormatException: For input string: "NN"). I'm wondering why it has no problem with the NP that comes before it, but stops at the NN.

score 1 · Accepted Answer · answered Sep 22 '17 at 09:04

All features need to have numeric values. For booleans you can use true=1 and false=0. Categorical values have to become part of the feature name, so you would also have to change f1:NP to f1_NP:1.
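As a rough illustration of that conversion, here is a minimal sketch in plain Java. The class `SvmLightLineConverter` and its method `convertLine` are hypothetical helpers written for this example and are not part of Mallet. The sketch assumes the `name:value` layout shown in the question, and it drops zero-valued features, as suggested in the comments below.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Mallet): rewrites one line of the
// question's input so that every feature has a numeric value.
class SvmLightLineConverter {

    static String convertLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        out.add(tokens[0]); // the label ("Yes"/"No") is kept unchanged

        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            String name = tokens[i].substring(0, colon);
            String value = tokens[i].substring(colon + 1);

            // Booleans: true becomes 1, false is a zero and is left out.
            if (value.equals("true") || value.equals("false")) {
                if (value.equals("true")) {
                    out.add(name + ":1");
                }
                continue;
            }

            try {
                // Numeric values are kept as written unless they are zero.
                // (Double.parseDouble also accepts strings such as "NaN";
                // that does not occur in the question's data.)
                if (Double.parseDouble(value) != 0.0) {
                    out.add(name + ":" + value);
                }
            } catch (NumberFormatException e) {
                // Categorical value: fold it into the feature name.
                out.add(name + "_" + value + ":1");
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String line = "No f1:NP f2:NN f3:1 f5:0 f8:0.0 f10:true f11:false f15:ROOT";
        // prints: No f1_NP:1 f2_NN:1 f3:1 f10:1 f15_ROOT:1
        System.out.println(convertLine(line));
    }
}
```

Running the converter once over the training and test files before loading them should give lines that an SVMLight-style parser can read without hitting non-numeric values.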

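The reason it's not dying on the NP is that the SvmLight2FeatureVectorAndLabel class expects to parse an entire line (label and data), but the code reads the file with a CsvIterator that splits off the first element as the label. The pipe therefore treats the first remaining token, f1:NP, as the label, and the first value it tries to parse as a number is the NN in f2:NN.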
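The effect can be reproduced without Mallet. The sketch below applies the same regular expression that the question passes to CsvIterator and then feeds the remaining text to a simplified stand-in for an SVMLight-style parser. `LabelSplitDemo`, `dataAfterCsvSplit`, and `parseLabel` are names invented for this example; `parseLabel` only mimics the "first token is the label, the rest are name:number pairs" behavior and is not Mallet's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: none of this is Mallet code.
class LabelSplitDemo {

    // The line regex from the question's CsvIterator call
    // (group 1 is used as the target, group 2 as the data).
    static final Pattern CSV_LINE = Pattern.compile("^(\\S*)[\\s,]*(.*)$");

    // Returns group 2, i.e. what is left once the first token is split off.
    static String dataAfterCsvSplit(String line) {
        Matcher m = CSV_LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unparseable line: " + line);
        }
        return m.group(2);
    }

    // Simplified stand-in for an SVMLight-style parser: the first token is
    // taken as the label, every later token must be name:number.
    static String parseLabel(String data) {
        String[] tokens = data.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            Double.parseDouble(pair[1]); // throws NumberFormatException for "NN"
        }
        return tokens[0];
    }

    public static void main(String[] args) {
        String data = dataAfterCsvSplit("No f1:NP f2:NN f3:1");
        System.out.println(data); // prints: f1:NP f2:NN f3:1

        try {
            parseLabel(data);
        } catch (NumberFormatException e) {
            // "f1:NP" was silently taken as the label, so NN is the first failure.
            System.out.println(e.getMessage());
        }
    }
}
```

The same stand-in also shows why, after the features are made numeric but the CsvIterator is kept, a feature such as f1_NP:1 ends up reported as the target.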

The classify.tui.SvmLight2Vectors class uses this code for an iterator:

new SelectiveFileLineIterator (fileReader, "^\\s*#.+")

David Mimno

1,836
7
7

Thanks for your reply. Should I also add all the other features with a zero value to the line? For example, when a feature has the value NP, it means that it is not VP, S, FRAG, etc. Should I also add f2_VP:0, f3_S:0, etc.? I mean, should I convert my categorical features to numerical features? Then I will have a really sparse feature vector, right? – user1419243 Sep 22 '17 at 09:41
Convert categories to features, leave out anything that's got a zero value and it will be handled efficiently. – David Mimno Sep 22 '17 at 13:27
Thanks. It works with no error now :) Just another question, with the above format and the written code, I get name: csvline:1 target: f1_NP:1 input: f2(0)=0.0 f3(1)=0.0 f4(2)=2.65 ... It seems that it is not reading my target correctly and taking a feature as the target. Is my code or input format somewhere wrong or the PrintInputAndTarget() does not work with SVMLight, and is just for the other format? – user1419243 Sep 22 '17 at 13:35
Oh, sorry, I hadn't understood your answer correctly. I solved my problem by changing CsvIterator to SelectiveFileLineIterator, and now it works perfectly :) – user1419243 Sep 22 '17 at 14:35
