I am trying to incorporate the MALLET package into my Java code for a sequence labeling task. However, the only documentation I have found is the data import guide on the MALLET website, and it is not enough for me to work out how to do this. Can anybody help me out?
My first question is about importing sequence data. The only data structure I see on the website is InstanceList, but how should sequences be represented with it? For example, suppose we have multiple sequences (A, B, C are the labels): S1: A B B B B A B B; S2: B A B B B C; S3: C B A B B B. How should I put them into the training data? One InstanceList for S1, one for S2 and one for S3? And then how do I put them all together as training data?
My second question is about how to put the features into the instances. I already have the feature weights and the labels, so is there an easy way to build the instances directly? For example, if an item in a sequence has features [0.1, 0.2, 0.5, 0.4, 0.1] and label B, how can I store them in the Instance structure without going through a multi-step pipeline process?
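To make the two questions above concrete, here is a plain-Java sketch of the data as I have it in memory. LabeledSequence and TrainingDataSketch are hypothetical holder classes of my own, not MALLET classes, and every item reuses the single feature vector from my example as a placeholder, since the real values differ per item.

```java
// Hypothetical holder for one sequence; this is NOT a MALLET class.
final class LabeledSequence {
    final String name;
    final double[][] features; // features[t] = feature weights of the t-th item
    final String[] labels;     // labels[t]   = label of the t-th item

    LabeledSequence(String name, double[][] features, String[] labels) {
        if (features.length != labels.length) {
            throw new IllegalArgumentException("expected one feature vector per label");
        }
        this.name = name;
        this.features = features;
        this.labels = labels;
    }
}

final class TrainingDataSketch {
    // Placeholder: the one feature vector given in the question, reused for every item.
    static final double[] PLACEHOLDER = {0.1, 0.2, 0.5, 0.4, 0.1};

    // Builds one sequence from a space-separated label string such as "A B B".
    static LabeledSequence sequence(String name, String spaceSeparatedLabels) {
        String[] labels = spaceSeparatedLabels.split(" ");
        double[][] features = new double[labels.length][];
        for (int t = 0; t < labels.length; t++) {
            features[t] = PLACEHOLDER.clone();
        }
        return new LabeledSequence(name, features, labels);
    }

    // The three example sequences S1, S2 and S3.
    static LabeledSequence[] trainingData() {
        return new LabeledSequence[] {
            sequence("S1", "A B B B B A B B"),
            sequence("S2", "B A B B B C"),
            sequence("S3", "C B A B B B"),
        };
    }

    public static void main(String[] args) {
        for (LabeledSequence s : trainingData()) {
            System.out.println(s.name + ": " + s.labels.length + " items");
        }
    }
}
```

What I cannot work out is how to turn each such sequence, with its per-item feature vectors and labels, into MALLET's own types.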
I am also planning to use the CRF model for this task. In addition to the labels, I would like to get the probability of the whole sequence. Is it possible to get this information? I saw something like this on the website:
double logScore = new SumLatticeDefault(crf,inputSeq,outputSeq).getTotalWeight();
double logZ = new SumLatticeDefault(crf,inputSeq).getTotalWeight();
double prob = Math.exp(logScore - logZ);
Will this do what I want? And if so, what would inputSeq and outputSeq be here?
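To check my understanding of those three lines, here is a library-free sketch of what I believe they compute: the unnormalized log-score of one labeling, minus the log of the summed scores of all labelings, exponentiated. This is a toy linear-chain model in plain Java, not MALLET code. The class name, the method names and all the scores are made up for illustration, and I am assuming that logScore and logZ in the snippet play the roles of the two quantities below.

```java
// Toy linear-chain model in plain Java; illustrative only, NOT MALLET code.
final class SequenceProbabilitySketch {

    // Numerically stable log(exp(a) + exp(b)).
    static double logSumExp(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) return b;
        if (b == Double.NEGATIVE_INFINITY) return a;
        double max = Math.max(a, b);
        return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
    }

    // Unnormalized log-score of one labeling: per-position scores plus transition scores.
    static double logScore(double[][] emit, double[][] trans, int[] labels) {
        double s = emit[0][labels[0]];
        for (int t = 1; t < labels.length; t++) {
            s += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]];
        }
        return s;
    }

    // Log partition function: log of the summed exp-scores over ALL labelings,
    // computed with the forward algorithm instead of enumerating them.
    static double logZ(double[][] emit, double[][] trans) {
        int k = trans.length;
        double[] alpha = emit[0].clone();
        for (int t = 1; t < emit.length; t++) {
            double[] next = new double[k];
            for (int j = 0; j < k; j++) {
                double acc = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < k; i++) {
                    acc = logSumExp(acc, alpha[i] + trans[i][j]);
                }
                next[j] = acc + emit[t][j];
            }
            alpha = next;
        }
        double total = Double.NEGATIVE_INFINITY;
        for (double a : alpha) {
            total = logSumExp(total, a);
        }
        return total;
    }

    // Probability of the whole labeling given the input: exp(logScore - logZ).
    static double probability(double[][] emit, double[][] trans, int[] labels) {
        return Math.exp(logScore(emit, trans, labels) - logZ(emit, trans));
    }

    public static void main(String[] args) {
        // Made-up scores for 3 positions and 3 labels (0 = A, 1 = B, 2 = C).
        double[][] emit = { {0.2, 1.0, -0.5}, {0.0, 0.7, 0.1}, {1.2, -0.3, 0.4} };
        double[][] trans = { {0.1, 0.8, -0.2}, {0.5, 0.3, 0.0}, {-0.4, 0.6, 0.2} };
        int[] labels = {1, 1, 0}; // the labeling B B A
        System.out.println(probability(emit, trans, labels));
    }
}
```

If that reading is right, the probabilities of all possible labelings of one input sum to 1, so I would expect inputSeq to be the feature sequence of one whole sequence and outputSeq its label sequence, but I would appreciate confirmation.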