
I am new to MALLET. I am trying to use MALLET's SimpleTagger (CRF) and am experimenting with phrases. I looked through the documentation on the MALLET site and went through the user archives, but nothing helped.

I tried training SimpleTagger for simple tagging, and it works reasonably well. Here is what my data looks like (please note that there is a blank line between training instances to indicate that they are separate sequences):

Sample training data:

where STOPWORD
is STOPWORD
chicago CITY
<---Newline---->
Sunnyvale CITY
<---Newline---->
Chicago CITY
<---Newline---->
Washington CITY
<---Newline---->
What STOPWORD
is STOPWORD
Sunnyvale CITY
time ASK
<---Newline---->
new STOPWORD
<---Newline---->    
place STOPWORD 

The problem I have is with city names that consist of multiple words, for example:

new york CITY

Please note that in the training data above, "new" is a STOPWORD.

Questions:

  1. For SimpleTagger, is the above representation fine? If not, how do I represent phrases?
  2. How can I represent the data so that SimpleTagger/CRF can use the previous n words to arrive at a tag, i.e. chunk my input?
demongolem
rtuser

1 Answer


As far as I know, the format you have used for multi-word expressions is not correct. According to the SimpleTagger documentation, each input line has the format feature1 feature2 ... featureN label.

So, in your case, "new" is treated as feature1 and "york" as feature2 of a single token, rather than as two words of one city name.

I suggest using New_York so that a multi-word expression is treated as one word.
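As a rough sketch of that preprocessing step, known multi-word names could be merged into underscore-joined tokens before the training file is written. This is not part of MALLET: `join_phrases` and the phrase list are hypothetical names introduced here, and it assumes you already have a list of the multi-word expressions you care about.

```python
def join_phrases(tokens, phrases):
    """Merge known multi-word phrases into single underscore-joined tokens."""
    # Longest phrases first, so "new york city" would win over "new york".
    # Empty phrases are skipped because they would never consume a token.
    candidates = sorted(
        {tuple(p.lower().split()) for p in phrases if p.strip()},
        key=len,
        reverse=True,
    )
    out, i = [], 0
    while i < len(tokens):
        for phrase in candidates:
            n = len(phrase)
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window == phrase:
                # Replace the whole phrase with one token, keeping original case.
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts at this position; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical phrase list, for illustration only.
CITY_PHRASES = ["new york", "san francisco"]

tokens = join_phrases(["where", "is", "new", "york"], CITY_PHRASES)
labels = ["STOPWORD", "STOPWORD", "CITY"]
for token, label in zip(tokens, labels):
    print(token, label)
```

With this, "new york" becomes the single token `new_york` with the label CITY, while a standalone "new" elsewhere can still be labelled STOPWORD.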

Also note that you don't have to include the words themselves in the input data. If you do, they are treated as the first feature. So if the word text or the word lemma is not an interesting feature for you, leave it out of your input data.
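To illustrate the point, here is a minimal sketch of building one input line per token, where the word itself is just an optional first feature. The function name `feature_line` and the feature names `CAPITALIZED`, `PREV=...` and `<START>` are made up for this example and are not defined by MALLET. The previous-word feature is one possible way to give the tagger some context from earlier tokens.

```python
def feature_line(words, i, label, include_word=True):
    """Build one 'feature1 feature2 ... label' line for the token at index i."""
    word = words[i]
    feats = []
    if include_word:
        # The word text is only a feature; it can be left out entirely.
        feats.append(word.lower())
    if word[:1].isupper():
        feats.append("CAPITALIZED")
    # Hypothetical context feature: the previous word, or a start marker.
    prev = words[i - 1].lower() if i > 0 else "<START>"
    feats.append("PREV=" + prev)
    return " ".join(feats + [label])


words = ["What", "is", "Sunnyvale"]
labels = ["STOPWORD", "STOPWORD", "CITY"]
for i, label in enumerate(labels):
    print(feature_line(words, i, label))
```

Calling `feature_line(words, i, label, include_word=False)` produces the same lines without the word text, which matches the case where the word is not a useful feature.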

user1419243