Corpus file format for Moses

Question

I'm using Moses to build a language model.

I followed the instructions from this link: Baseline System: Moses

I have a Google 1-gram file that looks like this:

</S>    95119665584
<S>     95119665584
,       30578667846
.       22077031422
<UNK>   21594821357
the     19401194714
-       16337125274
of      12765289150
and     12522922536

That means that the word "of" appeared 12,765,289,150 times.

Now I want to build a language model from this file (the "Build Language Model" step), but I don't know whether this file format will work with Moses.

The tutorial works with "europarl-v6.en", but I can't find that file on the web to check its format.

LAST EDIT:

I need to represent each letter as a word, so "hello" becomes "h e l l o".

After splitting each word this way, which format should I use?

Should it be:

o f
o f
o f
a n d
a n d

Or like the original format:

o f       12765289150
a n d     12522922536

Or maybe some other format?

Does it still count as a Google n-gram file?

I followed the link "How can I use the Google Web N-gram corpus to build an LM", as @MukundKRoy suggested, but I don't know how to apply it in my case (1-gram, 2-gram, ...: after splitting, the n-gram order in my new file isn't constant).

I'd be glad if someone could tell me, as simply as possible, what format this file should have so I can use it with SRILM. Thanks.
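The character-splitting step described in the question can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the decision to leave markers such as `<S>` and `<UNK>` unsplit is an assumption, and the sketch produces the second candidate layout (characters followed by the original count) without claiming that this is the layout Moses or SRILM expects.

```python
def split_chars(word):
    """Turn a word into space-separated characters: 'hello' -> 'h e l l o'."""
    # Assumption: special markers like <S>, </S>, <UNK> are kept intact.
    if word.startswith("<") and word.endswith(">"):
        return word
    return " ".join(word)


def convert_line(line):
    """Convert 'word<whitespace>count' into 'c h a r s<TAB>count'."""
    word, count = line.split()
    return f"{split_chars(word)}\t{count}"


if __name__ == "__main__":
    # Two lines taken from the 1-gram sample in the question.
    sample = ["of      12765289150", "and     12522922536"]
    for line in sample:
        print(convert_line(line))
```

Note that after splitting, "of" becomes a two-token sequence and "and" a three-token sequence, which is why the n-gram order of the converted file is no longer constant.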

score 1 · Accepted Answer · answered Jan 21 '13 at 15:08

SRILM takes care of the 1-, 2-, 3-grams and so on; you don't need to worry about that.

I've done something similar; take a look here:

Moses Installation and Training Run-Through

In PART II - Build a Model, section Build Language Model, it works perfectly with Google n-grams.

Let me know if that works for you.

score 0 · Answer 2 · answered Jan 17 '13 at 03:42

You can use the CMU-Cambridge Statistical Language Modeling Toolkit to build your language model. See wfreq2vocab and text2wngram. I think an LM in this format will work fine with Moses.

Mukund K Roy

Thanks, but I have to use Moses. Do you know what the file format is? – Guy P Jan 17 '13 at 05:47
Moses can use both SRILM and IRSTLM; I use SRILM. Since you have only unigram data, there will be no backoff weight. So, per the SRILM format, you need the "probability", then the "unigram", with the third column left blank. Give this a try... – Mukund K Roy Jan 17 '13 at 13:21
I'm using SRILM too; please look at my first edited post. Do you know which format SRILM works with? – Guy P Jan 17 '13 at 13:56
If you are using SRILM, then follow the process described at http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html, Section B6. Try it and let me know. – Mukund K Roy Jan 17 '13 at 14:11
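The unigram layout mentioned in the comments above (a probability, then the word, with no backoff weight) can be sketched in Python as a minimal ARPA-style file. This is a sketch under stated assumptions, not the procedure from the SRILM FAQ: the function name `unigram_arpa` is hypothetical, the probabilities are plain relative frequencies with no smoothing, and whether Moses loads such a file unchanged has not been checked here.

```python
import math


def unigram_arpa(counts):
    """Build an ARPA-style unigram LM string from a {word: count} mapping.

    Assumption: unsmoothed maximum-likelihood estimates, written as
    log10 probabilities with the backoff column left empty.
    """
    total = sum(counts.values())
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word, count in counts.items():
        logprob = math.log10(count / total)
        lines.append(f"{logprob:.6f}\t{word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Counts taken from the 1-gram sample in the question.
    counts = {"of": 12765289150, "and": 12522922536}
    print(unigram_arpa(counts), end="")
```

In practice a toolkit such as SRILM would normally estimate and smooth these probabilities itself; the sketch only makes the column layout concrete.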
