Is it possible to read my own training data / data set in this code?

Question

I need help. I have already asked the owner this question, but it has not been answered yet. Could someone please tell me whether it is possible to change the parameter in this part? I'm just starting to learn Python with NLTK and haven't tried any customization yet. I want to use this awesome MaxEnt script made by Arne Neumann to analyze the Indonesian language. I already have the data set.

# Load tagged sentences from one of the two corpora the script supports.
if corpus.lower() == "brown":
    from nltk.corpus import brown
    tagged_sents = brown.tagged_sents()[:num_sents]
elif corpus.lower() == "treebank":
    from nltk.corpus import treebank
    tagged_sents = treebank.tagged_sents()[:num_sents]
else:
    # The parenthesized form runs under both Python 2 and Python 3.
    print("Please load either the 'brown' or the 'treebank' corpus.")

Is it possible to change the corpus parameter so that it points to another document? I am planning to use an Indonesian document made up of tweets. So far, I have a data set of Indonesian words ( https://github.com/drr3d/BimaNLP/tree/master/dataset ). Can this maxent-pos-tagger work with it the same way it works with the given data sets? Thank you very much!

Yes, it is possible to read and use your own data. You just use the appropriate nltk "corpus reader", which see. But to use the tagger you found on Indonesian, you first need to train it on Indonesian data that is already *tagged*. Do you have a tagged corpus of Indonesian? — alexis, Dec 01 '16 at 08:36
Thanks for your response. Yes, I already have it. Like this, right? https://github.com/drr3d/BimaNLP/blob/master/dataset/tb_tagged_katadasar.txt — Fregy, Dec 01 '16 at 11:08
Pumpkin, that's a POS dictionary not a corpus. :-( You need full sentences to train a tagger. Lots of words are ambiguous, and can only be tagged in context-- that's why we need taggers in the first place. Take a look at `nltk.corpus.brown.raw()[:1000]` to see what you need to have. — alexis, Dec 01 '16 at 13:26
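To make the point about ambiguity concrete, here is a minimal sketch in plain Python. The two English sentences and their tags are made up for illustration and do not come from any real corpus. It collects every tag each word receives and reports the words that have more than one, which is exactly the case a word-to-tag dictionary cannot resolve.

```python
from collections import defaultdict

# Hypothetical, hand-tagged sentences (illustrative only).
tagged_sents = [
    [("I", "PRON"), ("read", "VERB"), ("a", "DET"), ("book", "NOUN")],
    [("They", "PRON"), ("book", "VERB"), ("a", "DET"), ("room", "NOUN")],
]

# Collect the set of tags observed for each word form.
tags_by_word = defaultdict(set)
for sent in tagged_sents:
    for word, tag in sent:
        tags_by_word[word].add(tag)

# Words seen with more than one tag need context to be tagged correctly.
ambiguous = {
    word: sorted(tags)
    for word, tags in tags_by_word.items()
    if len(tags) > 1
}

print(ambiguous)
```

Here "book" is a noun in one sentence and a verb in the other, so only the surrounding words can tell a tagger which tag applies.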
Sorry for my late response; I had some problems with my home connection a few days ago. Oh, I see, so I must gather all the Indonesian tweets into one document and then input it as a corpus to train the tagger? So far, what I understand from your explanation is this processing flow: IndonesianTweetst.txt --> Maxent Tagger --> POSDictionaryfromtweet.txt? Anyway, big thanks for your explanation. It really helps me a lot in studying more for my project. — Fregy, Dec 05 '16 at 10:33
A corpus can consist of many files. But you need **tagged** text to train a tagger. Do a little reading and come back to the site when you have started writing code. — alexis, Dec 05 '16 at 10:36
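As a rough illustration of what "tagged text" means here, the sketch below reads a file with one sentence per line and tokens written as `word/TAG`, and returns a list of sentences of `(word, tag)` pairs, the same shape that `brown.tagged_sents()` yields. It is written for Python 3 and uses only the standard library. The helper name `read_tagged_sents`, the file name, the `word/TAG` layout, the sample Indonesian sentences, and their tags are all assumptions made for this example, not part of the original script. NLTK's `TaggedCorpusReader` reads a similar layout and is probably the better choice in real code.

```python
import os
import tempfile


def read_tagged_sents(path, sep="/"):
    """Read one sentence per line, with tokens written as word<sep>TAG.

    Hypothetical helper: returns a list of sentences, each a list of
    (word, tag) tuples.
    """
    tagged_sents = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            sent = []
            for token in tokens:
                if sep in token:
                    # Split on the LAST separator so words containing
                    # the separator keep their tag intact.
                    word, _, tag = token.rpartition(sep)
                else:
                    word, tag = token, None  # untagged token
                sent.append((word, tag))
            tagged_sents.append(sent)
    return tagged_sents


# Demo with a tiny, made-up tagged file (illustrative sentences and tags).
sample = "saya/PRON makan/VERB nasi/NOUN\ndia/PRON membaca/VERB buku/NOUN\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "indonesian_tagged.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    tagged_sents = read_tagged_sents(path)

num_sents = 1
print(tagged_sents[:num_sents])
```

A list like this could then stand in for `tagged_sents` in an extra branch of the script's `if`/`elif` chain, provided the rest of the script only relies on that list-of-tuples shape.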
