Reading big files in NLTK Corpus Reader

Asked Mar 03 '16 at 20:41

Active Mar 04 '16 at 03:29

I am trying to create a tagged corpus with:

from nltk.corpus.reader import TaggedCorpusReader
reader = TaggedCorpusReader('.', r'.*\.pos')

This works, but the reader seems to treat each .pos file as a single sentence, even though one file may contain multiple lines. How can I get each line as a separate sentence?
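For reference, here is a minimal sketch of the behaviour I expected. It assumes the default settings of TaggedCorpusReader, where (as far as I understand) every real newline ends a sentence and a blank line ends a paragraph. The file name sample.pos, the temporary directory, and the two short lines are made up for illustration and are not my actual data.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical sample: two lines separated by a real newline character.
sample = "part/NN of/PP speech/NN tagging/NN\nhow/WH to/PP train/NN a/DT tagger/NN\n"

# Write the sample into a temporary corpus root (illustrative location only).
root = tempfile.mkdtemp()
with open(os.path.join(root, "sample.pos"), "w") as f:
    f.write(sample)

# Same call as in the question, pointed at the temporary root.
reader = TaggedCorpusReader(root, r".*\.pos")

sents = list(reader.sents())                # lists of words, one list per line
tagged = list(reader.tagged_sents())        # lists of (word, tag) pairs, one list per line

print(len(sents))
for sent in tagged:
    print(sent)
```

If this reports two sentences for the sample but only one for a real file, the difference is probably in the file contents rather than in the reader call.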

Please point out what I am doing wrong. I am using Python 2.x with NLTK 3.1 on MS Windows.

I am experimenting with small multiline files such as:

part/NN of/PP speech/NN tagging/NN is/AV the/DT process/NN of/PP identifying/NN the/DT part/NN of/PP speech/NN tag/NN for/PP a/DT word/NN most/JJ of/PP the/DT time/NN a/DT tagger/NN must/NN first/ADJ be/ADJ trained/VB on/PRP a/DT training/NN corpus/NN ./.How/WH to/PP train/NN and/CONJ use/VV a/DT tagger/NN is/AV covered/VB in/PRP detail/NN in/JJ chapter/NN 4/NN part/NN of/ADJ speech/NN tagging/NN but/PRP first/ADJ we/PRP must/JJ know/VB how/WH to/PRP create/VB and/CONJ use/VB a/DT training/ADJ corpus/NN of/PRP part/NN of/PRP speech/NN tagged/NN words/NN.

I checked the raw format of the Brown corpus and tried putting \n between two lines, but that did not help.
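One thing that may be worth ruling out (an assumption, not a confirmed diagnosis) is whether the file contains a real line break or just the two literal characters backslash and n typed into an editor. The sketch below uses only the standard library; the helper name describe_line_breaks and the two short strings are hypothetical.

```python
def describe_line_breaks(text):
    """Count real line breaks and literal backslash-n sequences in text."""
    return {
        "real_newlines": text.count("\n"),
        "literal_backslash_n": text.count("\\n"),
    }

# Literal backslash followed by "n", as it would look if typed into the file by hand.
one_line = "a/DT tagger/NN ./.\\nHow/WH to/PP train/NN"

# A real line break between the two sentences.
two_lines = "a/DT tagger/NN ./.\nHow/WH to/PP train/NN"

print(describe_line_breaks(one_line))
print(describe_line_breaks(two_lines))
```

To apply the same check to an actual corpus file, the text could be read with open(path).read() and passed to the helper.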

Coeus2016

Can you post a snippet of your input files? – alvas Mar 03 '16 at 22:23
