from nltk.corpus.reader.conll import ConllCorpusReader
READER = ConllCorpusReader(
    root="./",
    fileids=".conll",
    columntypes=('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore'),
)
READER.sents(myConLLfile)
I'm extracting each sentence as a list of strings from a .conll file. The code above doesn't report any errors, so I assume something is extracted for each sentence. Yet when I try to print the sentences or add POS tags to them, the ValueError below is raised for every sentence after the 1007th one.
- What's happening? Is there a way to see those extracted but ill-structured sentences?
- How can I extract the sentences properly? My guess is that some tokens are represented as a (string, IOB tag) tuple instead of a plain string, but then it's odd to get the same error for so many sentences.
- Worst case, can I extract only the well-formed sentences?
i = 0
for sentence in READER.sents(myConLLfile):
    print(i)
    print(sentence)
    i += 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-125-9c03d8d69ec0> in <module>()
1 i = 0
----> 2 for sentence in READER.sents(myConLLfile):
3 print(i)
4 print(sentence)
5 i += 1
2 frames
/usr/local/lib/python3.6/dist-packages/nltk/corpus/reader/conll.py in _read_grid_block(self, stream)
206 if len(row) != len(grid[0]):
207 raise ValueError('Inconsistent number of columns:\n%s'
--> 208 % block)
209 grids.append(grid)
210 return grids
ValueError: Inconsistent number of columns:
This O
guy O
needs O
his O
own O
show O
on O
Discivery B-corporation
Channel I-corporation
! O
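
One way to look at the offending blocks without going through NLTK is to split the file on blank lines yourself and compare the column count of every row in a block. The snippet below is only a minimal sketch of that idea: `split_blocks`, `partition_blocks` and the `sample` text are made up for illustration, and the splitting only approximates what `ConllCorpusReader` does internally. Printing rows with `repr` can expose invisible whitespace (for example a non-breaking space inside a token), which is one possible cause of an extra column, though not necessarily the cause in this file.

```python
def split_blocks(text):
    """Split CoNLL-style text into sentence blocks separated by blank lines."""
    blocks, current = [], []
    for line in text.split("\n"):
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def partition_blocks(blocks):
    """Return (good, bad) lists of (block_index, rows).

    A block is "good" when every row has as many whitespace-separated
    columns as the first row of that block.
    """
    good, bad = [], []
    for index, block in enumerate(blocks):
        rows = [line.split() for line in block]
        if all(len(row) == len(rows[0]) for row in rows):
            good.append((index, rows))
        else:
            bad.append((index, rows))
    return good, bad


# Invented sample data: the second block hides a non-breaking space
# (\u00a0) inside a token, so str.split() yields three columns for that row.
sample = (
    "This O\n"
    "guy O\n"
    "\n"
    "Discovery\u00a0Channel B-corporation\n"
    "! O\n"
)

good, bad = partition_blocks(split_blocks(sample))

# Show the ill-structured blocks, row by row, with repr() to reveal odd characters.
for index, rows in bad:
    print("block", index)
    for row in rows:
        print("   ", len(row), repr(row))

# Keep only the words (first column) of the well-formed sentences.
good_sentences = [[row[0] for row in rows] for _, rows in good]
print(good_sentences)
```

To try this on the real data, read the file into a string first (for instance with `open(path, encoding="utf-8").read()`, where `path` points at the .conll file) and pass that string in place of `sample`.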