I have a huge file (a corpus) that contains words and their POS tags, but also some unrelated information in between that I want to delete. The unrelated information consists only of some number of characters. On each line, a single space separates the word, the irrelevant information, and the POS tag. Each word of a sentence is on its own line, and sentences are separated by two newlines. The file has the following format:
My RRT PRP
Name DFEE NN
is PAAT VBZ
Selub KP NNP
. JUM .

Sentence_2
I want to store the information from this file as a list of sentences, where each sentence is a list of (word, POS tag) pairs, as follows:
[[('My', 'PRP'), ('Name', 'NN'), ('is', 'VBZ'), ('Selub', 'NNP'), ('.', '.')], ...]
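To make the goal concrete, here is a minimal sketch of the kind of function I am after. It rests on assumptions I have not verified against the whole corpus: the word is always the first space-separated field on a line, the POS tag is always the last, and a blank line marks the end of a sentence. The name parse_corpus and the second sentence in the sample are made up for illustration.

```python
def parse_corpus(lines):
    """Turn an iterable of text lines into a list of sentences.

    Each sentence is a list of (word, tag) tuples. Assumes the word is the
    first whitespace-separated field on a line and the tag is the last one;
    everything in between is ignored.
    """
    sentences = []
    current = []
    for line in lines:
        fields = line.split()
        if fields:
            # Keep only the first and last fields (word and POS tag).
            current.append((fields[0], fields[-1]))
        elif current:
            # A blank line closes the sentence collected so far.
            sentences.append(current)
            current = []
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences


# Small in-memory sample; the second sentence is hypothetical.
sample = (
    "My RRT PRP\n"
    "Name DFEE NN\n"
    "is PAAT VBZ\n"
    "Selub KP NNP\n"
    ". JUM .\n"
    "\n"
    "Hello XY UH\n"
)
result = parse_corpus(sample.splitlines())
print(result)
```

Because the function only iterates over lines, an open file object could be passed in directly (for example, with open("corpus.txt", encoding="utf-8") as f: sentences = parse_corpus(f), where corpus.txt is a placeholder name), so the file would be read line by line rather than loaded as one string. The words are kept exactly as they appear in the file.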
As a beginner in Python, I would appreciate any help.