
I have a huge file (a corpus) that contains words and their POS tags, with some unrelated information in between that I want to delete. The unrelated information is just a run of characters. A single space separates the word, the irrelevant information, and the POS tag. Each word of a sentence is on its own line, and sentences are separated by two newlines (that is, a blank line). The file has the following format:

My RRT PRP
Name DFEE NN
is  PAAT VBZ
Selub KP NNP
. JUM .   

Sentence_2

I want to load this file into a list of sentences, where each sentence is a list of (word, tag) tuples, as follows:

[[('My', 'PRP'), ('Name', 'NN'), ('is', 'VBZ'), ('Selub', 'NNP'), ('.', '.')], ...]

As a Python beginner, I would appreciate any help.

selubamih
  • Welcome to SO. Unfortunately this isn't a discussion forum or tutorial service. Please take the time to read [ask] and the other links found on that page. Invest some time working your way through [the Tutorial](https://docs.python.org/3/tutorial/index.html), practicing the examples. It will give you an introduction to the tools Python has to offer and you may even start to get ideas for solving your problem. [Why “Can someone help me?” is not an actual question?](https://meta.stackoverflow.com/questions/284236/why-is-can-someone-help-me-not-an-actual-question) – wwii Apr 17 '18 at 02:36

1 Answer


I split your sentence into two so that the split is visible in the output:

My RRT PRP
Name DFEE NN

is  PAAT VBZ
Selub KP NNP
. JUM . 

We can use a generator that yields lists to divide our sentences:

def splitter(lines):
    sentence = []
    for line in lines:
        if not line.strip():  # empty line
            if not sentence:  # blanks before sentences
                continue
            else:  # about to start new sentence
                yield sentence
                sentence = []
        else:
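            # Keep the first field (word) and the last field (tag); drop whatever is in between
            word, *_, tag = line.split()
            sentence.append((word, tag))  # Add to current sentence
    if sentence:  # Yield the last sentence, unless the file ended with blank lines
        yield sentence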

with open('infile.txt') as f:
    list_of_sentences = list(splitter(f))  # consume the generator into a list
    print(list_of_sentences)
    # [[('My', 'PRP'), ('Name', 'NN')], [('is', 'VBZ'), ('Selub', 'NNP'), ('.', '.')]]
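
If you want to try the idea without creating a file first, here is a minimal self-contained sketch that runs the same kind of generator over an in-memory string. The names `split_sentences` and `sample` are made up for this illustration, and it assumes every non-blank line has at least a word and a tag separated by whitespace:

```python
import io


def split_sentences(lines):
    """Yield one list of (word, tag) tuples per blank-line-separated sentence."""
    sentence = []
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if not fields:  # blank line marks a sentence boundary
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumption: the first field is the word and the last is the tag;
        # anything in between is the irrelevant information and is dropped.
        sentence.append((fields[0], fields[-1]))
    if sentence:  # last sentence, if the input did not end with a blank line
        yield sentence


# Hypothetical sample data in the format from the question
sample = (
    "My RRT PRP\n"
    "Name DFEE NN\n"
    "\n"
    "is  PAAT VBZ\n"
    "Selub KP NNP\n"
    ". JUM . \n"
    "\n"
    "\n"
)

# io.StringIO iterates line by line, just like an open file would
sentences = list(split_sentences(io.StringIO(sample)))
print(sentences)
```

Because the generator only needs an iterable of lines, you can pass it an open file, an `io.StringIO`, or a plain list of strings. For a huge corpus you can also loop over the generator directly instead of calling `list()`, so only one sentence is held in memory at a time.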
Patrick Haugh