
I'm working on a project where I want to parse a text file using Python. The file consists of data entries in blocks of varying length, and a new entry begins after a blank line. This is what I would like to accomplish:

  1. Skip the first 16 lines.
  2. After the 16th line there is a blank line, which starts the first data entry.
  3. Read the following lines until another blank line is hit, appending each line to a list called data.
  4. Pass the list to a function that handles further processing.
  5. Repeat steps 3 and 4 until there is no more data in the file.

Here is an example of the file:

Header Info
More Header Info

Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Line10
Line11
Line12
Line13

MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo
MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2
MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3
MoreInfo4   MoreInfo4
FieldName1  0001    0001
FieldName1  0002    0002
FieldName1  0003    0003
FieldName1  0004    0004
FieldName1  0005    0005
FieldName2  0001    0001
FieldName3  0001    0001
FieldName4  0001    0001
FieldName5  0001    0001
FieldName6  0001    0001

MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo
MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2
MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3
MoreInfo4   MoreInfo4
FieldName1  0001    0001
FieldName1  0002    0002
FieldName1  0003    0003
FieldName1  0004    0004
FieldName1  0005    0005
FieldName2  0001    0001
FieldName3  0001    0001
FieldName4  0001    0001
FieldName5  0001    0001
FieldName6  0001    0001

Here is some code I've worked on. It is able to read the first block and append it to a list:

with open(loc, 'r') as f:
    for i in range(16):
        f.readline()

    data = []
    line = f.readline()
    if line == "\n":
        dataLine = f.readline()
        while dataLine != "\n":
            data.append(dataLine)
            dataLine = f.readline()

    #pass data list to function
    function_call(data)
    # reset data list here?
    data = []

How do I make it so that it works for the full file? My assumption was that using "with open", it acted as a "while not end of file". I tried adding a "while True" after skipping the first 16 lines. I have little knowledge of Python's parsing capabilities.

Thank you in advance for any help.

martineau
who_lee_oh
    First: 'My assumption was that using "with open", it acted as a "while not end of file".' That's wrong. `with open` doesn't do any looping; it just makes sure that the file you `open`ed gets `close`d when you're done. – abarnert May 26 '15 at 00:58
    More importantly: 'I tried adding a "while True" after skipping the first 16 lines' is a perfectly good approach. If it didn't work for you, obviously you got something wrong with it. If you show us the code you tried, we can show you how to fix it; if you just describe it, there's not much anyone can do for you. – abarnert May 26 '15 at 00:59
    You should look into using ``itertools.groupby()`` and create a key function that changes value when it sees a ``\n`` on its own. – James Mills May 26 '15 at 01:04
  • i.e: You need to "repeat" the block of code you've already written to read the first block of data. – James Mills May 26 '15 at 01:06

3 Answers


Adding a while True after the initial skipping should definitely work. Of course you have to get all the details right.

You could try to extend the approach you already have, with a nested while loop inside the outer loop. But it may be easier to think of it as a single loop. For each line, there are only three things you might have to do:

  • If there is no line because you're at EOF, break out of the loop, making sure to process the old data list (the last block in the file) first if it's non-empty.
  • If it's a blank line, start a new data list, making sure to process the old one first if it's non-empty.
  • Otherwise, append the line to the existing data list.

So:

with open(loc, 'r') as f:
    for i in range(16):
        f.readline()

    data = []
    while True:
        line = f.readline()
        if not line:
            if data:
                function_call(data)
            break
        if line == "\n":
            if data:
                function_call(data)
                data = []
        else:
            data.append(line)

There are a couple ways you could simplify this further:

  • Use a `for line in f:` loop instead of a while loop that repeatedly calls `f.readline()` and checks the result.
  • Use `itertools.groupby()` to transform the iterator of lines into an iterator of blank-line-separated groups of lines.
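A minimal sketch of the first simplification (the sample input is inlined with io.StringIO here so the snippet is self-contained; in the real script you'd keep `with open(loc, 'r') as f:` and your own processing function in place of the placeholder `function_call`):

```python
import io

# stand-in for the real file; 16 header lines, then blank-line-separated blocks
sample = "skip\n" * 16 + "\nLine1\nLine2\n\nLine3\nLine4\n"

blocks = []
def function_call(data):
    # placeholder for the real processing
    blocks.append([line.strip() for line in data])

f = io.StringIO(sample)
for i in range(16):
    f.readline()

data = []
for line in f:              # the for loop stops at EOF by itself
    if line == "\n":
        if data:            # a blank line ends the current block
            function_call(data)
            data = []
    else:
        data.append(line)
if data:                    # don't forget to flush the final block at EOF
    function_call(data)
```

The `for line in f:` form removes the need for the explicit EOF check, but the final block still has to be flushed after the loop, since the file may end without a trailing blank line.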
abarnert
  • Thank you so much @abarnert. This worked and helped me prevent more headaches. I will look into possibly refactoring the code using either the `for line in f:` loop or using groupby. – who_lee_oh May 26 '15 at 17:04

In case you are still struggling with this, here is an implementation that reads your sample data using `itertools.groupby()` and a key function `search()`:

from itertools import groupby, repeat

def search(d):
    """Key function used to group our dataset"""

    return d[0] == "\n"

def read_data(filename):
    """Read data from filename and return a nicer data structure"""

    data = []

    with open(filename, "r") as f:
        # Skip first 16 lines
        for _ in repeat(None, 16):
            f.readline()

        # iterate through each data block
        for newblock, records in groupby(f, search):
            if newblock:
                # we've found a new block
                # create a new row of data
                data.append([])
            else:
                # we've found data for the current block
                # append each row to the current block
                for row in records:
                    row = row.strip().split()
                    data[-1].append(row)

    return data

This will result in a data structure that is a nested list of blocks: each sublist holds one block's rows, with blocks separated by the blank lines in your data file.
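To see the grouping behaviour in isolation, here is a stripped-down sketch with a couple of inlined sample lines instead of a real file. Note that it relies on each block being preceded by a blank line, which holds for the question's file once the first 16 lines are skipped:

```python
from itertools import groupby
import io

# stand-in for the file handle after the 16-line skip:
# a blank line, a two-row block, another blank line, another two-row block
f = io.StringIO("\nA 1\nA 2\n\nB 1\nB 2\n")

data = []
for newblock, records in groupby(f, lambda d: d[0] == "\n"):
    if newblock:
        # a group of blank lines starts a new block
        data.append([])
    else:
        # split each record into fields and add it to the current block
        for row in records:
            data[-1].append(row.strip().split())
```

After this runs, `data` holds one sublist per block, each containing that block's rows split into fields.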

James Mills

The blocks in your file follow a pattern: each consists of a group of lines terminated either by a blank line or by the end of the file. This logic can be encapsulated in a generator function that yields the blocks of lines from your file iteratively, which simplifies the rest of the script.

In the following, getlines() is the generator function. Also note that the first 17 lines of the file are skipped (the 16 header lines plus the blank line that follows them) to get to the beginning of the first block.

from pprint import pformat

loc = 'parsing_test_file.txt'

def function(lines):
    print('function called with:\n{}'.format(pformat(lines)))

def getlines(f):
    lines = []
    while True:
        try:
            line = next(f)
            if line != '\n':  # not end of the block?
                lines.append(line)
            else:
                yield lines
                lines = []
        except StopIteration:  # end of file
            if lines:
                yield lines
            break

with open(loc, 'r') as f:
    for i in range(17):
        next(f)

    for lines in getlines(f):
        function(lines)

print('done')

Output using your test file:

function called with:
['MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo\n',
 'MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2\n',
 'MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3\n',
 'MoreInfo4   MoreInfo4\n',
 'FieldName1  0001    0001\n',
 'FieldName1  0002    0002\n',
 'FieldName1  0003    0003\n',
 'FieldName1  0004    0004\n',
 'FieldName1  0005    0005\n',
 'FieldName2  0001    0001\n',
 'FieldName3  0001    0001\n',
 'FieldName4  0001    0001\n',
 'FieldName5  0001    0001\n',
 'FieldName6  0001    0001\n']
function called with:
['MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo\n',
 'MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2\n',
 'MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3\n',
 'MoreInfo4   MoreInfo4\n',
 'FieldName1  0001    0001\n',
 'FieldName1  0002    0002\n',
 'FieldName1  0003    0003\n',
 'FieldName1  0004    0004\n',
 'FieldName1  0005    0005\n',
 'FieldName2  0001    0001\n',
 'FieldName3  0001    0001\n',
 'FieldName4  0001    0001\n',
 'FieldName5  0001    0001\n',
 'FieldName6  0001    0001\n']
done
martineau