Python file parsing -> IndexError

Question

I am parsing through an ISI file with a few hundred records that all begin with a 'PT J' tag and end with an 'ER' tag. I am trying to pull the tagged info from each record within a nested loop but keep getting an IndexError. I know why I am getting it, but does anyone have a better way of identifying the start of new records than checking the first few characters?

    while file:
        while line[1] + line[2] + line[3] + line[4] != 'PT J':
            ...                
            Search through and record data from tags
            ...

I am using this same method and therefore occasionally getting the same problem with identifying tags, so if you have any suggestions for that as well I would greatly appreciate it!

Sample data, which you'll notice does not always include every tag for each record, is:

    PT J
    AF Bob Smith
    TI Python For Dummies
    DT July 4, 2012
    ER

    PT J
    TI Django for Dummies
    DT 4/14/2012
    ER

    PT J
    AF Jim Brown
    TI StackOverflow
    ER

I would like to point out that I am converting this to a .txt as well before reading it. — MTP, Jul 06 '12 at 02:47

Ashwini Chaudhary · Answer 1 · 2012-07-06T03:08:23.813

3

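with open('data1.txt') as f:
    for line in f:
        if line.strip() == 'PT J':
            # the inner loop continues on the same file iterator,
            # so it picks up right after the 'PT J' line
            for line in f:
                if line.strip() != 'ER' and line.strip():
                    # do something with data
                    pass
                elif line.strip() == 'ER':
                    # this record ends here, move to the next record
                    break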

edited Jul 06 '12 at 03:08

answered Jul 06 '12 at 03:00

Ashwini Chaudhary


I think I see what's going on here. However, how would I access different lines to manipulate or test them? Since line is acting as an iterator, we can't say something to the effect of line=file.readline() within the nested 'if' statement. What would replace line=file.readline() to let me get to specific lines? I ask because in some instances there are multiple entities per tag. – MTP Jul 07 '12 at 03:54
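One way to address the follow-up about several entries per tag is to collect each record into a dictionary that maps a tag to a list of values, rather than handling lines one at a time. The sketch below is an illustration and not part of the original answer: the name `parse_records` is hypothetical, and it assumes that extra values for a tag sit on continuation lines that start with whitespace, which should be checked against the real file.

```python
import io

def parse_records(lines):
    """Yield one dict per record, mapping each tag to a list of values.

    Assumption: a line that starts with whitespace continues the
    previous tag; the sample data in the question does not confirm this.
    """
    record = None   # None means we are outside a 'PT J' ... 'ER' block
    tag = None      # most recent tag seen inside the current record
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue                      # skip blank separator lines
        if line == 'PT J':
            record, tag = {}, None        # start a new record
        elif record is None:
            continue                      # ignore anything outside a record
        elif line == 'ER':
            yield record                  # record is complete
            record = None
        elif raw[0].isspace():
            if tag is not None:           # continuation of the previous tag
                record[tag].append(line.strip())
        else:
            tag, _, value = line.partition(' ')
            record.setdefault(tag, []).append(value.strip())

# Made-up sample with two values under the 'AF' tag.
sample = io.StringIO(
    "PT J\n"
    "AF Bob Smith\n"
    "   Jim Brown\n"
    "TI Python For Dummies\n"
    "ER\n"
)
records = list(parse_records(sample))
print(records)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example `parse_records(f)` inside the `with open(...) as f:` block.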

score 2 · Accepted Answer · answered Jul 06 '12 at 02:51


Do the 'ER' lines only contain 'ER'? That would be why you're getting IndexErrors, because line[4] doesn't exist.

The first thing to try would be:

while not line.startswith('PT J'):

instead of your existing while loop.

Also, slices:

line[1] + line[2] + line[3] + line[4] == line[1:5]

(The ends of slices are noninclusive)
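Both suggestions avoid the IndexError because neither `startswith` nor a slice raises on a line that is too short; only single-index access does. A small illustration with made-up line values:

```python
short = 'ER\n'      # an end-of-record line as read from the file
start = 'PT J\n'    # a start-of-record line

# Single-index access fails on the short line.
try:
    short[4]
except IndexError as exc:
    print('IndexError:', exc)

# Slices are clamped to the string length, so they never raise.
print(repr(short[0:4]))          # 'ER\n'
print(repr(start[0:4]))          # 'PT J'

# startswith simply returns False when the line is too short.
print(short.startswith('PT J'))  # False
print(start.startswith('PT J'))  # True
```

Python strings are indexed from 0, so assuming the tag begins at the first character of the line, the four characters of 'PT J' are `line[0:4]`.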


Marius


Yes, 'ER' (End of Record) lines typically do not contain anything else, not even trailing spaces. – Klaus-Dieter Warzecha Jul 06 '12 at 08:15
I like your suggestion...I will have to play more with it. – MTP Jul 07 '12 at 02:57

Levon · Answer 3 · 2012-07-06T03:36:52.613

You could try an approach like this to read through your file.

with open('data.txt') as f:
    for line in f:
        line = line.split() # splits your line into a list of character sequences
                            # separated based on whitespace (blanks, tabs)
        llen = len(line)
        if llen == 2 and line[0] == 'PT' and line[1] == 'J': # found start of record
            # process
            # examine line[0] for 'tags', such as "AF", "TI", "DT" and proceed
            # as dictated by your needs.
            # e.g.,
            pass

        if llen > 1 and line[0] == "AF": # grab first/last name in line[1] and line[2]
            # The data will be on the same line and
            # accessible via the correct index values.
            pass

        if llen == 1 and line[0] == 'ER': # found end of record.
            pass

This definitely needs more "programming logic" (most likely nested loops, or better yet, calls to functions) to put everything in the right order, but the basic constructs are there. I hope this gets you started and gives you some ideas.
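As one possible way to add that logic, the sketch below moves the per-tag work into small functions and looks them up in a dictionary. All names here (`handle_author`, `handle_title`, `HANDLERS`, `process_line`) are hypothetical, and it uses `split(None, 1)` rather than a plain `split()` so that the text after the tag stays in one piece.

```python
def handle_author(value, record):
    # Store the full author name as one string.
    record.setdefault('authors', []).append(value)

def handle_title(value, record):
    # Keep the whole title, spaces included.
    record['title'] = value

# Hypothetical dispatch table: tag -> function that processes its value.
HANDLERS = {'AF': handle_author, 'TI': handle_title}

def process_line(line, record):
    """Split a line into tag and value and call the matching handler."""
    parts = line.split(None, 1)      # at most two pieces: tag, rest of line
    if not parts:
        return                       # blank line
    tag = parts[0]
    value = parts[1].strip() if len(parts) > 1 else ''
    handler = HANDLERS.get(tag)
    if handler is not None:          # tags without a handler are ignored
        handler(value, record)

record = {}
for line in ['AF Bob Smith\n', 'TI Python For Dummies\n', 'DT July 4, 2012\n']:
    process_line(line, record)
print(record)
```

Supporting another tag then means writing one more function and adding it to `HANDLERS`, without touching the loop that reads the file.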
