I am parsing through an ISI file with a few hundred records that all begin with a 'PT J' tag and end with an 'ER' tag. I am trying to pull the tagged info from each record within a nested loop but keep getting an IndexError. I know why I am getting it, but does anyone have a better way of identifying the start of new records than checking the first few characters?

    while file:
        while line[1] + line[2] + line[3] + line[4] != 'PT J':
            ...                
            Search through and record data from tags
            ...

I am using this same method and therefore occasionally getting the same problem with identifying tags, so if you have any suggestions for that as well I would greatly appreciate it!

Sample data, which you'll notice does not always include every tag for each record, is:

    PT J
    AF Bob Smith
    TI Python For Dummies
    DT July 4, 2012
    ER

    PT J
    TI Django for Dummies
    DT 4/14/2012
    ER

    PT J
    AF Jim Brown
    TI StackOverflow
    ER
MTP
  • I would like to point out that I am converting this to a .txt as well before reading it. – MTP Jul 06 '12 at 02:47

3 Answers

with open('data1.txt') as f:
    for line in f:
        if line.strip() == 'PT J':         # start of a record
            for line in f:                 # keeps consuming the same file iterator
                stripped = line.strip()
                if stripped == 'ER':
                    # this record ends here; move on to the next record
                    break
                elif stripped:
                    # placeholder: do something with the data
                    print(stripped)
Ashwini Chaudhary
  • I think I see what's going on here. However, how would I access different lines to manipulate or test them? Since line is acting as an iterator, we can't say something to the effect of line=file.readline() within the nested 'if' statement. What would replace line=file.readline() to let me get to specific lines? I ask because in some instances there are multiple entities per tag. – MTP Jul 07 '12 at 03:54
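
One way to get at specific lines, as the comment above asks, is to collect each record's lines into a list first, so they can be indexed freely afterwards. This is only a minimal sketch, assuming records are delimited exactly as in the sample data; `iter_records` is a hypothetical helper name, and the in-memory `sample` stands in for the real file.

```python
import io

def iter_records(lines):
    """Yield each record as a list of stripped lines between 'PT J' and 'ER'."""
    record = None                      # None means we are outside a record
    for line in lines:
        stripped = line.strip()
        if stripped == 'PT J':
            record = []                # start a new record
        elif stripped == 'ER':
            if record is not None:
                yield record           # record complete
            record = None
        elif stripped and record is not None:
            record.append(stripped)    # a tagged line inside the record

# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO(
    "PT J\nAF Bob Smith\nTI Python For Dummies\nER\n\n"
    "PT J\nTI Django for Dummies\nER\n"
)
records = list(iter_records(sample))
print(records[0][1])   # second line of the first record: TI Python For Dummies
```

With a real file, `iter_records(f)` inside a `with open(...) as f:` block should behave the same way, since a file object also yields lines when iterated.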

Do the 'ER' lines only contain 'ER'? That would be why you're getting IndexErrors, because line[4] doesn't exist.

The first thing to try would be:

while not line.startswith('PT J'):

instead of your existing while loop.

Also, slices:

line[1] + line[2] + line[3] + line[4] == line[1:5] 

(The ends of slices are noninclusive)
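
To illustrate why slicing and `startswith` avoid the IndexError: both tolerate strings shorter than the range being checked, whereas indexing does not. A small sketch using an 'ER' line like the ones in the sample data (note that indexing starts at 0, so the first four characters are `line[0:4]`):

```python
line = 'ER\n'                      # a short line, as read from the file

try:
    line[4]                        # indexing past the end raises
except IndexError:
    print('IndexError')

print(repr(line[0:4]))             # slicing just returns what is there: 'ER\n'
print(line.startswith('PT J'))     # False, and no exception
print('PT J\n'[0:4] == 'PT J')     # True: the first four characters are indices 0-3
```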

Marius

You could try an approach like this to read through your file.

with open('data.txt') as f:
    for line in f:
        line = line.split()  # splits your line into a list of character sequences
                             # separated based on whitespace (blanks, tabs)
        llen = len(line)
        if llen == 2 and line[0] == 'PT' and line[1] == 'J':  # found start of record
            # process
            # examine line[0] for 'tags', such as "AF", "TI", "DT" and proceed
            # as dictated by your needs, e.g. as in the "AF" check below
            pass

        if llen > 1 and line[0] == "AF":  # grab first/last name in line[1] and line[2]
            # The data will be on the same line and
            # accessible via the correct index values.
            pass

        if llen == 1 and line[0] == 'ER':  # found end of record
            pass

This definitely needs more "programming logic" (most likely nested loops or, better yet, calls to functions) to put everything in the right order, but the basic constructs are there, and I hope they get you started and give you some ideas.
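
As a rough illustration of the "calls to functions" idea, the tagged lines of one record could be turned into a dictionary keyed by tag. This is only a sketch: `parse_record` is a hypothetical name, and it assumes the two-letter tag is separated from its value by whitespace and that a tag may repeat within a record (so values are kept in lists).

```python
from collections import defaultdict

def parse_record(lines):
    """Map each tag to the list of values found for it in one record."""
    fields = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)        # split off the tag at the first whitespace
        if len(parts) == 2:                # skip lines that have a tag but no value
            tag, value = parts
            fields[tag].append(value.strip())
    return dict(fields)

# Hypothetical record: the lines between 'PT J' and 'ER', already stripped.
record = ['AF Bob Smith', 'AF Jim Brown', 'TI Python For Dummies']
parsed = parse_record(record)
print(parsed['AF'])          # ['Bob Smith', 'Jim Brown']
print(parsed.get('DT'))      # None: this record has no DT tag
```

Using `.get()` for lookups sidesteps the missing-tag case noted in the question, where not every record carries every tag.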

Levon