I would appreciate comments on parsing a complex tab delimited text file using Python

Question

I wonder if anyone can help? I am a beginner with Python, have been using pandas and sklearn to analyse well formatted csv files, but now I want to extract data for analysis from a tab delimited text file with more complex formatting and I'm not really sure where to start.

I've made a simple example of the file below. It is essentially time series data which captures changing quantities with time for objects within a hierarchy. So in the example below, the top level might be 'All Cars' under which would be sub groups such as 'soft tops' and also lowest level objects which would be individual cars. There are several header lines, the first always captures the level and name of quantity being tracked, but there are several other lines to carry units (typically 3 or 4) and the last header line can also contain the names of groups or of individual objects (eg 'Soft Tops' or 'BMW'). There are 10 columns across in the real data (5 shown in example) and around 500 dates per table (3 in example). The dates section repeats multiple times (always same dates) creating multiple tables (I show only a single repeat in the example). A typical file might therefore have 200K to 500K lines in total.

I want to extract particular columns and output to a csv file, where I can operate on them further.

An example of the format I'm dealing with is as follows:

File Name
Date          All_Cars_MPG   All_Cars_Doors   Group_MPG   Car_MPG
              Units          Units            Units       Units
              Units          Units            Units       Units
              Units          Units            Soft Tops   BMW
Line of tab separated spaces
01-NOV-2015   32.5           4                18.2        25
01-DEC-2015   30.5           4                15.8        22
01-JAN-2016   35.0           5                19.0        26
Line of spaces or tab separated spaces
File name (same as above)
Date          Car_Doors     Car_MPG           Car_Doors   Car_Speed
              Units         Units             Units       Units
              Units         Units             Units       Units
              BMW           AUDI              AUDI        NISSAN
Line of tab separated spaces
01-NOV-2015   5             35                2           250
01-DEC-2015   5             12                8           220
01-JAN-2016   6             19                0           260

I'm guessing I need to iterate over lines in the files and extract the columns I need to build up the CSV (perhaps by keying off the names of the lowest level items which in the example all begin with the letter 'C' for Car) but not sure how to set that up. Having found a string beginning with C in a header line, how best to append the object name (from the bottom header line) and then extract the data column below it? And having extracted one column, the next one I would need would of course be in a different position in the stack of tables, so can't use a fixed position in a list or dictionary. I don't expect a detailed solution, but if someone could highlight the general approach needed I'd be grateful. (Splitting? Dictionaries?)

I am particularly interested in the lowest level of the hierarchy, and the form of the csv would have the date on the left hand side and then as many columns as required to the right as follows :

Date,Car_MPG_BMW,Car_Doors_BMW,Car_MPG_AUDI,Car_Doors_AUDI
01-NOV-2015,25,5,35,2
01-DEC-2015,22,5,23,8
01-JAN-2016,26,6,29,0

EDIT TO ADD:

Here is an attempt to paste a more accurate version of an input file:

    SUMMARY OF RUN                                                    
    DATE            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WTHP           
                    STB             STB             STB             STB             STB             STB             STB             STB             PSIA           
                                                                    *10**3                                                                                         
                    B1A             B2              B3              B4              B5              B6              B7              B9              B1A            

     01-JAN-2046     403847.8              0        8613069.        18449.29               0               0               0               0               0             
     01-FEB-2046     403847.8              0        8633593.        18471.77               0               0               0               0               0         
     01-MAR-2046     403847.8              0        8652024.        18492.03               0               0               0               0               0         
     01-APR-2046     403847.8              0        8671890.        18514.38               0               0               0               0               0         
     01-MAY-2046     403847.8              0        8689601.        18535.93               0               0               0               0               0         
     01-JUN-2046     403847.8              0        8707051.        18558.15               0               0               0               0               0         
     01-JUL-2046     403847.8              0        8723709.        18579.61               0               0               0               0               0         
     01-AUG-2046     403847.8              0        8740806.        18601.75               0               0               0               0               0         
     01-SEP-2046     403847.8              0        8757767.        18623.84               0               0               0               0               0         
     01-OCT-2046     403847.8              0        8774027.        18645.17               0               0               0               0               0         
     01-NOV-2046     403847.8              0        8790653.        18667.15               0               0               0               0               0         
     01-DEC-2046     403847.8              0        8806563.        18688.37               0               0               0               0               0         
     01-JAN-2047     403847.8              0        8822815.        18710.24               0               0               0               0               0         

    SUMMARY OF RUN                                                    
    DATE            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP           
                    PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA           
                    B2              B3              B4              B5              B6              B7              B9              B10             B16Z           

     01-JAN-2046            0              0               0               0               0               0        180.0000               0               0         
     01-FEB-2046            0              0               0               0               0               0        180.0000               0               0         
     01-MAR-2046            0              0               0               0               0               0        180.0000               0               0         
     01-APR-2046            0              0               0               0               0               0        180.0000               0               0         
     01-MAY-2046            0              0               0               0               0               0        180.0000               0               0         
     01-JUN-2046            0              0               0               0               0               0        180.0000               0               0         
     01-JUL-2046            0              0               0               0               0               0        180.0000               0               0         
     01-AUG-2046            0              0               0               0               0               0        180.0000               0               0         
     01-SEP-2046            0              0               0               0               0               0        180.0000               0               0         
     01-OCT-2046            0              0               0               0               0               0        180.0000               0               0         
     01-NOV-2046            0              0               0               0               0               0        180.0000               0               0         
     01-DEC-2046            0              0               0               0               0               0        180.0000               0               0         
     01-JAN-2047            0              0               0               0               0               0        180.0000               0               0

And here is a sample output file :

DATE,WOPT_B1A,WOPT_B2,WTHP_B1A,WTHP_B2
01-JAN-2046,403847.8,0,0,0
01-FEB-2046,403847.8,0,0,0
01-MAR-2046,403847.8,0,0,0
01-APR-2046,403847.8,0,0,0
01-MAY-2046,403847.8,0,0,0
01-JUN-2046,403847.8,0,0,0
01-JUL-2046,403847.8,0,0,0
01-AUG-2046,403847.8,0,0,0
01-SEP-2046,403847.8,0,0,0
01-OCT-2046,403847.8,0,0,0
01-NOV-2046,403847.8,0,0,0
01-DEC-2046,403847.8,0,0,0
01-JAN-2047,403847.8,0,0,0

Look into a `pandas` `DataFrame`. You will have to parse it line by line and if it will always follow the same format you can just splice with the `[:]` syntax or you can use regex. — Error - Syntactical Remorse, Apr 19 '19 at 20:28
That being said please paste your file WITHOUT any explanations in it. Show the exact file you are reading and the output you want. — Error - Syntactical Remorse, Apr 19 '19 at 20:29
Hi ESR, thanks. Whats the best way to post a section of the original file? To give a fair representation of the repeating tables I will need to paste around 3000 lines of text, each line 166 characters wide. I'm assuming I can't upload an attachment. If not I'll see if I can generate a shortened version and try to copy into my original message. — Hamish, Apr 19 '19 at 22:19
Try to find a minimal subset that meets your needs at paste that. You can always upload it to git hub but you should first try to find a minimal example. — Error - Syntactical Remorse, Apr 19 '19 at 22:34
Thanks ESR, some more output added above. I'm looking into some examples now using regex. — Hamish, Apr 19 '19 at 22:49

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

So... There is a decent amount of assumptions in this code but it works for the example you gave. It may not work for everything and it could be sped up in some spots but I don't think speed is the biggest issue and I think you can make the needed changes for the things that don't work.

Step 1:
We need to turn the .txt file to list of lists.

def get_tab_delimited_lines(file):
    lines = []
    with open(file, 'r') as f:
        for line in f.readlines():
                line = line.split('\t') # Split by \t (TAB)
                line = [x.strip() for x in line] # Remove white space
                lines.append(line)
    return lines

Step 2:
Separate the body (table) from the header (column) info.

import re # This should go at the top of the file
def get_header_and_body(lines):
    # Lets seperate the header info from the body
    header_info = [] # This is the list we will return for header info
    body = [] # This is the list we will return for body info
    temp_body = []
    temp_header_info = []
    header = True
    for line in lines:
        # If the first part of the line is a date
        # in the format [a few numbers]-[a few letterss]-[a few numbers]
        # Example: 01-JAN-2046
        if re.match(r'[0-9]+-[A-Z]+-[0-9]+', line[0]): # If a date then it is the body
            header = False
            temp_body.append(line[:-1]) # The last element is always an empty '' so remove it
        else: # Else this is header info
            header = True
            if temp_body: # Append the body if we have any
                body.append(temp_body)
            temp_body = [] # Reset the temp
        if header: # If this is a header
            # This is a set of the lines we dont need. If the line
            # starts with any of these we will ignore it.
            unwanted_starts_to_a_line = {'SUMMARY OF RUN', 'STB', '', 'PSIA'}
            # We will also ignore line with things such as *18**.
            if line and line[0] not in unwanted_starts_to_a_line and not re.match(r'\*[0-9]+\*\*', line[0]):
                temp_header_info.append(line)
        else:
            if temp_header_info:
                header_info.append(temp_header_info)
            temp_header_info = []

    if temp_body:
        body.append(temp_body)
    if temp_header_info:
        header_info.append(temp_header_info)
    return header_info, body

Step 3:
Now create the new column headers you want: I reverse the header_info because the date does not append with anything else. So I reverse both header rows, zip them together, then I reverse it back to the order we want.

def change_to_table_headers(header_info):
    for index in range(len(header_info)):
        # print(header_info[index]) # uncomment this to see why I did the `reversed`
        # and feel free to remove the `reversed` to see what breaks.
        t = list(zip(reversed(header_info[index][0]), reversed(header_info[index][1])))
        t.reverse()
        t = ['_'.join(x) for x in t]
        header_info[index] = ['DATE'] + t

Step 4:
Piece it all together:

import pandas as pd  # This should go at the top of the file

lines = get_tab_delimited_lines('test.txt')
header_info, body = get_header_and_body(lines)
change_to_table_headers(header_info)

for index in range(len(header_info)):
    
    headers = header_info[index]
    df = pd.DataFrame(body[index], columns=headers)
    print(df)

Now that it is in a dataframe to can send it straight to a csv or do anything you want with it.

Appendix:

I used the following in a test.txt document to test it.

SUMMARY OF RUN                                                    
DATE            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WTHP           
                STB             STB             STB             STB             STB             STB             STB             STB             PSIA           
                                                                *10**3                                                                                         
                B1A             B2              B3              B4              B5              B6              B7              B9              B1A            
                                                                                                                                                               
 01-JAN-2046     403847.8              0        8613069.        18449.29               0               0               0               0               0             
 01-FEB-2046     403847.8              0        8633593.        18471.77               0               0               0               0               0         
 01-MAR-2046     403847.8              0        8652024.        18492.03               0               0               0               0               0         
 01-APR-2046     403847.8              0        8671890.        18514.38               0               0               0               0               0         
 01-MAY-2046     403847.8              0        8689601.        18535.93               0               0               0               0               0         
 01-JUN-2046     403847.8              0        8707051.        18558.15               0               0               0               0               0         
 01-JUL-2046     403847.8              0        8723709.        18579.61               0               0               0               0               0         
 01-AUG-2046     403847.8              0        8740806.        18601.75               0               0               0               0               0         
 01-SEP-2046     403847.8              0        8757767.        18623.84               0               0               0               0               0         
 01-OCT-2046     403847.8              0        8774027.        18645.17               0               0               0               0               0         
 01-NOV-2046     403847.8              0        8790653.        18667.15               0               0               0               0               0         
 01-DEC-2046     403847.8              0        8806563.        18688.37               0               0               0               0               0         
 01-JAN-2047     403847.8              0        8822815.        18710.24               0               0               0               0               0         
                                                                                                                                       
SUMMARY OF RUN                                                    
DATE            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP           
                PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA           
                B2              B3              B4              B5              B6              B7              B9              B10             B16Z           
                                                                                                                                                               
 01-JAN-2046            0              0               0               0               0               0        180.0000               0               0         
 01-FEB-2046            0              0               0               0               0               0        180.0000               0               0         
 01-MAR-2046            0              0               0               0               0               0        180.0000               0               0         
 01-APR-2046            0              0               0               0               0               0        180.0000               0               0         
 01-MAY-2046            0              0               0               0               0               0        180.0000               0               0         
 01-JUN-2046            0              0               0               0               0               0        180.0000               0               0         
 01-JUL-2046            0              0               0               0               0               0        180.0000               0               0         
 01-AUG-2046            0              0               0               0               0               0        180.0000               0               0         
 01-SEP-2046            0              0               0               0               0               0        180.0000               0               0         
 01-OCT-2046            0              0               0               0               0               0        180.0000               0               0         
 01-NOV-2046            0              0               0               0               0               0        180.0000               0               0         
 01-DEC-2046            0              0               0               0               0               0        180.0000               0               0         
 01-JAN-2047            0              0               0               0               0               0        180.0000               0               0

Fantastic job ESR, exactly what I need to get me started. Beautifully laid out and explained. I'm very grateful. I'm working through line by line to understand every element, and then I can build on it. I copied out the text from your appendix to a new file for testing, am getting "AssertionError: 2 columns passed, passed data had 0 columns" at the moment, trying to figure that one out. — Hamish, Apr 20 '19 at 08:49
Use the data from you question. It could be when I coped it to So it removed the tab special characters. — Error - Syntactical Remorse, Apr 20 '19 at 14:03

I would appreciate comments on parsing a complex tab delimited text file using Python

1 Answers1

Appendix: