I wonder if anyone can help? I am a beginner with Python, have been using pandas and sklearn to analyse well formatted csv files, but now I want to extract data for analysis from a tab delimited text file with more complex formatting and I'm not really sure where to start.
I've made a simple example of the file below. It is essentially time series data which captures changing quantities with time for objects within a hierarchy. So in the example below, the top level might be 'All Cars' under which would be sub groups such as 'soft tops' and also lowest level objects which would be individual cars. There are several header lines, the first always captures the level and name of quantity being tracked, but there are several other lines to carry units (typically 3 or 4) and the last header line can also contain the names of groups or of individual objects (eg 'Soft Tops' or 'BMW'). There are 10 columns across in the real data (5 shown in example) and around 500 dates per table (3 in example). The dates section repeats multiple times (always same dates) creating multiple tables (I show only a single repeat in the example). A typical file might therefore have 200K to 500K lines in total.
I want to extract particular columns and output to a csv file, where I can operate on them further.
An example of the format I'm dealing with is as follows:
File Name
Date All_Cars_MPG All_Cars_Doors Group_MPG Car_MPG
Units Units Units Units
Units Units Units Units
Units Units Soft Tops BMW
Line of tab separated spaces
01-NOV-2015 32.5 4 18.2 25
01-DEC-2015 30.5 4 15.8 22
01-JAN-2016 35.0 5 19.0 26
Line of spaces or tab separated spaces
File name (same as above)
Date Car_Doors Car_MPG Car_Doors Car_Speed
Units Units Units Units
Units Units Units Units
BMW AUDI AUDI NISSAN
Line of tab separated spaces
01-NOV-2015 5 35 2 250
01-DEC-2015 5 12 8 220
01-JAN-2016 6 19 0 260
I'm guessing I need to iterate over lines in the files and extract the columns I need to build up the CSV (perhaps by keying off the names of the lowest level items which in the example all begin with the letter 'C' for Car) but not sure how to set that up. Having found a string beginning with C in a header line, how best to append the object name (from the bottom header line) and then extract the data column below it? And having extracted one column, the next one I would need would of course be in a different position in the stack of tables, so can't use a fixed position in a list or dictionary. I don't expect a detailed solution, but if someone could highlight the general approach needed I'd be grateful. (Splitting? Dictionaries?)
I am particularly interested in the lowest level of the hierarchy, and the form of the csv would have the date on the left hand side and then as many columns as required to the right as follows :
Date,Car_MPG_BMW,Car_Doors_BMW,Car_MPG_AUDI,Car_Doors_AUDI
01-NOV-2015,25,5,35,2
01-DEC-2015,22,5,23,8
01-JAN-2016,26,6,29,0
EDIT TO ADD:
Here is an attempt to paste a more accurate version of an input file:
SUMMARY OF RUN
DATE WOPT WOPT WOPT WOPT WOPT WOPT WOPT WOPT WTHP
STB STB STB STB STB STB STB STB PSIA
*10**3
B1A B2 B3 B4 B5 B6 B7 B9 B1A
01-JAN-2046 403847.8 0 8613069. 18449.29 0 0 0 0 0
01-FEB-2046 403847.8 0 8633593. 18471.77 0 0 0 0 0
01-MAR-2046 403847.8 0 8652024. 18492.03 0 0 0 0 0
01-APR-2046 403847.8 0 8671890. 18514.38 0 0 0 0 0
01-MAY-2046 403847.8 0 8689601. 18535.93 0 0 0 0 0
01-JUN-2046 403847.8 0 8707051. 18558.15 0 0 0 0 0
01-JUL-2046 403847.8 0 8723709. 18579.61 0 0 0 0 0
01-AUG-2046 403847.8 0 8740806. 18601.75 0 0 0 0 0
01-SEP-2046 403847.8 0 8757767. 18623.84 0 0 0 0 0
01-OCT-2046 403847.8 0 8774027. 18645.17 0 0 0 0 0
01-NOV-2046 403847.8 0 8790653. 18667.15 0 0 0 0 0
01-DEC-2046 403847.8 0 8806563. 18688.37 0 0 0 0 0
01-JAN-2047 403847.8 0 8822815. 18710.24 0 0 0 0 0
SUMMARY OF RUN
DATE WTHP WTHP WTHP WTHP WTHP WTHP WTHP WTHP WTHP
PSIA PSIA PSIA PSIA PSIA PSIA PSIA PSIA PSIA
B2 B3 B4 B5 B6 B7 B9 B10 B16Z
01-JAN-2046 0 0 0 0 0 0 180.0000 0 0
01-FEB-2046 0 0 0 0 0 0 180.0000 0 0
01-MAR-2046 0 0 0 0 0 0 180.0000 0 0
01-APR-2046 0 0 0 0 0 0 180.0000 0 0
01-MAY-2046 0 0 0 0 0 0 180.0000 0 0
01-JUN-2046 0 0 0 0 0 0 180.0000 0 0
01-JUL-2046 0 0 0 0 0 0 180.0000 0 0
01-AUG-2046 0 0 0 0 0 0 180.0000 0 0
01-SEP-2046 0 0 0 0 0 0 180.0000 0 0
01-OCT-2046 0 0 0 0 0 0 180.0000 0 0
01-NOV-2046 0 0 0 0 0 0 180.0000 0 0
01-DEC-2046 0 0 0 0 0 0 180.0000 0 0
01-JAN-2047 0 0 0 0 0 0 180.0000 0 0
And here is a sample output file :
DATE,WOPT_B1A,WOPT_B2,WTHP_B1A,WTHP_B2
01-JAN-2046,403847.8,0,0,0
01-FEB-2046,403847.8,0,0,0
01-MAR-2046,403847.8,0,0,0
01-APR-2046,403847.8,0,0,0
01-MAY-2046,403847.8,0,0,0
01-JUN-2046,403847.8,0,0,0
01-JUL-2046,403847.8,0,0,0
01-AUG-2046,403847.8,0,0,0
01-SEP-2046,403847.8,0,0,0
01-OCT-2046,403847.8,0,0,0
01-NOV-2046,403847.8,0,0,0
01-DEC-2046,403847.8,0,0,0
01-JAN-2047,403847.8,0,0,0