Wrangling a time series data set in an awkward format from a text file to a Pandas dataframe in Python

Question

I have a .txt containing a time series data set, formatted in the following manner, as rows separated with \n:

N>New Section  
A>1, 2, 3  
L>Label_1  
G>1, 2, 3  
A>3, 2, 1  
G>3, 1, 1  
A>2, 2, 1

...many rows of G> and A> pairs, of varying number then a new L>Label section...

L>Label_1  
G>1, 2, 3  
A>3, 2, 1  
G>3, 1, 1  
A>2, 2, 1

...and sometimes a new N>New section...

N>New Section  
A>1, 2, 3  
L>Label_2  
G>1, 2, 3  
A>3, 2, 1  
G>3, 1, 1  
A>2, 2, 1

...repeated many times.

Each pair of G> and A> rows are a set of x, y, z measurements at a timestamp in a timeseries composed of hundreds of measurements, associated with the L>Label which came before them. There are 6 possible labels.

The desired output format, is a pandas dataframe with the following columns, preserving all the information in the data set.

Time, G_x, G_y, G_z, A_x, A_y, A_z, Label, Label_Measurement,Section
    0   1   2    3    3    2    1   Label_1   Label_1_1       1
    1   3   1    1    2    2    1   Label_1   Label_1_1       1  
    0   1   2    3    3    2    1   Label_1   Label_1_2       1  
    1   3   1    1    2    2    1   Label_1   Label_1_2       1  
    0   1   2    3    3    2    1   Label_2   Label_2_1       2  
    1   3   1    1    2    2    1   Label_2   Label_2_1       2

I have a few ideas of elements I think I'm going to need for the solution.

I can make some counters, to count each type of line in the file as they appear:

counter_New_section = 0  
counter_Label = 0  
counter_A_row = 0  
counter_G_row = 0

and I can iterate over each line of the file and use some conditionals to identify what type of line each line is and do the appropriate action with it:

with open(dataset.txt) as dataset:  
    for line in dataset:
        if line.startswith('N>') ... do the appropriate
        if line.startswith('L>')  ... "               "
        if line.startswith('G>')  ... "               "
        if line.startswith('A>')  ... "               "

...I'm struggling. What is the best approach here to achieve my desired dataframe output?
Thanks

Is there always exactly one `A>` row between each `N>New Section` and `L>Label_[...]`? — Peter Leimbigler, Oct 24 '18 at 15:32
Yes. Those rows are not part of the time series and I don't know what their significance is (no README for the data set is provided). I intended to disgard them in the first instance. — Robert Constable, Oct 24 '18 at 17:29

Wrangling a time series data set in an awkward format from a text file to a Pandas dataframe in Python

0 Answers0