Using Python with semi-structured data, how to add a column value based on text encountered in preceding row

Question

I am trying to transform some data into a structured format and do a minor transformation. The source is a .csv file that is actually semi-structured that looks like this:

I would like the resulting data from output to look like this, and it is ok if the approach searches for the literal text 'Thing'

I am not partial to implementation but have looked all around petl and pandas with no success. How would you tackle this?

kindly share data not pics – sammywemmy Feb 24 '21 at 22:55 — sammywemmy, Feb 24 '21 at 22:55
What do you want to label the column names? – elPastor Feb 24 '21 at 23:43 — elPastor, Feb 24 '21 at 23:43

Jacob Miller · Accepted Answer · 2021-02-25T17:58:44.227

import pandas as pd  
  
thing1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I']])
thing2 = pd.DataFrame([['Z', 'Y', 'X'], ['A', 'V', 'U'], ['M', 'L', 'P']])

thing1.insert(0, 'label', value = 'Thing1')
thing2.insert(0, 'label', value = 'Thing2')

thing1.append(thing2)

Out[17]: 
    label  0  1  2
0  Thing1  A  B  C
1  Thing1  D  E  F
2  Thing1  G  H  I
0  Thing2  Z  Y  X
1  Thing2  A  V  U
2  Thing2  M  L  P

EDIT TO ADDRESS COMMENT

There may be a more elegant way to do this (note the addition of reset_index here, which is necessary for subsequent slicing):

In [36]: thing3 = thing1.append(thing2).reset_index(drop = True)

In [37]: thing3
Out[37]: 
    label  0  1  2
0  Thing1  A  B  C # <-- slice from first 'A'
1  Thing1  D  E  F
2  Thing1  G  H  I
3  Thing2  Z  Y  X
4  Thing2  A  V  U # <-- to second 'A'
5  Thing2  M  L  P

In [38]: mask = thing3[0].between('A', 'A')

In [39]: mask
Out[39]: 
0     True
1    False
2    False
3    False
4     True
5    False
Name: 0, dtype: bool

In [40]: thing3[mask[mask].index[0]: mask[mask].index[1]]
Out[40]: 
    label  0  1  2
0  Thing1  A  B  C
1  Thing1  D  E  F
2  Thing1  G  H  I
3  Thing2  Z  Y  X

Or, if you'd like to include the last row, just add 1 to the slice:

In [41]: thing3[mask[mask].index[0]: mask[mask].index[1] + 1]
Out[41]: 
    label  0  1  2
0  Thing1  A  B  C
1  Thing1  D  E  F
2  Thing1  G  H  I
3  Thing2  Z  Y  X
4  Thing2  A  V  U

My knowledge is larger and I can work with this! Thanks, Jacob. If you have and insight into how to dynamically define the DataFrames based on the nature of the file. For example (bad example) "Thing 1's data frame consists of all the rows between the first row that starts with "thing" and the next encountered row that starts with thing. (and so on). If not, so what. I am very grateful for your kind knowledge share. — Mike G, Feb 25 '21 at 02:38

Using Python with semi-structured data, how to add a column value based on text encountered in preceding row

1 Answers1