0

I am trying to transform some data into a structured format and do a minor transformation. The source is a .csv file that is actually semi-structured that looks like this:

enter image description here

I would like the resulting data from output to look like this, and it is ok if the approach searches for the literal text 'Thing'

enter image description here

I am not partial to implementation but have looked all around petl and pandas with no success. How would you tackle this?

Mike G
  • 712
  • 5
  • 9

1 Answers1

1
import pandas as pd  
  
thing1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I']])
thing2 = pd.DataFrame([['Z', 'Y', 'X'], ['A', 'V', 'U'], ['M', 'L', 'P']])

thing1.insert(0, 'label', value = 'Thing1')
thing2.insert(0, 'label', value = 'Thing2')

thing1.append(thing2)

Out[17]: 
    label  0  1  2
0  Thing1  A  B  C
1  Thing1  D  E  F
2  Thing1  G  H  I
0  Thing2  Z  Y  X
1  Thing2  A  V  U
2  Thing2  M  L  P

EDIT TO ADDRESS COMMENT

There may be a more elegant way to do this (note the addition of reset_index here, which is necessary for subsequent slicing):

In [36]: thing3 = thing1.append(thing2).reset_index(drop = True)

In [37]: thing3
Out[37]: 
    label  0  1  2
0  Thing1  A  B  C # <-- slice from first 'A'
1  Thing1  D  E  F
2  Thing1  G  H  I
3  Thing2  Z  Y  X
4  Thing2  A  V  U # <-- to second 'A'
5  Thing2  M  L  P

In [38]: mask = thing3[0].between('A', 'A')

In [39]: mask
Out[39]: 
0     True
1    False
2    False
3    False
4     True
5    False
Name: 0, dtype: bool

In [40]: thing3[mask[mask].index[0]: mask[mask].index[1]]
Out[40]: 
    label  0  1  2
0  Thing1  A  B  C
1  Thing1  D  E  F
2  Thing1  G  H  I
3  Thing2  Z  Y  X

Or, if you'd like to include the last row, just add 1 to the slice:

In [41]: thing3[mask[mask].index[0]: mask[mask].index[1] + 1]
Out[41]: 
    label  0  1  2
0  Thing1  A  B  C
1  Thing1  D  E  F
2  Thing1  G  H  I
3  Thing2  Z  Y  X
4  Thing2  A  V  U
Jacob Miller
  • 516
  • 2
  • 5
  • 16
  • My knowledge is larger and I can work with this! Thanks, Jacob. If you have and insight into how to dynamically define the DataFrames based on the nature of the file. For example (bad example) "Thing 1's data frame consists of all the rows between the first row that starts with "thing" and the next encountered row that starts with thing. (and so on). If not, so what. I am very grateful for your kind knowledge share. – Mike G Feb 25 '21 at 02:38
  • @MikeG please accept if this answer is sufficient – Jacob Miller Mar 01 '21 at 15:22