
I have an Excel workbook which contains more than 30 sheets for different parameters like BP, heart rate, etc.

One of the dataframes (df1, created from one sheet of the workbook) looks as shown below:

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1', 'H2Date', 'H2',
                                'H1Date', 'H1', 'H2Date', 'H2', 'H3Date', 'H3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})


Another dataframe (df2), from another sheet of the same file, can be generated using the code below:

df2 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['GluF1Date', 'GluF1', 'GluF2Date', 'GluF2', 'GluF1Date', 'GluF1',
                                'GluF2Date', 'GluF2', 'GluF1Date', 'GluF1', 'GluF2Date', 'GluF2',
                                'GluF3Date', 'GluF3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})

Similarly, there are more than 30 dataframes like this, with values of the same format (date and measurement value), but the column names (H1, GluF1, H1Date, H100, H100Date, GluF1Date, P1, PDate, UACRDate, UACR100, etc.) differ from sheet to sheet.
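Since every sheet has the same long layout (a date row followed by its measurement row), a single prefix-agnostic helper could reshape any of them. This is only a sketch under that alternating-rows assumption, and `tidy_sheet` is a made-up name:

```python
import pandas as pd

def tidy_sheet(df):
    """Reshape one sheet (person_id / level_1 / values) into one row per
    measurement, regardless of the column prefix (H, GluF, UACR, ...)."""
    is_date = df['level_1'].str.endswith('Date')  # date rows vs. measurement rows
    return pd.DataFrame({'person_id': df.loc[is_date, 'person_id'].to_numpy(),
                         'Date': df.loc[is_date, 'values'].to_numpy(),
                         'Value': df.loc[~is_date, 'values'].to_numpy()})

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1'],
                    'values': ['2006-10-30 00:00:00', '6.6',
                               '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9']})
tidy = tidy_sheet(df1)
```

Because it keys on the `Date` suffix rather than a fixed-width slice, it is indifferent to whether a sheet runs H1..H3 or H1..H100.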

What I have tried, based on an SO search, is shown below:

g = df1.level_1.str[-2:]  # extract column names
df1['lvl'] = df1.level_1.apply(lambda x: int(''.join(filter(str.isdigit, x))))  # extract the level's number
df1 = df1.pivot_table(index=['person_id', 'lvl'], columns=g, values='values', aggfunc='first')
final = df1.reset_index(level=1).drop(['lvl'], axis=1)

The above code does not give the expected output.

This doesn't work because g doesn't yield the same string (column name) for every record. My code would work if the substring extraction produced a uniform output, but since the labels are numbered sequences (H1, H2, ..., H100), the last two characters are not uniform.
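One way to make g uniform (a sketch, not code from the question) is to classify each row by whether its label ends in `Date`, and to pull the level number out with a regex so that H1, H10 and H100 are all handled alike:

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1', 'H2Date', 'H2',
                                'H1Date', 'H1', 'H2Date', 'H2', 'H3Date', 'H3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})

# Uniform labels: every row is either a 'Date' row or a 'Value' row
g = df1['level_1'].str.endswith('Date').map({True: 'Date', False: 'Value'})
# The level number is the digit run inside the label, whatever its length
df1['lvl'] = df1['level_1'].str.extract(r'(\d+)', expand=False).astype(int)
final = (df1.pivot_table(index=['person_id', 'lvl'], columns=g,
                         values='values', aggfunc='first')
            .reset_index(level=1).drop(columns='lvl'))
```

This keeps the pivot_table approach above intact; only the derivation of g and lvl changes.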

I expect my output to look as shown below for each dataframe. Please note that a person can have 3 records (H1..H3), 10 records (H1..H10), or 100 records (e.g. H1..H100). All are possible.

[screenshot of the expected output]

Updated screenshot:

[screenshot of the incorrect output]


2 Answers


Concat all even and all odd rows without using column names, then name the columns as needed:

res = pd.concat([df2.iloc[0::2,0:3:2].reset_index(drop=True), df2.iloc[1::2,2].reset_index(drop=True)], axis=1)
res.columns = ['Person_ID', 'Date', 'Value']

Output:

   Person_ID                 Date Value
0          1  2006-10-30 00:00:00   6.6
1          1  2006-08-30 00:00:00   4.6
2          2  2005-10-30 00:00:00   6.9
3          2  2016-11-30 00:00:00   6.6
4          3  2006-10-30 00:00:00   6.6
5          3  2006-11-30 00:00:00   8.6
6          3  2106-10-30 00:00:00  16.6
  • Can you help me understand what `0::2,0:3:2` does? I executed it but couldn't understand how the result is generated. I referred to the docs, but they only show a single slice like df2.iloc[0:2] – The Great Jul 30 '19 at 08:39
  • 1
    `0::2` = take every second row, beginning with 0, up to the end; `0:3:2` = take every second column up to and including index 2 (= 3-1; the 3 is redundant in this case as you have just 3 columns) – Stef Jul 30 '19 at 08:43
  • Sorry, I didn't get this. Is there any doc you can share that covers this combination of indexing/slicing? I understand it might be simple, but somehow I don't get it. We have 0:3:2, which drops the column `level_1`. I am able to understand `0::2` but not `0:3:2` – The Great Jul 30 '19 at 08:49
  • Okay, does the `:2` in `0:3:2` indicate that the 2nd column will be dropped? – The Great Jul 30 '19 at 08:52
  • `0:3:3` retains only the `person_id` column; the rest of the columns are dropped. I am confused only by this part – The Great Jul 30 '19 at 08:55
  • [iloc](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-position) doesn't give an example with 3 numbers, but there are examples at [slicing ranges](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#slicing-ranges). – Stef Jul 30 '19 at 08:55
  • there is no `0:3:3` in my answer – Stef Jul 30 '19 at 08:58
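For reference, the three-part slice discussed in these comments is plain `start:stop:step` positional slicing, which `iloc` supports just like Python lists do. A minimal illustration on toy data (not from the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3],
                   'b': [4, 5, 6, 7],
                   'c': [8, 9, 10, 11]})

even_rows = df.iloc[0::2]            # rows 0 and 2 (start 0, no stop, step 2)
cols_0_and_2 = df.iloc[:, 0:3:2]     # columns 0 and 2 (stop 3 is exclusive, step 2)
both = df.iloc[0::2, 0:3:2]          # rows 0 and 2 of columns 'a' and 'c'
```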

Here is one way using unstack() with a little modification:

  • Assign a dummy counter column `k` using `df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount()`.
  • Replace `level_1` with `df1.level_1.str[:2]`.
  • Set the index to `['person_id', 'level_1', 'k']` and unstack on the third index level.

m = (df1.assign(k=df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount(),
                level_1=df1.level_1.str[:2])
        .set_index(['person_id', 'level_1', 'k'])
        .unstack(2)
        .droplevel(1))
m.columns = ['Date', 'Values']
print(m)

                         Date Values
person_id                            
1          2006-10-30 00:00:00    6.6
1          2006-08-30 00:00:00    4.6
2          2005-10-30 00:00:00    6.9
2          2016-11-30 00:00:00    6.6
3          2006-10-30 00:00:00    6.6
3          2006-11-30 00:00:00    8.6
3          2106-10-30 00:00:00   16.6
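As the comments below note, `str[:2]` collapses H10 and H100 into H1. A possible tweak (my sketch, not part of the original answer) is to strip only the trailing `Date` suffix, so the full label H100 stays distinct as a grouping key:

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1', 'H2Date', 'H2',
                                'H1Date', 'H1', 'H2Date', 'H2', 'H3Date', 'H3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})

# Strip only the trailing 'Date' so 'H1', 'H10' and 'H100' remain distinct keys
key = df1['level_1'].str.replace(r'Date$', '', regex=True)
m = (df1.assign(k=df1.groupby(['person_id', key]).cumcount(), level_1=key)
        .set_index(['person_id', 'level_1', 'k'])
        .unstack(2)
        .droplevel(1))
m.columns = ['Date', 'Values']
```

The rest of the chain is unchanged; only the grouping key derivation differs.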
  • A question: when a person has H100, will this still work? We take only str[:2], and the columns don't always follow the same format. Anyway, I am trying it on my end as well – The Great Jul 30 '19 at 08:14
  • For certain records, it doesn't do the transform; it creates 22 columns, as shown in the screenshot above – The Great Jul 30 '19 at 08:21
  • Records are converted into 9 individual rows for person_id = 3 (because he only has up to H9), whereas for person_id = 4 it creates 9 rows up to H9, and then H10 starts again from the first row as a new column (H10 is cropped to H1). I think it works fine for H1..H9, but anything from H10 onwards causes the issue – The Great Jul 30 '19 at 08:28
  • The above comment is based on my real data. person_id = 3 has 9 data points and is correctly converted into 9 rows, whereas anything with more than 9 data points causes an issue – The Great Jul 30 '19 at 08:40
  • Can you help me with this post? https://stackoverflow.com/questions/57285680/dont-drop-all-nas-during-stack-operation-in-python – The Great Jul 31 '19 at 07:52