
I have an Excel workbook which contains more than 30 sheets for different parameters like BP, heart rate, etc.

One of the dataframes (df1, created from one sheet of the workbook) looks as shown below:

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1', 'H2Date', 'H2',
                                'H1Date', 'H1', 'H2Date', 'H2', 'H3Date', 'H3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})


Another dataframe (df2), from another sheet of the same file, can be generated using the code below:

df2 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['GluF1Date', 'GluF1', 'GluF2Date', 'GluF2', 'GluF1Date', 'GluF1',
                                'GluF2Date', 'GluF2', 'GluF1Date', 'GluF1', 'GluF2Date', 'GluF2',
                                'GluF3Date', 'GluF3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})

Similarly, there are more than 30 dataframes like this, with values of the same format (date and measurement value), but the column names (H1, GluF1, H1Date, H100, H100Date, GluF1Date, P1, PDate, UACRDate, UACR100, etc.) differ from sheet to sheet.
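Since every sheet has the same long layout (a date row followed by its measurement row), a single prefix-agnostic helper could reshape any of them. This is only a sketch under that alternating-rows assumption, and `tidy_sheet` is a made-up name:

```python
import pandas as pd

def tidy_sheet(df):
    """Reshape one sheet (person_id / level_1 / values) into one row per
    measurement, regardless of the column prefix (H, GluF, UACR, ...)."""
    is_date = df['level_1'].str.endswith('Date')  # date rows vs. measurement rows
    return pd.DataFrame({'person_id': df.loc[is_date, 'person_id'].to_numpy(),
                         'Date': df.loc[is_date, 'values'].to_numpy(),
                         'Value': df.loc[~is_date, 'values'].to_numpy()})

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1'],
                    'values': ['2006-10-30 00:00:00', '6.6',
                               '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9']})
tidy = tidy_sheet(df1)
```

Because it keys on the `Date` suffix rather than a fixed-width slice, it is indifferent to whether a sheet runs H1..H3 or H1..H100.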

What I have tried, based on an SO search, is shown below:

g = df1.level_1.str[-2:]  # extract column names
df1['lvl'] = df1.level_1.apply(lambda x: int(''.join(filter(str.isdigit, x))))  # extract the level's number
df1 = df1.pivot_table(index=['person_id', 'lvl'], columns=g, values='values', aggfunc='first')
final = df1.reset_index(level=1).drop(['lvl'], axis=1)

The above code does not give the expected output.

This doesn't work because g doesn't yield the same string (column name) for every record. My code would work if the substring extraction produced a uniform output, but since the labels are numbered sequences (H1, H2, ..., H100), the last two characters are not uniform.
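One way to make g uniform (a sketch, not code from the question) is to classify each row by whether its label ends in `Date`, and to pull the level number out with a regex so that H1, H10 and H100 are all handled alike:

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1', 'H2Date', 'H2',
                                'H1Date', 'H1', 'H2Date', 'H2', 'H3Date', 'H3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})

# Uniform labels: every row is either a 'Date' row or a 'Value' row
g = df1['level_1'].str.endswith('Date').map({True: 'Date', False: 'Value'})
# The level number is the digit run inside the label, whatever its length
df1['lvl'] = df1['level_1'].str.extract(r'(\d+)', expand=False).astype(int)
final = (df1.pivot_table(index=['person_id', 'lvl'], columns=g,
                         values='values', aggfunc='first')
            .reset_index(level=1).drop(columns='lvl'))
```

This keeps the pivot_table approach above intact; only the derivation of g and lvl changes.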

I expect my output to look as shown below for each dataframe. Please note that a person can have 3 records (H1..H3), 10 records (H1..H10), or 100 records (e.g. H1..H100). All are possible.

[screenshot of the expected output]

Updated screenshot:

[screenshot of the incorrect output]


2 Answers


Concat all even and all odd rows without using column names, then name the columns as needed:

res = pd.concat([df2.iloc[0::2,0:3:2].reset_index(drop=True), df2.iloc[1::2,2].reset_index(drop=True)], axis=1)
res.columns = ['Person_ID', 'Date', 'Value']

Output:

   Person_ID                 Date Value
0          1  2006-10-30 00:00:00   6.6
1          1  2006-08-30 00:00:00   4.6
2          2  2005-10-30 00:00:00   6.9
3          2  2016-11-30 00:00:00   6.6
4          3  2006-10-30 00:00:00   6.6
5          3  2006-11-30 00:00:00   8.6
6          3  2106-10-30 00:00:00  16.6
  • Can you help me understand what `0::2,0:3:2` does? I executed it but couldn't understand how the result is generated. I referred to the docs, but they only show a single slice like df2.iloc[0:2] – The Great Jul 30 '19 at 08:39
  • 1
    `0::2` = take every second row, beginning with 0, up to the end; `0:3:2` = take every second column up to and including index 2 (= 3-1; the 3 is redundant in this case as you have just 3 columns) – Stef Jul 30 '19 at 08:43
  • Sorry, I didn't get this. Is there any doc you can share that covers this combination of indexing/slicing? I understand it might be simple, but somehow I don't get it. We have 0:3:2, which drops the column `level_1`. I am able to understand `0::2` but not `0:3:2` – The Great Jul 30 '19 at 08:49
  • Okay, does the `:2` in `0:3:2` indicate that the 2nd column will be dropped? – The Great Jul 30 '19 at 08:52
  • `0:3:3` retains only the `person_id` column; the rest of the columns are dropped. I am confused only by this part – The Great Jul 30 '19 at 08:55
  • [iloc](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-position) doesn't give an example with 3 numbers, but there are examples at [slicing ranges](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#slicing-ranges). – Stef Jul 30 '19 at 08:55
  • there is no `0:3:3` in my answer – Stef Jul 30 '19 at 08:58
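For reference, the three-part slice discussed in these comments is plain `start:stop:step` positional slicing, which `iloc` supports just like Python lists do. A minimal illustration on toy data (not from the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3],
                   'b': [4, 5, 6, 7],
                   'c': [8, 9, 10, 11]})

even_rows = df.iloc[0::2]            # rows 0 and 2 (start 0, no stop, step 2)
cols_0_and_2 = df.iloc[:, 0:3:2]     # columns 0 and 2 (stop 3 is exclusive, step 2)
both = df.iloc[0::2, 0:3:2]          # rows 0 and 2 of columns 'a' and 'c'
```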

Here is one way using unstack() with a little modification:

  • Assign a dummy counter column `k` using `df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount()`.
  • Replace `level_1` with `df1.level_1.str[:2]`.
  • Set the index to `['person_id', 'level_1', 'k']` and unstack on the third index level.

m = (df1.assign(k=df1.groupby(['person_id', df1.level_1.str[:2]]).cumcount(),
                level_1=df1.level_1.str[:2])
        .set_index(['person_id', 'level_1', 'k'])
        .unstack(2)
        .droplevel(1))
m.columns = ['Date', 'Values']
print(m)

                         Date Values
person_id                            
1          2006-10-30 00:00:00    6.6
1          2006-08-30 00:00:00    4.6
2          2005-10-30 00:00:00    6.9
2          2016-11-30 00:00:00    6.6
3          2006-10-30 00:00:00    6.6
3          2006-11-30 00:00:00    8.6
3          2106-10-30 00:00:00   16.6
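As the comments below note, `str[:2]` collapses H10 and H100 into H1. A possible tweak (my sketch, not part of the original answer) is to strip only the trailing `Date` suffix, so the full label H100 stays distinct as a grouping key:

```python
import pandas as pd

df1 = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                    'level_1': ['H1Date', 'H1', 'H2Date', 'H2', 'H1Date', 'H1', 'H2Date', 'H2',
                                'H1Date', 'H1', 'H2Date', 'H2', 'H3Date', 'H3'],
                    'values': ['2006-10-30 00:00:00', '6.6', '2006-08-30 00:00:00', '4.6',
                               '2005-10-30 00:00:00', '6.9', '2016-11-30 00:00:00', '6.6',
                               '2006-10-30 00:00:00', '6.6', '2006-11-30 00:00:00', '8.6',
                               '2106-10-30 00:00:00', '16.6']})

# Strip only the trailing 'Date' so 'H1', 'H10' and 'H100' remain distinct keys
key = df1['level_1'].str.replace(r'Date$', '', regex=True)
m = (df1.assign(k=df1.groupby(['person_id', key]).cumcount(), level_1=key)
        .set_index(['person_id', 'level_1', 'k'])
        .unstack(2)
        .droplevel(1))
m.columns = ['Date', 'Values']
```

The rest of the chain is unchanged; only the grouping key derivation differs.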
  • A question: when a person has H100, will this still work? We take only str[:2], and the columns don't always follow the same format. Anyway, I am trying it on my end as well – The Great Jul 30 '19 at 08:14
  • For certain records, it doesn't do the transform; it creates 22 columns, as shown in the screenshot above – The Great Jul 30 '19 at 08:21
  • Records are converted into 9 individual rows for person_id = 3 (because he only has up to H9), whereas for person_id = 4 it creates 9 rows up to H9, and then H10 starts again from the first row as a new column (H10 is cropped to H1). I think it works fine for H1..H9, but anything from H10 onwards causes the issue – The Great Jul 30 '19 at 08:28
  • The above comment is based on my real data. person_id = 3 has 9 data points and is correctly converted into 9 rows, whereas anything with more than 9 data points causes an issue – The Great Jul 30 '19 at 08:40
  • Can you help me with this post? https://stackoverflow.com/questions/57285680/dont-drop-all-nas-during-stack-operation-in-python – The Great Jul 31 '19 at 07:52