1

I have a pandas dataframe:

import pandas as pd
import numpy as np
d={'ID':['A1','A1','A2','A2','A2'], 'date':['Jan 1','Jan7','Jan4','Jan5','Jan12'],'value':[10,12,3,5,2]}
df=pd.DataFrame(data=d)
df

    ID  date    value
0   A1  Jan 1   10
1   A1  Jan7    12
2   A2  Jan4    3
3   A2  Jan5    5
4   A2  Jan12   2
...
An

I would like to reshape it so that the date column becomes the main column, ranging from min(date) to max(date) sequentially (in this case Jan 1 to Jan 12). Each extra column would be a separate ID, and the values would correspond to the respective dates. I would also like to 'fill' the blanks with np.linspace(value at first date, value at next date, number of days), and any date from the first date to the first entry remains constant.

In the end, I would like:

f={'date':['Jan1','Jan2','Jan3','Jan4','Jan5','Jan6','Jan7','Jan8','Jan9','Jan10','Jan11','Jan12'],
  'A1':[10,10.3,10.7,11,11.3,11.7,12,12,12,12,12,12],
  'A2':[3,3,3,3,5,4.6,4.1,3.7,3.3,2.9,2.4,2]}
df2=pd.DataFrame(data=f)
df2

    date    A1      A2 ... An
0   Jan1    10.0    3.0
1   Jan2    10.3    3.0
2   Jan3    10.7    3.0
3   Jan4    11.0    3.0
4   Jan5    11.3    5.0
5   Jan6    11.7    4.6
6   Jan7    12.0    4.1
7   Jan8    12.0    3.7
8   Jan9    12.0    3.3
9   Jan10   12.0    2.9
10  Jan11   12.0    2.4
11  Jan12   12.0    2.0

I tried used numpy.linspace to generate the sequence, but when I try to append the extra values I get no value:

A2 = [np.linspace(10,12,10)].append([[12]*4])

and

np.linspace(10,12,8).append([[12]*5])

gives me:

AttributeError: 'numpy.ndarray' object has no attribute 'append'

I was thinking of looping through specific elements, but do not know how to join it all together at the end.

Any suggestions.

frank
  • 3,036
  • 7
  • 33
  • 65

1 Answers1

2

IIUC, it's pivot and interpolate:

df['date'] = pd.to_datetime(df['date'], format='%b%d')

(df.pivot(index='date', columns='ID',values='value')
   .asfreq('D')
   .interpolate()
   .bfill()
   .reset_index()
)

Output:

ID       date         A1        A2
0  1900-01-01  10.000000  3.000000
1  1900-01-02  10.333333  3.000000
2  1900-01-03  10.666667  3.000000
3  1900-01-04  11.000000  3.000000
4  1900-01-05  11.333333  5.000000
5  1900-01-06  11.666667  4.571429
6  1900-01-07  12.000000  4.142857
7  1900-01-08  12.000000  3.714286
8  1900-01-09  12.000000  3.285714
9  1900-01-10  12.000000  2.857143
10 1900-01-11  12.000000  2.428571
11 1900-01-12  12.000000  2.000000

Note that you want to convert to datetime since you will interpolate the values.

If you choose to convert it back, you can use:

df.index = df.index.strftime('%b%d')

Output:

ID            A1        A2
Jan01  10.000000  3.000000
Jan02  10.333333  3.000000
Jan03  10.666667  3.000000
Jan04  11.000000  3.000000
Jan05  11.333333  5.000000
Jan06  11.666667  4.571429
Jan07  12.000000  4.142857
Jan08  12.000000  3.714286
Jan09  12.000000  3.285714
Jan10  12.000000  2.857143
Jan11  12.000000  2.428571
Jan12  12.000000  2.000000
Erfan
  • 40,971
  • 8
  • 66
  • 78
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74