6

Given a DataFrame with the following structure:

Date     | Site  | Measurement Type | Value
-----------------------------------------------
1/1/2020 | A     | Temperature      | 32.3
1/2/2020 | B     | Humidity         | 70%

I would like to create a 3D "pivot table" where the first axis represents site, the second represents date, the third represents measurement type, and values are stored in each element.

For example, if I had daily measurements for one week at 5 sites, measuring both Temperature and Humidity, the desired output would be an array with shape (5, 7, 2).

Pandas only seems to support creating 2D pivot tables, but I'm happy with just an unlabeled 3D numpy array as output. Wondering if there's an existing easy way to do this before I spend time implementing it myself.

LoLa
  • 1,260
  • 3
  • 12
  • 17
  • 5
    This sounds like basic pivotting with MultiIndex. Thats why it's important that you post your expected output as well, since you can approach this problem in multiple ways. – Erfan Dec 31 '19 at 17:08
  • something like `pd.crosstab(df.Date,columns=[df['Measurement Type'],df['Site']],values=df['Value'],aggfunc='first')` ? – Umar.H Dec 31 '19 at 17:10
  • @Erfan see edit - does that clarify the desired output structure? – LoLa Dec 31 '19 at 17:16
  • @LoLa IIUC, They meant an actual dataframe how the expected output looks like – anky Dec 31 '19 at 17:17
  • @anky_91 it's not a DataFrame though, just an array of values with those dimensions. – LoLa Dec 31 '19 at 17:19
  • 1
    You are making this more difficult than it is in my opinion. 3d arrays are hard to read. Pandas has solved this with `MultiIndex` which you can use to achieve multi dimensionality. Try this code: `df.pivot_table(index='Date', columns=['Site', 'Measurement Type'], values='Value', aggfunc=lambda x: x)` – Erfan Dec 31 '19 at 17:30
  • 1
    @Erfan I need the 3D structure as I am transforming the data for RNN training in Tensorflow. See the [diagram in this tutorial for the desired structure](https://www.tensorflow.org/tutorials/structured_data/time_series) I agree for readability a MultiIndex is superior. – LoLa Dec 31 '19 at 17:45

3 Answers3

4

It is doable using df.pivot_table. I added one more row to your sample to have both Measurement Type. On missing values, it will be represented by np.nan

sample `df`

       Date Site Measurement_Type Value
0  1/1/2020    A      Temperature  32.3
1  1/1/2020    A         Humidity   60%
2  1/2/2020    B         Humidity   70%

Try the followings

iix = pd.MultiIndex.from_product([np.unique(df.Date), np.unique(df.Measurement_Type)])
df_pivot = (df.pivot_table('Value', 'Site', ['Date', 'Measurement_Type'], aggfunc='first')
              .reindex(iix, axis=1))
arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
                       .to_numpy().tolist())

print(arr)

Out[1447]:
array([[['60%', '32.3'],
        [nan, nan]],

       [[nan, nan],
        ['70%', nan]]], dtype=object)

Method 2: using pivot_table on different columns and numpy reshape

iix_n = pd.MultiIndex.from_product([np.unique(df.Site), np.unique(df.Date)])
arr = (df.pivot_table('Value', ['Site', 'Date'], 'Measurement_Type', aggfunc='first')
         .reindex(iix_n).to_numpy()
         .reshape(df.Site.nunique(),df.Date.nunique(),-1))

Out[1501]:
array([[['60%', '32.3'],
        [nan, nan]],

       [[nan, nan],
        ['70%', nan]]], dtype=object)
Andy L.
  • 24,909
  • 4
  • 17
  • 29
  • This is exactly what I needed, though I wish it were possible in a simpler way. Thank you. I like how this can handle the missing data too. – LoLa Dec 31 '19 at 18:45
  • Natively, `pivot_table` doesn't support 3D-array output, so we need to do more processing to achieve it. Glad I could help :) – Andy L. Dec 31 '19 at 18:51
  • 1
    @LoLa: I added another method using `pivot_table` and `reshape`. It is simpler than the first one. Give it a try if you wan a simpler solution. Notice the different pivoted columns and different `reindex` – Andy L. Dec 31 '19 at 19:48
  • 1
    neat, I like this. The resulting array will be sorted across each axis (based on the original indices) correct? – LoLa Dec 31 '19 at 21:07
  • ...Yes, it is :) – Andy L. Dec 31 '19 at 21:24
1

I think what you are looking for is something like a panel. You could also just use a 3-dim numpy array. For example using panel:

p_dim = {}

# desired columns
cols = ['Site', 'Measurement Type']

for date in df.Date:
    sub_df = df[df.Date.isin([date])].reset_index(drop=True)
    p_dim[date] = sub_df[[c for c in sub_df.columns if c in cols]]

panel = pd.Panel(p_dim)

Now you can access the various data associated with a date using panel['1/1/2020'] assuming your Date column is type str. To see all available keys you can use panel.keys().

Denver
  • 629
  • 4
  • 6
  • 1
    I'm trying to avoid using `panel` because it is deprecated according to the pandas docs: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Panel.html – LoLa Dec 31 '19 at 17:44
  • Your solution is easy enough to adapt for what I need though, so I'll probably end up doing something similar if there is indeed no equivalent of `pivot_table()` available. – LoLa Dec 31 '19 at 17:54
1

I did a small script to measure perf of the different @Andy L. approaches.

Second method seems slightly faster :

def pivot_table_3d_1(df, col1, col2, col3, value, aggfunc='first') :
    iix = pd.MultiIndex.from_product([np.unique(df[col2]), np.unique(df[col3])])
    df_pivot = (df.pivot_table(value, col1, [col2,col3], aggfunc=aggfunc)
                  .reindex(iix, axis=1))
    arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
                           .to_numpy().tolist())

    return arr

def pivot_table_3d_2(df, col1, col2, col3, value, aggfunc='first') :
    iix_n = pd.MultiIndex.from_product([np.unique(df[col1]), np.unique(df[col2])])
    arr = (df.pivot_table(value, [col1, col2], col3, aggfunc=aggfunc)
         .reindex(iix_n).to_numpy()
         .reshape(df[col1].nunique(),df[col2].nunique(),-1))

    return arr

## TESTING
N1 = 100
N2 = 200
N3 = 300
df = pd.DataFrame({'col1': np.random.randint(0, N1, N1*N2*N3),
                   'col2': np.random.randint(0, N2, N1*N2*N3),
                   'col3': np.random.randint(0, N3, N1*N2*N3),
                   'value': np.random.normal(0,1,N1*N2*N3)})

%timeit pivot_table_3d(df, col1='col1', col2='col2', col3='col3', value='value')
# 10.2 s ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pivot_table_3d_2(df, col1='col1', col2='col2', col3='col3', value='value')
#9.47 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Jean Lescut
  • 1,387
  • 1
  • 12
  • 17