Efficiently converting a list of dictionaries with embedded arrays to DataFrame

Question

I have a list of dictionaries, with spectral data inside the responses field. I also have array of wavelengths for labelling the columns for the spectral data. The list/input looks like this:

data = [  {
    'date': '2018-01-01',
    'measurement': 100,
    'responses': [(1, 1, np.array([1, 2, 3])),
                (2, 1, np.array([4, 5, 6])),
               ]
   },
   {
    'date': '2018-01-02',
    'measurement': 200,
    'responses': [(3, 1,np.array([5, 6, 7])),
                (4, 1, np.array([8, 9, 10])),
               ]
   },
]

And the column names for wavelengths to match:

wavelengths = [400,401,402]

I would like to convert this list to two pandas dataframes;

One where the array in the response tuples are averaged, and
One where they are separate, where the first to numbers in the tuple in responses are included.

The desired output for both is below:

__Average Dataframe__
index | date           | measurement | 400   | 401   | 402   |
0     | '2018-01-01'   | 100         | 2.5   | 3.5   | 4.5   |
1     | '2018-01-02'   | 200         | 6.5   | 7.5   | 8.5   |


__Seperate Dataframe__
index | date           | measurement | prong | scan| 400 | 401 | 402 |
0     | '2018-01-01'   | 100         | 1     | 1   | 1   | 2   | 3   |
1     | '2018-01-01'   | 100         | 2     | 1   | 4   | 5   | 6   |
2     | '2018-01-02'   | 200         | 3     | 1   | 5   | 6   | 7   |
3     | '2018-01-02'   | 200         | 4     | 1   | 8   | 9   | 10  |

What is the most efficient way to do this in pandas?

Please provide some reproducible input and expected output of it. — Chris, Mar 15 '19 at 04:22
For starters, you can provide a valid `dict`. It is just a format not a proper `dict`. And since having 1100 floats or 3 floats would require similar functionality, you can provide shorter `responses` with pseudo values, perhaps much shorter one, for much precise example. Going through [Minimal, Complete and Verifiable example](http://stackoverflow.com/help/mcve) can help you making a better question :) — Chris, Mar 15 '19 at 04:37
@Chris Thanks Chris, I've updated the question, hopefully, this is more readable! — Dave, Mar 15 '19 at 04:58

score 0 · Answer 1 · answered Mar 15 '19 at 05:38

This might not be the ideal solution, but using pandas:

import pandas as pd

wavelengths= [400, 401,402]
cols = ['prong', 'scan'] + wavelengths 

df1 = pd.concat([pd.DataFrame(dict([(k, pd.Series(v)) for k, v in d.items()])) for d in data])
df1 = df1.reset_index(drop=True).ffill()
df2 = pd.DataFrame([(x,y,*z) for x,y,z in df1['responses']], columns= cols)

df = pd.concat([df1.drop('responses', 1), df2], 1)
# Separate dataset
         date  measurement  prong  scan  400  401  402
0  2018-01-01        100.0      1     1    1    2    3
1  2018-01-01        100.0      2     1    4    5    6
2  2018-01-02        200.0      3     1    5    6    7
3  2018-01-02        200.0      4     1    8    9   10

df.groupby(['date', 'measurement'])[wavelengths].mean().reset_index()
# Average dataset
         date  measurement  400  401  402
0  2018-01-01        100.0  2.5  3.5  4.5
1  2018-01-02        200.0  6.5  7.5  8.5

Efficiently converting a list of dictionaries with embedded arrays to DataFrame

1 Answers1