
I have multiple timeseries that are outputs of various algorithms. These algorithms can have various parameters and they produce timeseries as a result:

timestamp1=1;
value1=5;
timestamp2=2;
value2=8;
timestamp3=3;
value3=4;
timestamp4=4;
value4=12;

resultsOfAlgorithms=[
{
'algorithm':'minmax',
'param-a':'12',
'param-b':'200',
'result-of-algorithm':[[timestamp1,value1],[timestamp2,value2]]
},
{
'algorithm':'minmax',
'param-a':'12',
'param-b':'30',
'result-of-algorithm':[[timestamp1,value1],[timestamp3,value3]]
},
{
'algorithm':'minmax',
'param-a':'12',
'param-b':'30',
'result-of-algorithm':[[timestamp2,value2],[timestamp4,value4]]
},
{
'algorithm':'delta',
'param-a':'12',
'param-b':'50',
'result-of-algorithm':[[timestamp2,value2],[timestamp4,value4]]
}
]

I would like to filter the timeseries by algorithm and parameters, and then plot the filtered timeseries to see how the given parameters affect the output. To do that I need to know all the values that occur for a given parameter, and then be able to select the timeseries with the desired parameters. E.g. I would like to plot all results of the minmax algorithm with param-b==30. Two results were produced with the minmax algorithm and param-b==30, so I would like a plot with 2 timeseries in it.

Is this possible with pandas or is this out of pandas functionality? How could this be implemented?

Edit: After searching the internet some more, I think I am looking for a way to use hierarchical indexing. Also, the timeseries should stay separate. Each result is an individual timeseries and should not be merged with any other result. I need to filter the results of the algorithms by the parameters used, and the result of the filter should still be a list of timeseries.

Edit 2: There are multiple sub-problems:

  1. Find all existing values for each parameter (user does not know all the values since parameters can be auto-generated by system)

  2. The user selects some of the values for filtering. One way the user could provide them is a dictionary (but more user-friendly ideas are welcome):

    filter={ 'param-b':[30,50], 'algorithm':'minmax' }

  3. Timeseries from resultsOfAlgorithms[1:3] (2nd and 3rd result) are given as a result of filtering, since these results were produced by the minmax algorithm and param-b was 30. Thus in this case the result is:

    [ [[timestamp1,value1],[timestamp3,value3]], [[timestamp2,value2],[timestamp4,value4]] ]

  4. The result of filtering will return multiple time series, which I want to plot and compare.

  5. user wants to try various filters to see how they affect results

I am doing all this in a Jupyter notebook, and I would like to let the user try various filters with the least hassle possible.

Timestamps are not necessarily shared between results. E.g. all timeseries might fall between 1 pm and 3 pm and have roughly the same number of values, but neither the timestamps nor the number of values are identical.

Marcel
  • Do you want to store the time series separately (in a list or, just as they are here, in `resultsOfAlgorithms`)? Or wouldn't it be better to keep them as columns in a data frame? If there is some natural, common timestamp space then the latter is more convenient. The columns would be hierarchically indexed and iterating over selected series would be probably much easier. – ptrj Jul 06 '16 at 01:04
  • What do you mean by "the timeseries should stay separated" in the context of wanting to filter by algorithm and parameters? Maybe an example of how your above data should look in the final result would be helpful. – Jeff Jul 06 '16 at 01:17
  • The second and third items in `resultsOfAlgorithms` have identical parameters but different `result-of-algorithm`, so timeseries are not uniquely determined by their parameters. Is it correct? As for a shared timestamp space, I thought a union of all timestamps might be sensible if timestamps of the series overlap. (The timeseries would contain NaN's for missing ones). – ptrj Jul 06 '16 at 16:49

1 Answer


There are two options here. The first is to clean up the dict and then convert it to a dataframe. The second is to convert it to a dataframe first and then clean up the column that holds the nested lists. For the first solution, you can restructure the dict like this:

import pandas as pd
from collections import defaultdict

# Flatten each result into one row per (timestamp, value) pair,
# repeating the algorithm name and its parameters on every row.
data = defaultdict(list)
for roa in resultsOfAlgorithms:
    for timestamp, value in roa['result-of-algorithm']:
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(timestamp)
        data['value'].append(value)

df = pd.DataFrame(data)

In [31]: df
Out[31]:
  algorithm param-a param-b  time  value
0    minmax      12     200     1      5
1    minmax      12     200     2      8
2    minmax      12      30     1      5
3    minmax      12      30     3      4
4    minmax      12      30     2      8
5    minmax      12      30     4     12
6     delta      12      50     2      8
7     delta      12      50     4     12

And from here you can do whatever analysis you need with it, whether it's plotting or making the time column the index or grouping and aggregating, and so on. You can compare this to making a dataframe first in this link:

Splitting a List inside a Pandas DataFrame

That answer does essentially the same thing, splitting a column of lists into multiple rows. I think fixing the dictionary will be easier, though, depending on how representative your fairly simple example is of the real data.
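As a rough sketch of the "whatever analysis you need" step, here is one way the first two sub-problems from the question (listing the values that occur for each parameter, then filtering by a dictionary) might look on the flattened frame. The names `param_cols`, `available`, `apply_filter` and `selected` are made up for illustration, and the frame is rebuilt inline only so the snippet runs on its own. It assumes the parameters are stored as strings, as in the question's dictionaries, so the filter values are compared as strings.

```python
import pandas as pd

# Long-format frame equivalent to `df` above (parameters are strings,
# exactly as in the question's dictionaries).
df = pd.DataFrame({
    'algorithm': ['minmax'] * 6 + ['delta'] * 2,
    'param-a': ['12'] * 8,
    'param-b': ['200', '200', '30', '30', '30', '30', '50', '50'],
    'time': [1, 2, 1, 3, 2, 4, 2, 4],
    'value': [5, 8, 5, 4, 8, 12, 8, 12],
})

# Hypothetical list of the columns that describe a run.
param_cols = ['algorithm', 'param-a', 'param-b']

# Sub-problem 1: every value that occurs for each parameter.
available = {col: sorted(df[col].unique()) for col in param_cols}


def apply_filter(frame, criteria):
    """Keep rows matching `criteria`: one scalar or list of values per column."""
    mask = pd.Series(True, index=frame.index)
    for col, wanted in criteria.items():
        if not isinstance(wanted, (list, tuple, set)):
            wanted = [wanted]
        # Compare as strings because the parameters are stored as strings.
        mask &= frame[col].astype(str).isin([str(w) for w in wanted])
    return frame[mask]


# Sub-problem 2: the filter dictionary from the question.
selected = apply_filter(df, {'param-b': [30, 50], 'algorithm': 'minmax'})
print(available)
print(selected)
```

Note that this keeps the rows of the two matching minmax runs in one frame. Telling them apart afterwards needs some extra identifier per run, because their parameters are identical.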

Edit: If you wanted to turn this into a multi-index, you can add one more line:

df_mi = df.set_index(['algorithm', 'param-a', 'param-b'])

In [25]: df_mi
Out[25]:
                           time  value
algorithm param-a param-b
minmax    12      200         1      5
                  200         2      8
                  30          1      5
                  30          3      4
                  30          2      8
                  30          4     12
delta     12      50          2      8
                  50          4     12
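With the multi-index in place, selection by parameter can probably be done with `xs` or with a boolean mask on one index level. The following is only a sketch under the same assumptions as above (parameters stored as strings). The frame is rebuilt inline so the snippet is self-contained, and `subset`, `either` and `param_b_values` are illustrative names.

```python
import pandas as pd

# Same long-format data as `df` above, rebuilt so this snippet runs alone.
df = pd.DataFrame({
    'algorithm': ['minmax'] * 6 + ['delta'] * 2,
    'param-a': ['12'] * 8,
    'param-b': ['200', '200', '30', '30', '30', '30', '50', '50'],
    'time': [1, 2, 1, 3, 2, 4, 2, 4],
    'value': [5, 8, 5, 4, 8, 12, 8, 12],
})

# Sorting the index is optional here but avoids performance warnings
# on label-based lookups against an unsorted MultiIndex.
df_mi = df.set_index(['algorithm', 'param-a', 'param-b']).sort_index()

# All values that occur on one index level.
param_b_values = df_mi.index.get_level_values('param-b').unique()

# Cross-section: minmax runs with param-b == '30'.
subset = df_mi.xs(('minmax', '30'), level=['algorithm', 'param-b'])

# Several allowed values on one level: build a boolean mask from that level.
either = df_mi[df_mi.index.get_level_values('param-b').isin(['30', '50'])]

print(param_b_values)
print(subset)
print(either)
```

`xs` drops the levels it selects on by default; passing `drop_level=False` keeps the full index if that is more convenient for later steps.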
Jeff
  • This seems to merge/concatenate the time series - something OP wanted to avoid – ptrj Jul 06 '16 at 01:06
  • That was edited into the original question after my answer. Regardless, if you do it this way you can just use `groupby` to examine the different time series independently. – Jeff Jul 06 '16 at 01:15
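The `groupby` idea from the comment above could be sketched like this. Since the second and third results share identical parameters, grouping on the parameters alone would merge them, so this sketch adds a `series_id` column while flattening. That column is an assumption of this sketch, not part of the original answer, and `results`, `rows`, `picked` and `series_list` are likewise illustrative names.

```python
import pandas as pd

# The question's data with the timestamp/value variables written out.
results = [
    {'algorithm': 'minmax', 'param-a': '12', 'param-b': '200',
     'result-of-algorithm': [[1, 5], [2, 8]]},
    {'algorithm': 'minmax', 'param-a': '12', 'param-b': '30',
     'result-of-algorithm': [[1, 5], [3, 4]]},
    {'algorithm': 'minmax', 'param-a': '12', 'param-b': '30',
     'result-of-algorithm': [[2, 8], [4, 12]]},
    {'algorithm': 'delta', 'param-a': '12', 'param-b': '50',
     'result-of-algorithm': [[2, 8], [4, 12]]},
]

# Flatten as in the answer, but tag every row with the position of the
# result it came from so runs with identical parameters stay distinct.
rows = []
for series_id, roa in enumerate(results):
    for timestamp, value in roa['result-of-algorithm']:
        rows.append({
            'series_id': series_id,
            'algorithm': roa['algorithm'],
            'param-a': roa['param-a'],
            'param-b': roa['param-b'],
            'time': timestamp,
            'value': value,
        })
df = pd.DataFrame(rows)

# Filter on the parameters, then split back into one Series per run.
picked = df[(df['algorithm'] == 'minmax') & (df['param-b'] == '30')]
series_list = [
    group.set_index('time')['value']
    for _, group in picked.groupby('series_id')
]

for series in series_list:
    print(series)
```

Each element of `series_list` is an independent Series indexed by its own timestamps, so calling `series.plot()` on each in turn (with matplotlib installed) should draw them on the same axes without requiring the timestamps to line up.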