
Using the same example from here, but changing the 'A' column to something that can easily be grouped by:

import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df["A"] = pd.Series([1]*3 + [2]*8)
df.head()

whose output now is:

         Date  A       B       C      D      E      F      G
0  2008-03-18  1  164.93  114.73  26.27  19.21  28.87  63.44
1  2008-03-19  1  164.89  114.75  26.22  19.07  27.76  59.98
2  2008-03-20  1  164.63  115.04  25.78  19.01  27.04  59.61
3  2008-03-25  2  163.92  114.85  27.41  19.61  27.84  59.41
4  2008-03-26  2  163.45  114.84  26.86  19.53  28.02  60.09
5  2008-03-27  2  163.46  115.40  27.09  19.72  28.25  59.62
6  2008-03-28  2  163.22  115.56  27.13  19.63  28.24  58.65

Doing the cumulative sums (code from the linked question) works well when we're assuming it's a single list:

# Put your inputs into a single list
input_cols = ["B", "C"]
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
# Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors1'] = df["single_input_vector"].cumsum()
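To see why the double encapsulation matters, here is a minimal self-contained check (toy values standing in for the B/C columns, since the pattern is identical): on an object-dtype Series, `cumsum()` uses `+`, and for lists `+` concatenates, so each row accumulates all previous row-vectors.

```python
import pandas as pd

# Toy stand-in for df["single_input_vector"]: each row is
# double-encapsulated as [[b, c]] so cumsum() concatenates
# lists instead of adding numbers element-wise
s = pd.Series([[[1.0, 5.0]], [[2.0, 6.0]], [[3.0, 7.0]]])

cum = s.cumsum()  # list + list concatenates on an object-dtype Series
print(cum.iloc[2])  # [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]]
```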

but how do I cumsum the lists in this case, grouped by 'A'? I expected this to work, but it doesn't:

df['cumu'] = df.groupby("A")["single_input_vector"].apply(lambda x: list(x)).cumsum()

Instead of [[164.93, 114.73, 26.27], [164.89, 114.75, 26.... I get some rows filled in while others are NaN. This is what I want (cols [B, C] accumulated into groups of col A):

   A    cumu
0  1    [[164.93,114.73], [164.89,114.75], [164.63,115.04]]
1  2    [[163.92,114.85], [163.45,114.84], [163.46,115.40], [163.22,115.56]]

Also, how do I do this efficiently? My dataset is quite big (about 2 million rows).

scc
  • For context, imagine that 'A' is the user_id and the sequences are the sequences of events done by the user, which need to be grouped before being fed into a keras lstm model. – scc Nov 24 '19 at 01:59
  • Can you give an expected output for the 7 rows in your df – Kenan Nov 24 '19 at 03:28
  • @ksooklall I just did. Thanks for any help. – scc Nov 24 '19 at 10:31

1 Answer


It doesn't look like you're doing an arithmetic sum; it's more like a concat along axis=1.

First, groupby and concatenate:

temp_series = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])])

A
1    [[164.93, 114.73], [164.89, 114.75], [164.63, ...
2    [[163.92, 114.85], [163.45, 114.84], [163.46, ...

then convert back to a DataFrame:

df = temp_series.reset_index().rename(columns={0: 'cumsum'})

Or in one line:

df = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])]).reset_index().rename(columns={0: 'cumsum'})
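Note this gives one row per group. If you instead want the question's per-row running list (each row holds the cumulative vectors of its group so far), one option — a sketch on toy data, assuming a pandas version that supports `group_keys` in `groupby` — is to run the concatenating object-dtype `cumsum` inside each group:

```python
import pandas as pd

# Toy frame mimicking the question's layout
df = pd.DataFrame({"A": [1, 1, 2, 2],
                   "B": [1.0, 2.0, 3.0, 4.0],
                   "C": [5.0, 6.0, 7.0, 8.0]})

# Double-encapsulate each row's [B, C] pair, as in the question
df["single_input_vector"] = [[[b, c]] for b, c in zip(df["B"], df["C"])]

# cumsum concatenates lists; restarting it per group gives the running
# list within each 'A'. group_keys=False keeps the original row index
# so the result aligns back into df on assignment.
df["cumu"] = (df.groupby("A", group_keys=False)["single_input_vector"]
                .apply(lambda s: s.cumsum()))

print(df.loc[3, "cumu"])  # [[3.0, 7.0], [4.0, 8.0]]
```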

Kenan