If you want to stay in pure pandas, you can reach for a slightly tricky groupby and apply, which boils down to a one-liner if you don't count the column rename.
In [1]: import pandas as pd
In [2]: d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}
In [3]: df = pd.DataFrame(d)
In [4]: df.head()
Out[4]:
date ts
0 4/1/11 [2012-02-29 00:00:00, 2012-03-31 00:00:00, 201...
1 4/2/11 [2014-01-31 00:00:00]
In [5]: df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0])).reset_index().drop('level_1', axis=1)
In [6]: df_new.columns = ['date','ts']
In [7]: df_new.head()
Out[7]:
date ts
0 4/1/11 2012-02-29
1 4/1/11 2012-03-31
2 4/1/11 2012-04-25
3 4/1/11 2012-06-30
4 4/2/11 2014-01-31
Since the goal is to take the value of one column (in this case date) and repeat it for every row you create from the list, it helps to think in terms of pandas indexing.
We want the date to become the single index for the new rows, so we use groupby, which moves the desired row value into the index. Then, inside that operation, I want to split up only the list belonging to that date, which is what apply does for us.
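As a quick illustration (this call is mine, not part of the original one-liner), whatever apply returns for a group is keyed by that group's date:

# each group's result comes back indexed by its date
df.groupby('date').ts.apply(len)
# date
# 4/1/11    1
# 4/2/11    1
# Name: ts, dtype: int64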
What I'm passing apply is a pandas Series consisting of a single list; .values turns that one-row Series into a one-element array, and [0] pulls the list back out.
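A minimal sketch of that extraction step, with a made-up two-element list standing in for what one group actually holds:

x = pd.Series([[pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31')]])
x.values     # a one-element object array wrapping the list
x.values[0]  # the plain Python list of Timestamps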
To turn the list into a set of rows that get handed back under the indexed date, I can just make it a DataFrame. This incurs the penalty of picking up an extra integer index, but we end up dropping that. We could make the list values the index instead, but that would preclude duplicate values.
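Continuing the sketch above, the DataFrame constructor gives one row per Timestamp, plus the throwaway 0..n-1 index just mentioned:

pd.DataFrame(x.values[0])
#            0
# 0 2012-02-29
# 1 2012-03-31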
Once this is passed back out, I have a MultiIndex, but I can force it into the row format we want with reset_index. Then we simply drop the unwanted level_1 counter column.
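For reference, the intermediate result of the grouped apply, before the reset_index and drop, looks roughly like this (exact repr may vary by pandas version):

df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0]))
#                   0
# date
# 4/1/11 0 2012-02-29
#        1 2012-03-31
#        2 2012-04-25
#        3 2012-06-30
# 4/2/11 0 2014-01-31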
It sounds involved, but really we're just leveraging the natural behavior of pandas functions to avoid explicit iteration or looping.
Speed-wise this tends to be pretty good, and since it relies on apply, any parallelization tricks that work with apply work here too.
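As one hedged example of such a trick (not from the original approach, and only worth it on large frames given process startup costs), you could fan the groups out with the standard-library multiprocessing module:

from multiprocessing import Pool

def explode_group(args):
    # hypothetical helper: build one small frame per (date, list) pair;
    # pandas broadcasts the scalar date across the list
    date, ts_list = args
    return pd.DataFrame({'date': date, 'ts': ts_list})

pairs = [(date, grp.values[0]) for date, grp in df.groupby('date').ts]
with Pool() as pool:  # needs an `if __name__ == '__main__':` guard on spawn platforms
    df_new = pd.concat(pool.map(explode_group, pairs), ignore_index=True)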
Optionally, if you want it to be robust to a date appearing on multiple rows, each with its own nested list:
df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame([item for sublist in x.values for item in sublist]))
at which point the one-liner is getting dense and you should probably throw it into a function (the same reset_index and rename cleanup still applies).
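One way to package it, as a sketch; the function and parameter names here are mine, not from the original:

def explode_lists(frame, key='date', col='ts'):
    # flatten a column of lists into one row per element, repeating the key
    out = (frame.groupby(key)[col]
                .apply(lambda x: pd.DataFrame([item for sublist in x.values
                                               for item in sublist]))
                .reset_index()
                .drop('level_1', axis=1))
    out.columns = [key, col]
    return out

df_new = explode_lists(df)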