If you want to stay in pure pandas, you can reach for a slightly tricky groupby and apply, which boils down to a one-liner if you don't count the column rename.
In [1]: import pandas as pd
In [2]: d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}
In [3]: df = pd.DataFrame(d)
In [4]: df.head()
Out[4]:
date ts
0 4/1/11 [2012-02-29 00:00:00, 2012-03-31 00:00:00, 201...
1 4/2/11 [2014-01-31 00:00:00]
In [5]: df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0])).reset_index().drop('level_1', axis=1)
In [6]: df_new.columns = ['date','ts']
In [7]: df_new.head()
Out[7]:
date ts
0 4/1/11 2012-02-29
1 4/1/11 2012-03-31
2 4/1/11 2012-04-25
3 4/1/11 2012-06-30
4 4/2/11 2014-01-31
Since the goal is to take the value of one column (in this case date) and repeat it for every row you create from the list, it helps to think in terms of pandas indexing.
We want the date to become the single index for the new rows, so we use groupby, which moves the desired row value into the index. Then, inside that operation, I want to split up only the list belonging to that date, which is what apply does for us.
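As a quick illustration (this call is mine, not part of the original one-liner), whatever apply returns for a group is keyed by that group's date:

# each group's result comes back indexed by its date
df.groupby('date').ts.apply(len)
# date
# 4/1/11    1
# 4/2/11    1
# Name: ts, dtype: int64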
What I'm passing apply is a pandas Series consisting of a single list; .values turns that one-row Series into a one-element array, and [0] pulls the list back out.
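A minimal sketch of that extraction step, with a made-up two-element list standing in for what one group actually holds:

x = pd.Series([[pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31')]])
x.values     # a one-element object array wrapping the list
x.values[0]  # the plain Python list of Timestamps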
To turn the list into a set of rows that get handed back under the indexed date, I can just make it a DataFrame. This incurs the penalty of picking up an extra integer index, but we end up dropping that. We could make the list values the index instead, but that would preclude duplicate values.
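Continuing the sketch above, the DataFrame constructor gives one row per Timestamp, plus the throwaway 0..n-1 index just mentioned:

pd.DataFrame(x.values[0])
#            0
# 0 2012-02-29
# 1 2012-03-31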
Once this is passed back out, I have a MultiIndex, but I can force it into the row format we want with reset_index. Then we simply drop the unwanted level_1 counter column.
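For reference, the intermediate result of the grouped apply, before the reset_index and drop, looks roughly like this (exact repr may vary by pandas version):

df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0]))
#                   0
# date
# 4/1/11 0 2012-02-29
#        1 2012-03-31
#        2 2012-04-25
#        3 2012-06-30
# 4/2/11 0 2014-01-31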
It sounds involved, but really we're just leveraging the natural behavior of pandas functions to avoid explicit iteration or looping.
Speed-wise this tends to be pretty good, and since it relies on apply, any parallelization tricks that work with apply work here too.
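As one hedged example of such a trick (not from the original approach, and only worth it on large frames given process startup costs), you could fan the groups out with the standard-library multiprocessing module:

from multiprocessing import Pool

def explode_group(args):
    # hypothetical helper: build one small frame per (date, list) pair;
    # pandas broadcasts the scalar date across the list
    date, ts_list = args
    return pd.DataFrame({'date': date, 'ts': ts_list})

pairs = [(date, grp.values[0]) for date, grp in df.groupby('date').ts]
with Pool() as pool:  # needs an `if __name__ == '__main__':` guard on spawn platforms
    df_new = pd.concat(pool.map(explode_group, pairs), ignore_index=True)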
Optionally, if you want it to be robust to a date appearing on multiple rows, each with its own nested list:
df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame([item for sublist in x.values for item in sublist]))
at which point the one-liner is getting dense and you should probably throw it into a function (the same reset_index and rename cleanup still applies).
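One way to package it, as a sketch; the function and parameter names here are mine, not from the original:

def explode_lists(frame, key='date', col='ts'):
    # flatten a column of lists into one row per element, repeating the key
    out = (frame.groupby(key)[col]
                .apply(lambda x: pd.DataFrame([item for sublist in x.values
                                               for item in sublist]))
                .reset_index()
                .drop('level_1', axis=1))
    out.columns = [key, col]
    return out

df_new = explode_lists(df)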