
I have a DataFrame that includes a column where every cell is made up of a list of dicts, and each list of dicts is of varying length (including 0).

An example:

df = pd.DataFrame({'ID' : [13423,294847,322844,429847], 'RANKS': [[{u'name': u'A', u'price': u'$1.00', u'rank': u'1'},
{u'name': u'B', u'price': u'$4.00', u'rank': u'2'},
{u'name': u'C', u'price': u'$3.99', u'rank': u'3'},
{u'name': u'D', u'price': u'$2.00', u'rank': u'4'},
{u'name': u'E', u'price': u'$2.50', u'rank': u'5'}],

[{u'name': u'AA', u'price': u'$1.99', u'rank': u'1'},
{u'name': u'BB', u'price': u'$6.99', u'rank': u'2'}],

[{u'name': u'Z', u'price': u'$0.99', u'rank': u'1'},
{u'name': u'Y', u'price': u'$10.00', u'rank': u'2'},
{u'name': u'X', u'price': u'$1.99', u'rank': u'3'}],[]], 'count' : [5,2,3,0]})

Note that 'count' is the number of dicts in 'RANKS'. The goal I had in mind was to create a series of additional DataFrames/tables (one for each rank) and link these to the main table in an HDFStore. Something like:

Rank_2
ID       Price   Name
13423    $4.00    B  
294847   $6.99    BB 
322844   $10.00   Y 
429847   NaN      NaN   


Rank_3
ID       Price   Name
13423    $3.99    C  
294847   NaN      NaN 
322844   $1.99    X 
429847   NaN      NaN   

This way I could easily query on ID and rank if needed, but the main table wouldn't get cluttered with the unwinding of this hierarchical data.

The problem, however, is that I cannot figure out how to create the DataFrames from this column. I've tried a number of things. The first (which I meant to nest in a for-loop, but it did not work):

Rank_1 = pd.DataFrame(df.loc[df['count'] > 0]['RANKS'].map(lambda x: pd.DataFrame(x[0])))

And, the second, since price is the most important piece to me:

for i in range(0,5):
    df['rank_%s' % str(i+1)] = df[df['count'] > i]['RANKS'].map(lambda x: x[i]['price'].strip('$'))

Then convert to float. This works but is a pretty big compromise. Is there an efficient way (that won't get hung up on NaNs) to accomplish my goal of separate DataFrames for each rank?

James
  • James, you should consider separating your post into two questions. One for splitting your dataframe into multiple data frames and for the HDFStore question. – Bob Haffner Feb 22 '15 at 16:34
  • Will do. Thank you, Bob. – James Feb 22 '15 at 18:17
  • My gut reaction is that you probably should *not* break apart your DataFrame into many smaller DataFrames. Handling lots of small DataFrames requires a Python loop which is usually a step along the path to slowness. Instead I think you might be better served with one DataFrame, which flattens the list of dicts so that each inner dict is given its own row in the DataFrame. The keys of the inner dict will become new columns. I suspect this single flat DataFrame format would be able to do anything the multiple DataFrame alternative could do but faster, and it would make saving to HDFStore simple. – unutbu Feb 22 '15 at 18:58
  • Link to second question (somewhat related post): http://stackoverflow.com/q/28662373/4586746 – James Feb 22 '15 at 19:44
  • @unutbu Thanks for the response. I am definitely somewhat in agreement. And you don't think flattening it as you suggested will cause any performance issues with an already large DataFrame (1.5M rows and ~9GB of ram used)? – James Feb 22 '15 at 19:48
  • It will actually use less memory. Native dtypes such as `np.float` and `np.int` require less space than Python `float`s and `ints`. And Python `dict`s require quite a few bytes as well. A Pandas DataFrame that contains a list of dicts uses an `object` dtype to store arbitrary Python objects. If you replace that list of dicts with columns with native NumPy/Pandas dtypes, then you will definitely save space. – unutbu Feb 22 '15 at 19:53
  • Since the current DataFrame requires 9GB of memory, I'm not sure if you have enough memory to build the new DataFrame in memory as well. If not, then it would be best to build the new DataFrame from your original data source without using your current DataFrame as an intermediate step. – unutbu Feb 22 '15 at 19:56
  • I have 32GB total memory available. I believe that should be enough. Would you mind providing some code/pseudo-code to accomplish this? I am happy to ask another question if that's more appropriate. – James Feb 22 '15 at 20:07
  • @James: I may have to eat my words regarding the claim that the flat DataFrame requires less memory. There are certain pieces of data such as the `count` and `ID` which is repeated on each row and this adds to the total memory usage where the nested DataFrame did not. So the flat DataFrame is not always smaller. – unutbu Feb 22 '15 at 20:23

3 Answers


My gut reaction is that you probably should not break apart your DataFrame into many smaller DataFrames. Handling lots of small DataFrames requires a Python loop which is usually a step along the path to slowness. Instead I think you might be better served with one DataFrame, which flattens the list of dicts so that each inner dict is given its own row in the DataFrame. The keys of the inner dict will become new columns. I suspect this single flat DataFrame format would be able to do anything the multiple DataFrame alternative could do but faster, and it would make saving to HDFStore simple.

Suppose you have a DataFrame with a list of dicts in the RANKS column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID' : [13423,294847,322844,429847], 'RANKS': [[{u'name': u'A', u'price': u'$1.00', u'rank': u'1'},
{u'name': u'B', u'price': u'$4.00', u'rank': u'2'},
{u'name': u'C', u'price': u'$3.99', u'rank': u'3'},
{u'name': u'D', u'price': u'$2.00', u'rank': u'4'},
{u'name': u'E', u'price': u'$2.50', u'rank': u'5'}],

[{u'name': u'AA', u'price': u'$1.99', u'rank': u'1'},
{u'name': u'BB', u'price': u'$6.99', u'rank': u'2'}],

[{u'name': u'Z', u'price': u'$0.99', u'rank': u'1'},
{u'name': u'Y', u'price': u'$10.00', u'rank': u'2'},
{u'name': u'X', u'price': u'$1.99', u'rank': u'3'}],[]], 'count' : [5,2,3,0]})

then you can build a flat DataFrame with one dict per row like this:

result = []
for idx, row in df.iterrows():
    for dct in row['RANKS']:
        dct['ID'] = row['ID']
        dct['count'] = row['count']
        result.append(dct)
del df
result = pd.DataFrame(result)
result['rank'] = result['rank'].astype(np.int32)
result['price'] = result['price'].str.replace('$', '', regex=False)  # '$' is a regex metacharacter; replace the literal character
result['price'] = result['price'].astype('float')
print(result)

which yields

       ID  count name  price  rank
0   13423      5    A   1.00     1
1   13423      5    B   4.00     2
2   13423      5    C   3.99     3
3   13423      5    D   2.00     4
4   13423      5    E   2.50     5
5  294847      2   AA   1.99     1
6  294847      2   BB   6.99     2
7  322844      3    Z   0.99     1
8  322844      3    Y  10.00     2
9  322844      3    X   1.99     3
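To see that the flat frame subsumes the per-rank tables from the question, note that each `Rank_k` view is just a boolean mask. A minimal sketch, where `result` is rebuilt by hand as a small stand-in for the frame above:

```python
import pandas as pd

# Small hand-built stand-in for the flat `result` frame above.
result = pd.DataFrame({
    'ID':    [13423, 13423, 294847, 322844, 322844],
    'count': [5, 5, 2, 3, 3],
    'name':  ['A', 'B', 'AA', 'Z', 'Y'],
    'price': [1.00, 4.00, 1.99, 0.99, 10.00],
    'rank':  [1, 2, 1, 1, 2],
})

# "Rank_2" from the question is just a mask on the flat frame.
rank_2 = result.loc[result['rank'] == 2, ['ID', 'price', 'name']]
print(rank_2)
```

A `result.groupby('rank')` would yield all the per-rank views in one pass, without ever holding many small DataFrames.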

Note that building result directly from the original data source (thus avoiding df altogether) would be a cleaner, less memory-demanding solution.
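For instance, if the records arrive as plain Python tuples of `(ID, count, list_of_rank_dicts)` — a hypothetical stand-in for whatever the real source provides — the flat frame can be built without ever materializing the nested df:

```python
import pandas as pd

# Hypothetical raw records: (ID, count, list of rank dicts), standing
# in for the original data source.
records = [
    (13423, 2, [{'name': 'A', 'price': '$1.00', 'rank': '1'},
                {'name': 'B', 'price': '$4.00', 'rank': '2'}]),
    (429847, 0, []),
]

rows = []
for id_, count, ranks in records:
    for dct in ranks:
        # One row per inner dict; empty lists contribute nothing.
        rows.append({'ID': id_, 'count': count, **dct})

flat = pd.DataFrame(rows)
flat['rank'] = flat['rank'].astype('int32')
flat['price'] = flat['price'].str.replace('$', '', regex=False).astype('float')
print(flat)
```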

unutbu
  • Thanks a lot. I really appreciate the advice. I had a thought that I will try soon: To help the memory issue, I'll create a small dataframe that unwinds the nested column (per your code), then merge that on ID with the main dataframe. That way, there is no duplicate 'df,' just an additional dataframe comprised of ID and rank info. – James Feb 22 '15 at 22:17
  • Update (a bit late): This was the strategy I ended up using. – James Aug 11 '15 at 23:23
  • Five years late to the party, but this answer finally solved an issue I've been working on all day, thanks a lot! – foakesm Sep 29 '20 at 13:27

As of pandas 0.25.0 there is a `df.explode` method for exploding lists, plus a little extra code needed to expand the dicts.

If your dataframe is:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID' : [13423,294847,322844,429847], 'RANKS': [[{u'name': u'A', u'price': u'$1.00', u'rank': u'1'},
{u'name': u'B', u'price': u'$4.00', u'rank': u'2'},
{u'name': u'C', u'price': u'$3.99', u'rank': u'3'},
{u'name': u'D', u'price': u'$2.00', u'rank': u'4'},
{u'name': u'E', u'price': u'$2.50', u'rank': u'5'}],

[{u'name': u'AA', u'price': u'$1.99', u'rank': u'1'},
{u'name': u'BB', u'price': u'$6.99', u'rank': u'2'}],

[{u'name': u'Z', u'price': u'$0.99', u'rank': u'1'},
{u'name': u'Y', u'price': u'$10.00', u'rank': u'2'},
{u'name': u'X', u'price': u'$1.99', u'rank': u'3'}],[]], 'count' : [5,2,3,0]})

then to explode lists you can do the following:

df = df.explode('RANKS')

which gives you

    ID  RANKS   count
0   13423   {'name': 'A', 'price': '$1.00', 'rank': '1'}    5
0   13423   {'name': 'B', 'price': '$4.00', 'rank': '2'}    5
0   13423   {'name': 'C', 'price': '$3.99', 'rank': '3'}    5
0   13423   {'name': 'D', 'price': '$2.00', 'rank': '4'}    5
0   13423   {'name': 'E', 'price': '$2.50', 'rank': '5'}    5
1   294847  {'name': 'AA', 'price': '$1.99', 'rank': '1'}   2
1   294847  {'name': 'BB', 'price': '$6.99', 'rank': '2'}   2
2   322844  {'name': 'Z', 'price': '$0.99', 'rank': '1'}    3
2   322844  {'name': 'Y', 'price': '$10.00', 'rank': '2'}   3
2   322844  {'name': 'X', 'price': '$1.99', 'rank': '3'}    3
3   429847  NaN 0

To expand these dicts into their own columns you can do the following:

df.reset_index(drop=True, inplace=True)

# Replace NaN by empty dict
def replace_nans_with_dict(series):
    for idx in series[series.isnull()].index:
        series.at[idx] = {}
    return series



# Explodes lists and expands dicts into columns
def df_explosion(df, col_name: str):

    if df[col_name].isna().any():
        df[col_name] = replace_nans_with_dict(df[col_name])

    df.reset_index(drop=True, inplace=True)

    df1 = pd.DataFrame(df.loc[:,col_name].values.tolist())

    df = pd.concat([df,df1], axis=1)

    df.drop([col_name], axis=1, inplace=True)

    return df

Run

df = df_explosion(df, 'RANKS')

and you will have:

ID  count   name    price   rank
0   13423   5   A   $1.00   1
1   13423   5   B   $4.00   2
2   13423   5   C   $3.99   3
3   13423   5   D   $2.00   4
4   13423   5   E   $2.50   5
5   294847  2   AA  $1.99   1
6   294847  2   BB  $6.99   2
7   322844  3   Z   $0.99   1
8   322844  3   Y   $10.00  2
9   322844  3   X   $1.99   3
10  429847  0   NaN NaN NaN
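On newer pandas (1.0+, where `pd.json_normalize` is a top-level function), the dict expansion can be done with `json_normalize` instead of `values.tolist()` — a sketch of that variant, using a trimmed two-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [13423, 429847],
    'RANKS': [[{'name': 'A', 'price': '$1.00', 'rank': '1'},
               {'name': 'B', 'price': '$4.00', 'rank': '2'}], []],
})

exploded = df.explode('RANKS').reset_index(drop=True)
# Rows that came from empty lists hold NaN; replace them with empty
# dicts so json_normalize keeps them as all-NaN rows.
exploded['RANKS'] = exploded['RANKS'].apply(
    lambda d: d if isinstance(d, dict) else {})
out = pd.concat([exploded.drop(columns='RANKS'),
                 pd.json_normalize(exploded['RANKS'].tolist())], axis=1)
print(out)
```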
Okroshiashvili

I just encountered a similar scenario and here is how I solved it in the end:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'ID' : [13423,294847,322844,429847],
   ...:     'RANKS': [[{u'name': u'A', u'price': u'$1.00', u'rank': u'1'},
   ...:                {u'name': u'B', u'price': u'$4.00', u'rank': u'2'},
   ...:                {u'name': u'C', u'price': u'$3.99', u'rank': u'3'},
   ...:                {u'name': u'D', u'price': u'$2.00', u'rank': u'4'},
   ...:                {u'name': u'E', u'price': u'$2.50', u'rank': u'5'}],
   ...: 
   ...:               [{u'name': u'AA', u'price': u'$1.99', u'rank': u'1'},
   ...:                {u'name': u'BB', u'price': u'$6.99', u'rank': u'2'}],
   ...: 
   ...:               [{u'name': u'Z', u'price': u'$0.99', u'rank': u'1'},
   ...:                {u'name': u'Y', u'price': u'$10.00', u'rank': u'2'},
   ...:                {u'name': u'X', u'price': u'$1.99', u'rank': u'3'}],[]]})

In [3]: import itertools

In [4]: temp_df = pd.DataFrame(
   ...:     list(itertools.chain(*[zip([key]*len(val), val)
   ...:                            for key, val in df.RANKS.iteritems()])),
   ...:     columns=['idx', 'explode'])                  

In [5]: exploded = pd.merge(
   ...:     df.drop('RANKS', axis=1),
   ...:     temp_df.explode.apply(pd.Series).join(temp_df.idx),
   ...:     left_index=True,
   ...:     right_on='idx',
   ...:     how='left').drop('idx', axis=1)

The exploded data frame looks like:

In [6]: exploded
Out[6]: 
       ID name   price rank
0   13423    A   $1.00    1
1   13423    B   $4.00    2
2   13423    C   $3.99    3
3   13423    D   $2.00    4
4   13423    E   $2.50    5
5  294847   AA   $1.99    1
6  294847   BB   $6.99    2
7  322844    Z   $0.99    1
8  322844    Y  $10.00    2
9  322844    X   $1.99    3
9  429847  NaN     NaN  NaN
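Two details in this answer no longer work on current pandas: `Series.iteritems` was removed in pandas 2.0 (use `.items()`), and `temp_df.explode` now resolves to the DataFrame `explode` method rather than the column, so bracket access is needed. A sketch of the same approach updated for modern pandas, using a small two-row frame for brevity:

```python
import itertools
import pandas as pd

df = pd.DataFrame({
    'ID': [13423, 429847],
    'RANKS': [[{'name': 'A', 'price': '$1.00', 'rank': '1'}], []],
})

# .items() replaces the removed Series.iteritems.
temp_df = pd.DataFrame(
    list(itertools.chain(*[zip([key] * len(val), val)
                           for key, val in df.RANKS.items()])),
    columns=['idx', 'explode'])

# Bracket access, because .explode is now a DataFrame method name.
exploded = pd.merge(
    df.drop('RANKS', axis=1),
    temp_df['explode'].apply(pd.Series).join(temp_df['idx']),
    left_index=True,
    right_on='idx',
    how='left').drop('idx', axis=1)
print(exploded)
```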
Pedro M Duarte