
Simple dictionary:

d = {'a': set([1,2,3]), 'b': set([3, 4])}

(the sets may be turned into lists if it matters)

How do I convert it into a long/tidy DataFrame in which each column is a variable and every observation is a row, i.e.:

  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4

The following works, but it's a bit cumbersome:

id = 0
tidy_d = {}
for l, vs in d.items():
    for v in vs:
        tidy_d[id] = {'letter': l, 'value': v}
        id += 1
pd.DataFrame.from_dict(tidy_d, orient = 'index')

Is there any pandas magic to do this? Something like:

pd.DataFrame([d]).T.reset_index(level=0).unnest()

where unnest obviously doesn't exist and comes from R.
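For reference, later pandas versions (0.25+) gained `Series.explode`, which is close to the `unnest` asked for here; a minimal sketch, converting each set to a sorted list first for a deterministic row order:

```python
import pandas as pd

d = {'a': {1, 2, 3}, 'b': {3, 4}}

# Sort each set into a list, then let explode() turn every element
# into its own row, repeating the index value (the dict key).
df = (pd.Series({k: sorted(v) for k, v in d.items()})
        .explode()
        .rename_axis('letter')
        .reset_index(name='value'))
```

This gives the same two-column long frame as the loop above.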

Giora Simchoni

5 Answers


Use numpy.repeat with chain.from_iterable:

from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'letter': np.repeat(list(d.keys()), [len(v) for k, v in d.items()]),
    'value': list(chain.from_iterable(d.values())),
})
print(df)
  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4
jezrael
    You can use `list(d)` here. Also `list(map(len, d.values()))` [or, at least iterate `values` here rather than `items`]. – jpp Oct 03 '18 at 16:14
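Putting jpp's comment into practice, the same answer can be sketched more compactly as:

```python
from itertools import chain

import numpy as np
import pandas as pd

d = {'a': {1, 2, 3}, 'b': {3, 4}}

# list(d) yields the keys; map(len, d.values()) yields how many times
# each key must be repeated to line up with its values.
df = pd.DataFrame({
    'letter': np.repeat(list(d), list(map(len, d.values()))),
    'value': list(chain.from_iterable(d.values())),
})
```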

You can use a comprehension with itertools.chain and zip:

from itertools import chain

keys, values = map(chain.from_iterable, zip(*((k*len(v), v) for k, v in d.items())))

df = pd.DataFrame({'letter': list(keys), 'value': list(values)})

print(df)

  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4

This can be rewritten in a more readable fashion:

zipper = zip(*((k*len(v), v) for k, v in d.items()))
values = map(list, map(chain.from_iterable, zipper))

df = pd.DataFrame(list(values), columns=['letter', 'value'])
jpp
    notice there's an issue if the keys in `d` are int, i.e. `d = {0: set([1,2,3]), 1: set([3, 4])}`, in which case it will throw a TypeError. – Giora Simchoni Oct 24 '18 at 12:33
  • @GioraSimchoni, Good point. Of course, OP has specified strings in the input dictionary, but your point should be noted for adapting to other problems. – jpp Oct 24 '18 at 13:03
  • Got bitten by int keys - `keys, values = map(chain.from_iterable, zip(*(([k]*len(v), v) for k, v in d.items())))` for adapting to other problems – Mr_and_Mrs_D Apr 10 '20 at 10:38
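Demonstrating the fix from the comments: wrapping the key in a list (`[k] * len(v)` instead of `k * len(v)`) does not rely on string multiplication, so it works for any key type, e.g. int keys:

```python
from itertools import chain

import pandas as pd

d = {0: {1, 2, 3}, 1: {3, 4}}  # int keys, which break k * len(v)

# [k] * len(v) repeats the key as a list element rather than
# multiplying the key itself, so any hashable key type works.
keys, values = map(chain.from_iterable,
                   zip(*(([k] * len(v), v) for k, v in d.items())))
df = pd.DataFrame({'letter': list(keys), 'value': list(values)})
```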

A tad more "pandaic", inspired by this post:

pd.DataFrame.from_dict(d, orient = 'index') \
  .rename_axis('letter').reset_index() \
  .melt(id_vars = ['letter'], value_name = 'value') \
  .drop('variable', axis = 1) \
  .dropna()
Giora Simchoni
  • Can you explain a bit what melt does? – Mr_and_Mrs_D Apr 09 '20 at 00:02
    Think of a "wide" data table which melts into a "long" one. Columns "week1", "week2", "week3" melt into a single column named "week" having values "1", "2" and "3", and an additional column "value" is added. The docs may explain it better: https://pandas.pydata.org/docs/reference/api/pandas.melt.html – Giora Simchoni Apr 09 '20 at 05:24
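A small self-contained illustration of that wide-to-long reshape, with made-up week columns:

```python
import pandas as pd

# A "wide" table: one column per week.
wide = pd.DataFrame({'name': ['x', 'y'],
                     'week1': [1, 4],
                     'week2': [2, 5],
                     'week3': [3, 6]})

# melt() stacks the week columns into (variable, value) pairs while
# keeping 'name' as an identifier column.
long = wide.melt(id_vars=['name'], var_name='week', value_name='value')
```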

Some timings of melt and slightly modified chain answers:

import random
import timeit
from itertools import chain
import pandas as pd
print(pd.__version__)

dict_size = 1000000
randoms = [random.randint(0, 100) for __ in range(10000)]
max_list_size = 1000
d = {k: random.sample(randoms, random.randint(1, max_list_size)) for k in
     range(dict_size)}

def chain_():
    keys, values = map(chain.from_iterable,
                       zip(*(([k] * len(v), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

def melt_():
    pd.DataFrame.from_dict(d, orient='index'
        ).rename_axis('letter').reset_index(
        ).melt(id_vars=['letter'], value_name='value'
        ).drop('variable', axis=1).dropna()

setup ="""from __main__ import chain_, melt_"""
repeat = 3
numbers = 10
def timer(statement, _setup=''):
  print(min(
    timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))

print('timing')
timer('chain_()')
timer('melt_()')

Seems melt is faster for max_list_size 100:

1.0.3
timing
246.71311019999996
204.33705529999997

and slower for max_list_size 1000:

2675.8446872
4565.838648400002

probably because `from_dict` first allocates a much wider, NaN-padded DataFrame than is actually needed
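That intermediate allocation is easy to see: `from_dict(..., orient='index')` pads every row out to the longest value list with NaN before melt/dropna discard them again (a sketch using lists, as the timing code does):

```python
import pandas as pd

d = {'a': [1, 2, 3], 'b': [3, 4]}

# The wide intermediate has one column per element of the longest
# list; shorter rows are NaN-padded, so many cells can be wasted.
wide = pd.DataFrame.from_dict(d, orient='index')
```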

A variation of chain answer:

import itertools

def chain_2():
    keys, values = map(chain.from_iterable,
                       zip(*((itertools.repeat(k, len(v)), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

doesn't seem to be any faster

(python 3.7.6)

Mr_and_Mrs_D

Just another option, building both columns with a `collections.defaultdict`:

from collections import defaultdict
e = defaultdict(list)
for key, val in d.items():
    e["letter"] += [key] * len(val)
    e["value"] += list(val)
df = pd.DataFrame(e)
efajardo