
(Or a list of lists... I just edited)

Is there an existing Python/pandas method for converting a structure like this

food2 = {}
food2["apple"]   = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"]  = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

into a pivot table like this?

+---------+-------+-----+-------+------+--------+--------+-----+
|         | fruit | veg | round | long | yellow | orange | red |
+---------+-------+-----+-------+------+--------+--------+-----+
| apple   | 1     |     | 1     |      |        |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| bananna | 1     |     |       | 1    | 1      |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| carrot  |       | 1   |       | 1    |        | 1      |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| raddish |       | 1   |       |      |        |        | 1   |
+---------+-------+-----+-------+------+--------+--------+-----+

Naively, I would probably just loop through the dictionary. I can see how to use a map over each inner list, but I don't know how to join/stack the resulting frames across the dictionary. Once they were joined, I could just use pandas.pivot_table.

import pandas as pd

for key in food2:
    attrlist = food2[key]
    # build [fruit, attr] pairs for one fruit
    onefruit_pairs = list(map(lambda x: [key, x], attrlist))
    one_fruit_frame = pd.DataFrame(onefruit_pairs, columns=['fruit', 'attr'])
    print(one_fruit_frame)

     fruit    attr
0  bananna   fruit
1  bananna  yellow
2  bananna    long
    fruit    attr
0  carrot     veg
1  carrot  orange
2  carrot    long
   fruit   attr
0  apple  fruit
1  apple  round
     fruit attr
0  raddish  veg
1  raddish  red
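One way to finish the join/stack step the question asks about (my sketch, not an answer from the thread): flatten the dictionary into one long frame of (fruit, attr) pairs, then pivot with `aggfunc="size"` so each cell counts occurrences.

```python
import pandas as pd

food2 = {
    "apple":   ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot":  ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# Flatten the dict into one long frame of (fruit, attr) pairs
pairs = [(k, attr) for k, attrs in food2.items() for attr in attrs]
long_df = pd.DataFrame(pairs, columns=["fruit", "attr"])

# Pivot: one row per fruit, one column per attribute, 1 where present
wide = pd.pivot_table(long_df, index="fruit", columns="attr",
                      aggfunc="size", fill_value=0)
print(wide)
```

This avoids building one small DataFrame per key; `pd.concat` over the per-fruit frames from the loop above would also work, but a single flat list comprehension is cheaper.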
Mark Miller

2 Answers


An answer using pandas.

import pandas as pd

# Test data
food2 = {}
food2["apple"]   = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"]  = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

df = pd.DataFrame({k: pd.Series(v) for k, v in food2.items()})
# pivoting to long format
df = pd.melt(df, var_name='item', value_name='categ')
# cross-tabulation
df = pd.crosstab(df['item'], df['categ'])
# sorting index, maybe not necessary    
df.sort_index(inplace=True)
df

# categ    fruit  long  orange  red  round  veg  yellow
# item                                                 
# apple        1     0       0    0      1    0       0
# bananna      1     1       0    0      0    0       1
# carrot       0     1       1    0      0    1       0
# raddish      0     0       0    1      0    1       0
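For comparison (my addition, not part of this answer), newer pandas can produce the same table without the intermediate wide frame, by exploding a Series of lists and one-hot encoding it; a sketch assuming the same `food2` dict:

```python
import pandas as pd

food2 = {
    "apple":   ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot":  ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# One row per (item, attribute) pair
s = pd.Series(food2).explode()
# One-hot encode, then collapse back to one row per item
table = pd.get_dummies(s).groupby(level=0).max()
print(table)
```

`Series.explode` requires pandas 0.25 or later; the column dtype is boolean (or 0/1 integers on older versions), which casts cleanly to the 1/0 table the question shows.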
Romain
  • Tested with the same input as the other answer. Oddly enough, performance is not that far off for that input (279936 rows by 1000 columns, very sparse). – Paulo Scardine Jan 12 '16 at 00:03

Pure python:

from itertools import chain

def count(d):
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]

Testing:

>>> food2 = {           
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"]
}

>>> list(count(food2))
[['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
 ['bananna', True, False, True, True, False, False, False],
 ['carrot', True, True, False, False, True, False, False],
 ['apple', False, False, True, False, False, True, False],
 ['raddish', False, True, False, False, False, False, True]]
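One micro-optimization worth noting (my suggestion, not part of this answer): `col in values` tests membership against a list, which is O(n) per column. Converting each attribute list to a set first, and sorting the columns for a deterministic header, keeps the same output shape:

```python
from itertools import chain

def count_fast(d):
    # Sorted columns give a reproducible header order
    cols = sorted(set(chain.from_iterable(d.values())))
    yield ["name"] + cols
    for row, values in d.items():
        vals = set(values)  # O(1) membership instead of O(n)
        yield [row] + [col in vals for col in cols]

food2 = {
    "apple":   ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot":  ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}
rows = list(count_fast(food2))
```

With only a handful of attributes per item the difference is negligible; it matters when rows carry many attributes.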

[update]

Performance test:

>>> from itertools import product
>>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
>>> attrs = labels[:1000]
>>> import random
>>> sample = {}
>>> for k in labels:
...     sample[k] = random.sample(attrs, 5)
>>> import time
>>> n = time.time(); list(count(sample)); print(time.time() - n)
62.0367980003

It took less than 2 minutes for 279936 rows by 1000 columns on my busy machine (lots of Chrome tabs open). Let me know if the performance is unacceptable.

[update]

Testing the performance from the other answer:

>>> n = time.time(); \
...     df = pd.DataFrame(dict([(k, pd.Series(v)) for k,v in sample.items()])); \
...     print(time.time() - n)
72.0512290001

The next line (df = pd.melt(...)) was taking too long so I canceled the test. Take this result with a grain of salt because it was running on a busy machine.

Paulo Scardine
  • Excellent. Do you have any intuition on how this would perform (compared to some yet unspecified Pandas magic) on hundreds of thousands of "fruit" and thousands of attributes? – Mark Miller Jan 11 '16 at 18:26
  • I "had" to import itertools – Mark Miller Jan 11 '16 at 18:37
  • This solution is optimized for simplicity instead of performance. There is plenty of room for improvement, especially if you know the attributes in advance. Updated with the missing import. – Paulo Scardine Jan 11 '16 at 18:42
  • Would you be kind enough to compare the answers regarding the performance when applied to your data? – Paulo Scardine Jan 12 '16 at 00:32