
I've spent all day trying to make this work and it's starting to make me angry! All I want to do is create the pandas Series that upsetplot needs as input, as detailed here:

https://pypi.org/project/upsetplot/

I don't understand how the generate_data function is manipulating its sets to make a series. I would have assumed that there was a simple way to do this by calling set(), but I can't seem to find it.

So I instead began manipulating my dataframes directly but suspected the attempts were misguided.

Thus I resort to providing a simple dataframe below and pray that some kind soul can enlighten me.

import pandas as pd
from matplotlib import pyplot as plt
from upsetplot import generate_data, plot

df = pd.DataFrame({'john':[1,2,3,5,7,8],
              'jerry':[1,2,5,7,9,2],
              'josie':[2,2,3,2,5,6],
              'jean':[6,5,7,6,2,4]})

df = pd.DataFrame({'john':[True,False,True,False,True,False],
              'jerry':[True,True,False,True,False,True],
              'josie':[True,False,False,True,False,False],
              'jean':[True,False,False,True,False,False],
              'food':['apple','carrot','choc','bread','ham','nut']})

The example from the package home page:

from upsetplot import generate_data
example = generate_data(aggregated=True)
example  # doctest: +NORMALIZE_WHITESPACE
set0   set1   set2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
Name: value, dtype: int64
joeln
Jeff S.
  • Please mention your expected output. – Abdur Rehman Jan 04 '19 at 06:29
  • `df` is your input dataframe ? – Abdur Rehman Jan 04 '19 at 06:35
  • I'd expect a pandas Series object like the one shown on the PyPI page. I've included it above. df is the dataframe, yes, but it's just an example to start with. I'm beyond caring how the df is set up (i.e. whether the values are strings, integers, booleans, etc.) because I'm just so perplexed. – Jeff S. Jan 04 '19 at 06:40
  • So you want a dataframe like this, but with the last column replaced by your `food` column. If I am wrong, please state your expected output with respect to your input dataframe, as your expected output is still very vague and confusing. – Abdur Rehman Jan 04 '19 at 06:46
  • Exactly. For the pandas Series in 'example', the sets of booleans are all part of the index and the counts are the values. Sorry, I see what you mean, I'll change the df. – Jeff S. Jan 04 '19 at 06:54
  • @JeffS. - Answer is edited, is it what you need? – jezrael Jan 04 '19 at 06:57
  • The interface for inputting data to upsetplot has been improved in version 0.2, although could probably be improved further. See [`from_memberships`](https://upsetplot.readthedocs.io/en/stable/api.html#upsetplot.from_memberships) – joeln May 06 '19 at 07:24

1 Answer


Aggregate the counts with GroupBy.size, grouping by every column except food:

df = pd.DataFrame({'john':[True,False,True,False,True,False],
              'jerry':[True,True,False,True,False,True],
              'josie':[True,False,False,True,False,False],
              'jean':[True,False,False,True,False,False],
              'food':['apple','carrot','choc','bread','ham','nut']})

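# Group by every column except 'food' (difference() returns the names sorted)
cols = df.columns.difference(['food']).tolist()
# size() counts the rows in each observed combination of True/False values
s = df.groupby(cols).size()
print(s)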
jean   jerry  john   josie
False  False  True   False    2
       True   False  False    2
True   True   False  True     1
              True   True     1
dtype: int64
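
Since the question started from sets, the same idea can be sketched from a plain dict of sets. This is only an illustration: the helper names `sets_to_membership_frame` and `membership_counts` are made up here, and the set contents are simply the `food` values from the dataframe above regrouped by person.

```python
import pandas as pd


def sets_to_membership_frame(sets):
    """Turn {name: set_of_items} into a boolean frame with one row per item."""
    items = sorted(set().union(*sets.values()))
    return pd.DataFrame(
        {name: [item in members for item in items]
         for name, members in sets.items()},
        index=items,
    )


def membership_counts(frame, ignore=()):
    """Count rows for each observed combination of the boolean columns."""
    cols = [c for c in frame.columns if c not in ignore]
    return frame.groupby(cols).size()


# Hypothetical input: the same memberships as df above, written as sets.
sets = {
    'john': {'apple', 'choc', 'ham'},
    'jerry': {'apple', 'carrot', 'bread', 'nut'},
    'josie': {'apple', 'bread'},
    'jean': {'apple', 'bread'},
}

frame = sets_to_membership_frame(sets)
counts = membership_counts(frame)
print(counts)
```

The resulting `counts` has the same layout as the `example` series in the question (boolean index levels, counts as values), so it should be usable with `plot(counts)` from upsetplot, although that step is not run here. Only combinations that actually occur appear in the index.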
jezrael