Pandas: scatterplot with points sized by unique values of one column against the corresponding values of another column

Question

Given the following sample DataFrame:

import pandas as pd

df = pd.DataFrame( { 'A' : [ 1, 1, 1, 2, 2, 2, 3, 3, 3 ],
                     'B' : [ 'x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x' ] } )

I want to generate a scatterplot of the unique values of B against their corresponding values of A, with each point sized by the number of times that value of B occurs for that value of A. To do that, I want to get the following three lists:

A = [ 1, 1, 1, 2, 2, 3 ]
B = ['x', 'y', 'z', 'x', 'y', 'x']
Bsize = [ 1, 1, 1, 1, 2, 3]

I've tried doing this with groupby:

group = df.groupby(['A','B'])

The keys of the group contain the data I want, but they're not ordered:

group.groups.keys()
[(1, 2), (1, 3), (3, 1), (2, 1), (2, 2), (1, 1)]

The 'first' method returns what looks like a DataFrame, but I can't access the 'A' and 'B' keys:

group.first()['A']
...
KeyError: 'A'

If I iterate through the names and groups, things seem to be ordered, so I can get what I want by doing:

A = []
B = []
for name, _ in group:
    A.append(name[0])
    B.append(name[1])

I can then get the Bsize list by doing:

group['B'].count().values
array([1, 1, 1, 1, 2, 3])

However, this seems clunky in the extreme and suggests to me I haven't understood how to properly use the group.

Fabio Lamanna · Accepted Answer · 2017-02-15T12:01:54.433

IIUC, you could import numpy as np and do:

In [52]: group = df.groupby(['A','B']).apply(np.unique).reset_index()

In [53]: group
Out[53]: 
   A  B       0
0  1  x  [1, x]
1  1  y  [1, y]
2  1  z  [1, z]
3  2  x  [2, x]
4  2  y  [2, y]
5  3  x  [3, x]

then:

In [57]: A = group['A'].tolist()

In [58]: B = group['B'].tolist()

In [59]: A
Out[59]: [1, 1, 1, 2, 2, 3]

In [60]: B
Out[60]: ['x', 'y', 'z', 'x', 'y', 'x']

To get all the lists you need in one shot, you can do:

In [87]: group = df.groupby(['A','B']).size().reset_index(name='s')

In [88]: group
Out[88]: 
   A  B  s
0  1  x  1
1  1  y  1
2  1  z  1
3  2  x  1
4  2  y  2
5  3  x  3

Bsize:

In [91]: group['s'].tolist()
Out[91]: [1, 1, 1, 1, 2, 3]

A:

In [92]: group['A'].tolist()
Out[92]: [1, 1, 1, 2, 2, 3]

B:

In [93]: group['B'].tolist()
Out[93]: ['x', 'y', 'z', 'x', 'y', 'x']
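
As an aside on the KeyError in the question: with the default as_index=True, groupby puts the grouping keys into the index of the result rather than into columns, which is probably why group.first()['A'] fails. A minimal sketch of reading the keys from the index, or flattening them with reset_index (the names counts, keys_a, keys_b and flat are made up here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x']})

# With the default as_index=True, the grouping keys form a MultiIndex
# on the result instead of appearing as ordinary columns.
counts = df.groupby(['A', 'B']).size()

# The keys can be read back from the index levels by name.
keys_a = counts.index.get_level_values('A').tolist()
keys_b = counts.index.get_level_values('B').tolist()

# reset_index() moves the index levels back into regular columns;
# name='n' labels the column that holds the group sizes.
flat = counts.reset_index(name='n')
```

After reset_index, flat['A'] and flat['B'] work as ordinary column lookups.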

EDIT: in the last dataframe you have all the information you need, so you can keep only the last one to get all of your lists.
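
For the plot itself, one possible sketch with matplotlib, assuming it is installed. The factor of 100 on the marker size, the integer encoding of B via pd.factorize and the Agg backend are choices made here for illustration and are not part of the question:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x']})

# One row per unique (A, B) pair, with the pair count in column 's'.
group = df.groupby(['A', 'B']).size().reset_index(name='s')

# Encode the B labels as integer positions so they fit on a numeric axis.
codes, labels = pd.factorize(group['B'], sort=True)

fig, ax = plt.subplots()
# 's' in scatter is the marker area in points^2, so the raw counts are
# scaled up (arbitrarily, by 100) to make the points visible.
ax.scatter(group['A'], codes, s=group['s'] * 100)
ax.set_yticks(list(range(len(labels))))
ax.set_yticklabels(list(labels))
ax.set_xlabel('A')
ax.set_ylabel('B')
```

Call plt.show() in an interactive session, or fig.savefig('scatter.png') (a hypothetical file name), to see the result.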

Since `df.groupby(['A','B']).size().reset_index(name='s')` contains all the information, perhaps consider removing `np.unique`? — unutbu, Feb 15 '17 at 11:37
