I am trying to reproduce in Python an R ggplot2 plot picked from here. Specifically, I am looking at the correlation scatter plot: points coloured by state and sized by popdensity, with a smooth line on top.

Importing data

import pandas as pd
import matplotlib.pyplot as plt  # used by the plotting snippets below

midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest.csv")

Default Pandas scatter plot

midwest.plot(kind='scatter', x='area', y='poptotal', xlim=(0, 0.1), ylim=(0, 500000))

The above code by itself does not colour-code the different categories.

Pandas Groupby + Scatter plot

However, we can group the dataframe by `state` and then draw a separate scatter plot for each group (ref).

fig, ax = plt.subplots()
groups = midwest.groupby('state')
for name, group in groups:
    ax.plot(group.area, group.poptotal, marker='o', linestyle='', ms=10, label=name)
ax.legend(numpoints=1)
ax.set_ylim((0, 500000))

While this does get us different categories in the scatter plot, it does not get them sized by popdensity.

Seaborn pair plot

import seaborn as sns
# 'size' was renamed to 'height' in seaborn 0.9
sns.pairplot(x_vars=["area"], y_vars=["poptotal"], data=midwest,
             hue="state", height=5)
plt.gca().set_ylim((0, 500000))

Again, this only colours the scatter plot by category. The markers are still not sized by popdensity.

Matplotlib

Here's how we can go down to each data point and make the plot in Matplotlib.

fig, ax = plt.subplots()
groups = midwest.groupby('state')
min_popdensity, max_popdensity = midwest['popdensity'].min(), midwest['popdensity'].max()
for name, group in groups:
    for data_point in group.itertuples():
        # Scale the marker size linearly from 1 (lowest popdensity) to 13 (highest)
        ax.plot(data_point.area, data_point.poptotal, marker='o', linestyle='',
                ms=1 + 12 * ((data_point.popdensity - min_popdensity) / (max_popdensity - min_popdensity)),
                label=name)
ax.set_ylim((0, 500000))

This produces a plot close to the goal plot.
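The per-point loop above can usually be avoided, because `Axes.scatter` accepts an array for `s` (marker area in points squared), so one call per state is enough. The following is only a sketch under assumptions: `demo` is a small made-up frame standing in for midwest, and `scale_sizes` with its 10 to 200 range is a hypothetical helper, not something taken from the original plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the midwest data (same column names, random values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "state": np.repeat(["IL", "IN", "MI"], 20),
    "area": rng.uniform(0.005, 0.1, 60),
    "poptotal": rng.uniform(1_000, 500_000, 60),
    "popdensity": rng.uniform(100, 80_000, 60),
})


def scale_sizes(values, smallest=10.0, largest=200.0):
    """Linearly map values onto marker areas between smallest and largest."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values equal: use a single mid-range size
        return np.full(values.shape, (smallest + largest) / 2)
    return smallest + (largest - smallest) * (values - values.min()) / span


# Compute sizes once over the whole frame so they are comparable across states
demo["marker_area"] = scale_sizes(demo["popdensity"])

fig, ax = plt.subplots()
for name, group in demo.groupby("state"):
    # One scatter call per state: colour comes from the group, size from the array
    ax.scatter(group["area"], group["poptotal"], s=group["marker_area"],
               label=name, alpha=0.7)
ax.set_xlim(0, 0.1)
ax.set_ylim(0, 500000)
ax.legend(title="state")
```

With the real data, replacing `demo` by `midwest` should be all that is needed, although the size range would probably need tuning by eye.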

Questions

  1. How do we size each marker by the popdensity of its point without the heavy lifting of plotting every point individually?
  2. How do we add the smooth line shown in the ggplot visualisation?
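For the second question, one possible approach is to compute a locally weighted regression curve and draw it with `ax.plot`. The sketch below is a simplified stand-in for R's loess, not an exact replica: it fits local straight lines with tricube weights and has no robustness passes, whereas R's `loess` fits local quadratics by default. `local_linear_smooth` is a hypothetical helper and the data are made up. If statsmodels is installed, its `lowess` function, or seaborn's `regplot(..., lowess=True)`, may be a more convenient route.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np


def local_linear_smooth(x, y, frac=0.75, grid_size=100):
    """Locally weighted linear regression evaluated on an even grid over x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Number of nearest neighbours that take part in each local fit
    k = min(len(x), max(2, int(np.ceil(frac * len(x)))))
    grid = np.linspace(x.min(), x.max(), grid_size)
    fitted = np.empty(grid_size)
    for i, x0 in enumerate(grid):
        dist = np.abs(x - x0)
        # Bandwidth is the distance to the k-th nearest point
        bandwidth = max(np.sort(dist)[k - 1], np.finfo(float).eps)
        # Tricube weights: near points count most, points beyond the bandwidth get zero
        weights = np.clip(1.0 - (dist / bandwidth) ** 3, 0.0, None) ** 3
        root_w = np.sqrt(weights)
        # Weighted least squares for the local line y = a + b * (x - x0)
        design = np.column_stack([np.ones_like(x), x - x0]) * root_w[:, None]
        coef, *_ = np.linalg.lstsq(design, y * root_w, rcond=None)
        fitted[i] = coef[0]  # the intercept is the fitted value at x0
    return grid, fitted


# Made-up data with the same column meaning as the midwest frame
rng = np.random.default_rng(1)
area = rng.uniform(0.005, 0.1, 80)
poptotal = 3_000_000 * area + rng.normal(0, 20_000, 80)

grid, trend = local_linear_smooth(area, poptotal)

fig, ax = plt.subplots()
ax.scatter(area, poptotal, s=20, alpha=0.6)
ax.plot(grid, trend, color="blue", linewidth=2)  # the smooth line
```

The `frac` default of 0.75 mirrors the default span used by R's loess. Smaller values follow the data more closely.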

Additional information

Here is the head of the dataframe midwest.

PID county  state   area    poptotal    popdensity  popwhite    popblack    popamerindian   popasian    ... percollege  percprof    poppovertyknown percpovertyknown    percbelowpoverty    percchildbelowpovert    percadultpoverty    percelderlypoverty  inmetro category
0   561 ADAMS   IL  0.052   66090   1270.961540 63917   1702    98  249 ... 19.631392   4.355859    63628   96.274777   13.151443   18.011717   11.009776   12.443812   0   AAR
1   562 ALEXANDER   IL  0.014   10626   759.000000  7054    3496    19  48  ... 11.243308   2.870315    10529   99.087145   32.244278   45.826514   27.385647   25.228976   0   LHR
2   563 BOND    IL  0.022   14991   681.409091  14477   429 35  16  ... 17.033819   4.488572    14235   94.956974   12.068844   14.036061   10.852090   12.697410   0   AAR
3   564 BOONE   IL  0.017   30806   1812.117650 29344   127 46  150 ... 17.278954   4.197800    30337   98.477569   7.209019    11.179536   5.536013    6.217047    1   ALU
4   565 BROWN   IL  0.018   5836    324.222222  5264    547 14  5   ... 14.475999   3.367680    4815    82.505140   13.520249   13.022889   11.143211   19.200000   0   AAR

And, here is the ggplot2 code being used in the original post.

options(scipen=999)  # turn-off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw())  # pre-set the bw theme.
data("midwest", package = "ggplot2")


# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle="Area Vs Population", 
       y="Population", 
       x="Area", 
       title="Scatterplot", 
       caption = "Source: midwest")

plot(gg)

EDIT

I do not know whether the question will be reopened (it is marked as a duplicate). In the meantime, here is a pandas-only answer that works reasonably well.

fig, ax = plt.subplots()
groups = midwest.groupby('state')
colors = ['b','g','r','y','k']
for i, (name, group) in enumerate(groups):
    group.plot(kind='scatter', x='area', y='poptotal', xlim=(0, 0.1), ylim=(0, 500000),
               s=10 + group['popdensity'] * 0.01, label=name, ax=ax, color=colors[i])
lgd = ax.legend(numpoints=1)
# Give every legend marker the same size (the attribute was legendHandles before matplotlib 3.7)
for handle in lgd.legend_handles:
    handle.set_sizes([100.0])
ax.set_ylim((0, 500000))

Edit 2

The following answer mentioned in the comments by https://stackoverflow.com/users/3707607/ted-petrou seems to solve the problem using seaborn.

sizes = [10, 40, 70, 100, 130] 
marker_size = pd.cut(4*midwest['popdensity'], [0, 20000, 40000, 60000, 80000, 1000000], labels=sizes) 
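# x and y are passed as keywords because recent seaborn versions no longer accept them positionally
sns.lmplot(x='area', y='poptotal', data=midwest, hue='state', fit_reg=False, scatter_kws={'s': marker_size})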
plt.ylim((0, 500000))
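To see what `pd.cut` contributes in the snippet above, here is a tiny sketch with made-up density values (the `density` series is an assumption for illustration, while the bins and labels are the ones used above). Each value is assigned the label of the right-closed bin it falls into.

```python
import pandas as pd

# Made-up values, one per bin of interest
density = pd.Series([500.0, 25000.0, 61000.0, 90000.0])

sizes = [10, 40, 70, 100, 130]
bins = [0, 20000, 40000, 60000, 80000, 1000000]

# Bins are right-closed: (0, 20000], (20000, 40000], ...
marker_size = pd.cut(density, bins, labels=sizes)
print(marker_size.tolist())  # [10, 40, 100, 130]

# The result is categorical; convert it if plain numbers are needed
numeric_sizes = marker_size.astype(float)
```

A value of exactly 0 (or anything outside the outer edges) falls in no bin and becomes NaN, which is worth checking before passing the result on as marker sizes.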

Nipun Batra
  • Please tell us in how far setting the `s` argument to `plt.scatter` does not do the expected. E.g. in [this example](https://matplotlib.org/examples/pylab_examples/scatter_demo2.html) or [this question](https://stackoverflow.com/questions/14827650/pyplot-scatter-plot-marker-size). – ImportanceOfBeingErnest Jun 28 '17 at 12:41
  • @ImportanceOfBeingErnest One could go down to low level matplotlib and for each point. My edit shows the same. However, I doubt if this is the best approach. – Nipun Batra Jun 28 '17 at 12:58
  • I believe you just need to set s = df['popdensity'] – mauve Jun 28 '17 at 13:01
  • `sizes = [10, 40, 70, 100, 130] marker_size = pd.cut(midwest['popdensity'], [0, 20000, 40000, 60000, 80000, 1000000], labels=sizes) sns.lmplot('area', 'poptotal', data=midwest, hue='state', fit_reg=False, scatter_kws={'s':marker_size})` – Ted Petrou Jun 28 '17 at 13:17
  • Thanks @TedPetrou. This seems to be the best answer. Till the time the question is marked as duplicate, I'll paste the answer above. Happy to accept your answer if the duplicate flag is removed. – Nipun Batra Jun 28 '17 at 13:24
  • @mauve Thanks. Added your suggestion. – Nipun Batra Jun 28 '17 at 13:25
  • No problem. Question should be reopened. – Ted Petrou Jun 28 '17 at 13:25
  • No, sorry, question is still a duplicate. I added respective solution which use pandas and seaborn. The key is always to use the scatter's size argument. – ImportanceOfBeingErnest Jun 28 '17 at 13:26
  • @ImportanceOfBeingErnest Sure, I am happy to accept this question as a duplicate now. Thanks a ton for keeping the site running amazingly well and putting all the efforts for community convenience. Kudos! – Nipun Batra Jun 28 '17 at 13:33
  • This question isn't a duplicate at all. When you have to link to 5 different questions that each answer a portion of it, then it's not a duplicate. This question requires the scatter_kws argument to customize marker size and pd.cut to create the correct categories, and it needs an extra legend based directly on marker size. The original R plot has not been replicated by any of these. Not to mention that a regression line is missing as well. My answer doesn't even do it. @ImportanceOfBeingErnest – Ted Petrou Jun 28 '17 at 13:37
  • The question uses 4 different attempts to get the marker size as a function of another quantity. One may therefore consider this a duplicate of 4 different questions. An answer does not need `pd.cut`; this is just something you may use if you want. An alternative to closing it as a duplicate of many questions is to close it as too broad, as the question is not specific enough for a single answer. If the questioner, now that they know how to set the size argument, has further problems implementing that to reproduce the ggplot plot, they're free to ask a new **very specific** question about it. – ImportanceOfBeingErnest Jun 28 '17 at 13:47
  • @ImportanceOfBeingErnest: How about I edit the title to make this plot in Seaborn/Pandas? I think the question would then be specific enough? – Nipun Batra Jun 28 '17 at 14:13
  • Changing the title alone is not sufficient to reopen this question, I would say. I guess you may edit the question to make it more specific. That would include statements about those other questions. Ignoring the existence of those duplicates does not make this a non-duplicate. So a valid question would need to link to them and say how using their answers is still not giving you the desired result. (Which is hard, I guess, because they actually do give you the plot you want, apart from the KDE line; however, asking two questions at once is off-topic anyway.) – ImportanceOfBeingErnest Jun 28 '17 at 14:31