1

I have a dataset with about 320k records. Of these, I want to display a swarmplot with the top 20 entities of the category in the x axis (Refined_ID in this case) by their count. How can one achieve that? For example, if my data is:

Refined_ID Refined_Age Name
e123        21         foo1
f123        19         bar1
z123        26         foo2
f123        29         bar2
e123        20         foo1
e1342       19         bar3
f123        20         foo3

I would like my x-axis to be ordered as:

e123 f123 z123   

This is my code:

g = sns.swarmplot(x = dfAnalysis['Refined_ID'].iloc[:20],y = dfAnalysis['Refined_Age'], hue = dfAnalysis['Name'], orient="v")
g.set_xticklabels(g.get_xticklabels(),rotation=30)

As the dataframe is quite large, am limiting the view to first 20 rows for testing.

UPDATE 1

Assuming there isn't a way to dynamically sort the axes in seaborn, this is what I want my output to look like:

Refined_ID Refined_Age     Name   Count_of_Refined_ID
    e123        21         foo1     2
    f123        19         bar1     3
    z123        26         foo2     1
    f123        29         bar2     3
    e123        20         foo1     3
    e1342       19         bar3     1
    f123        20         foo3     3

From this dataframe, I would then want to plot the top two Refined_IDs based on their count. In this case, those two categories will be e123 and f123. The plot will have:

x-axis: Refined ID (e123 and f123)
y-axis: Refined_Age (0 to 30)
Hue: Based on Name
Marcus Campbell
  • 2,746
  • 4
  • 22
  • 36
kurious
  • 1,024
  • 10
  • 29
  • 1
    Sort your values before you graph them. http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.sort_values.html – seanmus Feb 23 '16 at 21:04
  • I want the x axis categorical variable sorted by its count. Also, as two categorical variables are involved, I am unsure about how to include them in the same dataframe if I aggregate a variable by its count. – kurious Feb 23 '16 at 21:43
  • Perhaps this is an approach: 1. Add the count to this dataframe 2. Create a new dataframe with top values of interest 3. Plot the new dataframe It'd be really nice to have a shorter way though – kurious Feb 23 '16 at 22:32
  • No idea what you're doing here. How do you count 3 instances of e1342? I only see one of them. And now you want only two categories, that's a new requirement? – Stop harming Monica Feb 23 '16 at 22:38
  • Thank you for pointing out the typo. Corrected it. In the sample output, displaying 2 categories out of the 4 is analogous to displaying the top 20 from the entire database. Hope the context is a little clearer now. – kurious Feb 23 '16 at 22:55
  • The count for e123 is also wrong. Now if you want the dataframe sorted by the actual counts, that's in my answer. If that's not what you want I am at a lost. – Stop harming Monica Feb 23 '16 at 23:17

1 Answers1

1

Is this what you want?

counts = df['Refined_ID'].value_counts()
ix = (df['Refined_ID'].apply(lambda x: counts[x])
      .sort_values(ascending=False).index)
df.reindex(ix)

  Refined_ID  Refined_Age  Name
6       f123           20  foo3
3       f123           29  bar2
1       f123           19  bar1
4       e123           20  foo1
0       e123           21  foo1
5      e1342           19  bar3
2       z123           26  foo2
Stop harming Monica
  • 12,141
  • 1
  • 36
  • 56
  • This answer works. I just did minor tweaks for my purposes. Specifically, added a separate column named 'Counts' and copied everything into a new dataframe from which to plot the graph. – kurious Feb 23 '16 at 23:55
  • This code runs pretty well with small dataset. It's taking several hours with all the records. The follow-up question is at http://stackoverflow.com/questions/35591262/looking-for-a-faster-way-to-add-data-from-a-large-couchdb-database-into-pandas. Will be grateful if you can have a look. – kurious Feb 24 '16 at 06:40