plotnine doesn't add legend

Question

I'm using plotnine to plot two graphs in the same plot. One graph uses the 'b' values from the dataframe you'll see below, and the other uses the values from 'c'.

All I need is a simple legend showing 'c' and 'b' with their corresponding colors.

def plot_log_detected():
    df = DataFrame({'x': [1, 2, 3, 4, 5],
                    'b': >>>SOME VALUES DOESNT MATTER<<<,
                    'c': >>>SOME VALUES DOESNT MATTER<<<
                   })
    return ggplot(aes(x='x', y='b'), data=df) + geom_point(size=1) +\
           geom_line(aes(y='b'), color='black') + \
           geom_line(aes(y='c'), color='blue') +  \
           ggtitle("TITLE") + \
           labs(y="Y AXIS", x="X AXIS")

When you post a question, try to make sure your example is runnable. In this case you would include some made-up numbers instead of `>>>SOME VALUES DOESNT MATTER<<<`. This makes it easy for others to take your example and straight away focus on the problem, instead of making up data. — has2k1, Apr 13 '20 at 16:05

chemdork123 · Accepted Answer · 2020-04-12T23:39:23.140

This won't show a legend if you use ggplot2 in R either: a color legend is only drawn when you specify color= inside aes() for a geom. The "fix" is the same in Python (plotnine) and in R (ggplot2): organize your data so that it follows tidy data principles. In this case, the df$b and df$c columns each carry two pieces of information: (1) the value of "y" and (2) the type of "y". You should reorganize the data so that your columns become: x, type_of_y, and value_of_y.

I'll explain by filling in a dataset like the one you presented, then show how to change it to a tidy format, then how you can (properly) write the code to produce the plot I believe you want.

The Basics

Here's a dataset and a plot like your plot (again, it's in r... So I hope you can translate into python):

df <- data.frame(
    x=c(1:5), b=c(10, 12, 14, 9, 8), c=c(9, 11, 11, 12, 14))

ggplot(df, aes(x=x)) +
    geom_line(aes(y=b), color='red') +
    geom_line(aes(y=c), color='blue')

No legend, but the colors are there and we plot what you would expect. The catch is that ggplot only draws a color legend when color is specified inside the aes() call. To see this clearly, let's do the same plot, but move the color=... inside aes():

ggplot(df, aes(x=x)) +
    geom_line(aes(y=b, color='red')) +
    geom_line(aes(y=c, color='blue'))

Ok that's... wait. What? It has a legend now (because we put color inside aes()), but the colors are reversed in order and... you'll notice the colors are not red and blue, but the default "reddish" and "teal" colors of ggplot2. What actually happened is that each geom_line call still plotted the correct data, but the string inside aes() only "titled" it: the first series is labeled "red" and the second "blue". ggplot then chose the colors from its default palette, handing them out to the labels in alphabetical order, which is why "blue" ends up with the reddish color and "red" with the teal one.

Getting Your Legend Without Tidy Data

If you don't want to mess with your data, there is actually a way to do this and probably get an output you might be satisfied with. We just have to indicate in color= the name you want to call that series.

ggplot(df, aes(x=x)) +
    geom_line(aes(y=b, color='b')) +
    geom_line(aes(y=c, color='c'))

What about just adding another color='blue' to get a "blue" color outside the aes() as well as inside? Well... that doesn't work. If you do this, for example, the result is identical to the original plot shown (with no legend, but correct color values), since the aes() is effectively overwritten in each geom_line call:

# this doesn't work to keep legend and desired color, the second
# color outside aes() overwrites the one inside aes()
ggplot(df, aes(x=x)) +
    geom_line(aes(y=b, color='b'), color='red') +
    geom_line(aes(y=c, color='c'), color='blue')

The Tidy Data Way (The "correct" way)

While the above method works, it goes against the general principles of Tidy Data and how to organize your data so that it's easy to analyze... in ANY WAY you want to. Trust me: it's definitely the best practice moving forward for working with any dataset for versatility of analysis, and almost always worth the effort to organize your data in that way.

ggplot wants you to specify aes() parameters as columns in your dataset. That means we should make each column serve a specific purpose in your dataset as such:

x: this is the same x as in the original dataset. It represents only the x-axis value.
type_of_y: this column contains either 'b' or 'c', indicating which data series the value belongs to.
value_of_y: this column contains the value you would plot on y.

Using gather() from tidyr (with the %>% pipe, which dplyr also provides), we can reorganize the data in this way pretty simply:

df <- df %>% gather('type_of_y', 'value_of_y', -x)

Giving you:

   x type_of_y value_of_y
1  1         b         10
2  2         b         12
3  3         b         14
4  4         b          9
5  5         b          8
6  1         c          9
7  2         c         11
8  3         c         11
9  4         c         12
10 5         c         14
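
In Python, the closest counterpart to this reshaping step is pandas' melt. A minimal sketch, assuming pandas is installed; the column names type_of_y and value_of_y are simply carried over from the R example:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'b': [10, 12, 14, 9, 8],
                   'c': [9, 11, 11, 12, 14]})

# Keep 'x' as the identifier; every other column is stacked into
# a (type_of_y, value_of_y) pair, one row per original cell.
tidy = df.melt(id_vars='x', var_name='type_of_y', value_name='value_of_y')
print(tidy)
```

The result has the same ten rows as the R output above, with pandas' 0-based index instead of R's 1-based row names.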

Then you plot accordingly, using only one geom_line call and apply the color aesthetic to type_of_y. Something like this:

ggplot(df, aes(x=x, y=value_of_y)) +
    geom_line(aes(color=type_of_y))

In this way, you only have to specify one geom_line call. Might not seem too different here, but what if you had multiple columns in your original dataset? Take the case, for example, of having "x", then y values for "a", "b", "c"... "z"! You would have to specify all those lines in separate calls to geom_line! In the case above, no matter how many different y value columns you had... you only have the same two lines of code and only one call to geom_line. Make sense? For more information, I would suggest the link from above. Also, this article is a great read.

You can then assign specific colors by adding scale_color_manual and specifying the colors that way (there's a few other ways too) - but if you need assistance there, I would ask in a separate question. Also... not sure how the code differs for python. Similarly, you can change title of legend via labs(color="your new legend title")... among other theme changes.

I know it is not quite the same code in python, but that should be enough for you to figure out how to do it similarly there.

I don't really understand the first line, I did use color= in an aes for a geom, also the tidy data part (which I *will* read shortly) is a bit confusing for me, let's take for example df['b'], what do you mean by reorganizing the data in x, and what do you mean by separating it to type_of_y and value_of_y, (I think the best way to explain it to me would be a very short example of a representative DataFrame), also thanks for the effort to answer this although you use R :) — drdisrespect, Apr 12 '20 at 22:23
I just did a significant edit to the original answer with a complete example. Hopefully this explains exactly how to go about adjusting your data and plotting the "correct" way using tidy data. With that being said, it IS still possible to get a proper legend by making a small change to your original code, but it's bad practice and gets really awkward and repetitive when you have many different data series in your original data. Again... yeah, it's not python, but your question is really dealing with principles of `ggplot` that should be universal for all of us :) — chemdork123, Apr 12 '20 at 23:33
In python, it gives me the error plotnine.exceptions.PlotnineError: "Could not evaluate the 'color' mapping: 'c' (original error: name 'c' is not defined)" when I specify any color inside aes. — kkgarg, Jul 14 '20 at 17:03

score 0 · Answer 2 · answered Apr 13 '20 at 04:50

You can melt your data frame to combine columns 'b' and 'c' into one column and create an aesthetic column 'color' for coloring and the legend. Here is the code and output. Note that I used the original dataframe for the point plot (since you only plot column 'b' there) and the melted dataframe for the line plot:

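from pandas import DataFrame
from plotnine import ggplot, aes, geom_point, geom_line, ggtitle, labs

def plot_log_detected():
    df = DataFrame({'x': [1, 2, 3, 4, 5],
                    'b': [1, 2, 3, 4, 5],
                    'c': [1, 3, 2, 5, 4]
                   })

    # Wide -> long: 'color' holds the series name, 'b_and_c' holds the values
    df_melt = df.melt(id_vars=['x'], value_vars=['b','c'], var_name='color', value_name='b_and_c')

    # Data first, then mapping (the order current plotnine expects).
    # Points come from the original frame; lines and legend from the melted one.
    return ggplot(df, aes(x='x', y='b')) + geom_point(size=1) +\
           geom_line(aes(y='b_and_c', color='color'), data=df_melt) + \
           ggtitle("TITLE") + \
           labs(y="Y AXIS", x="X AXIS")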

Your original example dataframe looks like this:

   x  b  c
0  1  1  1
1  2  2  3
2  3  3  2
3  4  4  5
4  5  5  4

And your melted dataframe is:

   x color  b_and_c
0  1     b        1
1  2     b        2
2  3     b        3
3  4     b        4
4  5     b        5
5  1     c        1
6  2     c        3
7  3     c        2
8  4     c        5
9  5     c        4

And finally this is output image:

score 0 · Answer 3 · answered May 20 '21 at 13:04

def plot_log_detected():
    df = DataFrame({'x': [1, 2, 3, 4, 5],
                    'b': >>>SOME VALUES DOESNT MATTER<<<,
                    'c': >>>SOME VALUES DOESNT MATTER<<<
                   })
    plot = (
        ggplot(aes(x='x', y='b'), data=df) 
        + geom_point(size=1)
        + geom_line(aes(y='b', color='"black"')) # Put color in double quotes
        + geom_line(aes(y='c', color='"blue"'))  # Put color in double quotes
        + ggtitle("TITLE")
        + labs(y="Y AXIS", x="X AXIS")
        
        # Add color scale identity
        + scale_color_identity(
            guide='legend', 
            breaks=['black', 'blue'], 
            labels=['Label for black', 'Label for blue']))
    return plot

While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. — Adrian Mole, May 20 '21 at 13:54

score 0 · Answer 4 · edited Sep 07 '21 at 14:02

I have another solution, which uses melt to convert the data from wide format to long format. To generate a legend we need to supply a grouping column to the aesthetic mapping, so we use melt to create a label column (renamed to category) and pass it to plotnine's color argument.

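import pandas as pd
from pandas import DataFrame
from plotnine import ggplot, aes, geom_point, geom_line, ggtitle, labs

def plot_log_detected():
    df = DataFrame({'x': [1, 2, 3, 4, 5],
                    'b': [22,33,21,66,55],
                    'c': [44,11,22,77,55]
                   })

    # Wide -> long: melt's default output columns are 'variable' and 'value'
    long_data = pd.melt(df, id_vars=["x"], value_vars=["b", "c"])
    long_data = long_data.rename(columns = {'variable':'category'})

    # Data first, then mapping (the order current plotnine expects)
    return ggplot(long_data, aes(x='x', y='value', color = "category")) +\
           geom_point(size=1) +\
           geom_line() + \
           ggtitle("TITLE") + \
           labs(y="Y AXIS", x="X AXIS")

plot_log_detected()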
