I'm trying to add a trendline to each of several traces in a Plotly scatter plot, and I'm stumped on how to do it.

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'], 
                         y=df_df['Height (meters)'], 
                         name='Douglas Fir', mode='markers')
             )
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'], 
                         y=df_wp['Height (meters)'],  
                         name='White Pine',mode='markers'),
             )
fig.update_layout(title="Tree Circumference vs Height (meters)",
                  xaxis_title=df_df['Circumference (meters)'].name,
                  yaxis_title=df_df['Height (meters)'].name,
                  title_x=0.5)

fig.show()

Trying to get something like this:

[image]

vestland
David 54321

2 Answers

1

You've already put together a procedure that solves your problem, but I would like to mention that you can use plotly.express and do the very same thing with only a few lines of code. With px.scatter() there are actually two slightly different ways to do it, depending on whether your data is in a long or a wide format. Your data seems to be in the latter format, since you're asking:

how can I make this work with separate traces?

So I'll start with that. And I'll use a subset of the built-in dataset px.data.stocks() since you haven't provided a data sample.

Code 1 - Wide data

fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT'],
                      trendline = 'ols',
                     )

Code 2 - Long data

fig_long = px.scatter(df_long, x= 'index', y = 'value',
                      color = 'variable',
                      trendline = 'ols')

Plot 1 - Identical results


About the data:

A dataframe of a wide format typically has an index with unique values in the left-most column, variable names in the column headers, and corresponding values for each variable per index in the columns like this:

index AAPL      MSFT
0     1.000000  1.000000
1     1.011943  1.015988
2     1.019771  1.020524
3     0.980057  1.066561
4     0.917143  1.040708

Here, adding information about another variable would require adding another column.

A dataframe of a long format, on the other hand, typically organizes the same data with three columns, though not necessarily only three: index, variable and value.

index  variable  value
0      AAPL      1.000000
1      AAPL      1.011943
.
.
100    MSFT      1.720717
101    MSFT      1.752239

And contrary to the wide format, this means that index will have duplicate values. But for a good reason.
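To make the two layouts concrete, here is a minimal sketch that converts a tiny wide frame into a long one with pd.melt. The three rows are copied from the sample table above; the frame itself is only an illustration, not the full dataset.

```python
import pandas as pd

# Tiny wide-format frame: one row per index, one column per variable.
# Values are the first three rows of the sample shown above.
df_wide = pd.DataFrame({
    "index": [0, 1, 2],
    "AAPL": [1.000000, 1.011943, 1.019771],
    "MSFT": [1.000000, 1.015988, 1.020524],
})

# melt keeps "index" as the identifier and stacks the remaining columns
# into two new columns, named "variable" and "value" by default.
df_long = pd.melt(df_wide, id_vars="index")

print(df_long)
```

Each index value now appears once per variable, which is the duplication described above.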

So what's the difference?

If you look at Code 1 you'll see that the only thing you need to specify for px.scatter in order to get multiple traces with trendlines, in this case AAPL and MSFT on the y-axis versus an index on the x-axis, is trendline = 'ols'. This is because plotly.express automatically identifies the data format as wide and knows how to apply the trendlines correctly. Different columns mean different categories, and a trace and a trendline are produced for each.

As for the "long approach", you've got both AAPL and MSFT in the same variable column, and values for both of them in the value column. But setting color = 'variable' lets plotly.express know how to categorize the variable column, correctly separate the data in the value column, and thus correctly produce the trendlines. A different name in the variable column means that index and value in that row belong to a different category, for which a new trace and trendline are built.

Any pros and cons?

Arguably the only advantage of the wide format is that it's easier to read (particularly for those of us damaged by too many years of sub-excellent data handling with Excel). One great advantage of the long format is that, if you have more categories, you can easily illustrate more dimensions of the data with, for example, different symbols or sizes for the markers.

Another advantage with the long format occurs if the dataset changes, for example with the addition of another variable 'AMZN'. Then the name and the values of that variable will occur in the already existing columns instead of adding another one like you would for the wide format. This means that you actually won't have to change the code in:

fig_long = px.scatter(df_long, x= 'index', y = 'value',
                      color = 'variable',
                      trendline = 'ols')

... in order to add the data to the figure.

While for the wide format, you would have to specify y = ['AAPL', 'MSFT', 'AMZN'] in:

fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT', 'AMZN'],
                      trendline = 'ols',
                     )

And I would strongly argue that this outweighs the slight inconvenience of specifying color = 'variable' in:

fig_long = px.scatter(df_long, x= 'index', y = 'value',
                      color = 'variable',
                      trendline = 'ols')

Plot 2 - A new variable:


Complete code

# imports (trendline = 'ols' also requires the statsmodels package to be installed)
import pandas as pd
import plotly.express as px

# data
df = px.data.stocks()
# df.date = pd.to_datetime(df.date)
df_wide = df.drop(['date', 'GOOG', 'AMZN', 'NFLX', 'FB'], axis = 1).reset_index()
# df_wide = df.drop(['date', 'GOOG', 'NFLX', 'FB'], axis = 1).reset_index()
df_long = pd.melt(df_wide, id_vars = 'index')
df_long

fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT'],
                      trendline = 'ols',
                     )

fig_long = px.scatter(df_long, x= 'index', y = 'value',
                      color = 'variable',
                      trendline = 'ols')

# fig_long.show()
fig_wide.show()
vestland
  • but how can I make this work with separate traces? – David 54321 Sep 02 '21 at 14:12
  • If you got time I'd love it if you could update it. Please! – David 54321 Sep 02 '21 at 23:55
  • @David54321 I'd be happy to. But first, could you just clarify what the *real* challenge is? Is it (1) that you have to use `go.Scatter()` or is it (2) that you've got your data organized in a dataframe with a wide format? If it's the latter, then the solution is very elegant. – vestland Sep 03 '21 at 06:57
0

Here's how I resolved it. I used NumPy's polyfit function to calculate the slope and intercept for each dataset, then added each fitted line to the figure as its own trace.

import numpy as np
import plotly.graph_objects as go

df_m, df_b = np.polyfit(df_df['Circumference (meters)'].to_numpy(), df_df['Height (meters)'].to_numpy(), 1)
wp_m, wp_b = np.polyfit(df_wp['Circumference (meters)'].to_numpy(), df_wp['Height (meters)'].to_numpy(), 1)

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'], 
                         y=df_df['Height (meters)'], 
                         name='Douglas Fir', mode='markers')
             )
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'], 
                         y=(df_m*df_df['Circumference (meters)'] + df_b),
                         name='douglas fir trendline',
                         mode='lines')
             )
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'], 
                         y=df_wp['Height (meters)'],  
                         name='White Pine',mode='markers'),
             )
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'], 
                         y=(wp_m * df_wp['Circumference (meters)'] + wp_b),
                         name='white pine trendline',
                         mode='lines')
             )
fig.update_layout(title="Tree Circumference vs Height (meters)",
                  xaxis_title=df_df['Circumference (meters)'].name,
                  yaxis_title=df_df['Height (meters)'].name,
                  title_x=0.5)

fig.show()


David 54321