Software versions
pandas: 1.3.3, datashader: 0.13.0, bokeh: 2.3.3, holoviews: 1.14.6
What I want to achieve/My current problem
I create scatter plots of categorical data with bokeh/holoviews. Some of the datasets are large, so I want to use datashader.
In many cases, however, my data is too sparse to look good (1672 points in this case), so I have to spread it. Even then it does not look good:
(Without spreading the data there are only about 9 pixels visible; I do not show a picture of this.)
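To make clear what I mean by "spreading", here is a minimal NumPy-only sketch. It is not datashader's implementation (which, as far as I understand, also supports circular shapes and composites values); it only grows each occupied pixel into a square neighbourhood. The function name spread_square and the 7x7 grid are made up for illustration.

```python
import numpy as np

def spread_square(mask, px=1):
    """Return a copy of `mask` where every True pixel is grown by `px` pixels
    in each direction (square neighbourhood). Hypothetical helper, not a
    datashader function."""
    h, w = mask.shape
    # Pad with False so that shifted views never wrap around the edges.
    padded = np.pad(mask, px, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together all shifted copies within the (2*px+1) x (2*px+1) window.
    for dy in range(2 * px + 1):
        for dx in range(2 * px + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

# A single occupied pixel in the middle of an otherwise empty image.
mask = np.zeros((7, 7), dtype=bool)
mask[3, 3] = True

spread1 = spread_square(mask, px=1)  # covers a 3x3 block
spread2 = spread_square(mask, px=2)  # covers a 5x5 block
print(spread1.sum(), spread2.sum())
```

With only a handful of occupied pixels, even a small px changes how much of the image is covered, which is why sparse data depends so heavily on the spreading step.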
For data this small it is also possible to use holoviews without datashader, and there the picture looks much better:
Following the ideas in "Datashader: categorical colormapping of GeoDataFrames", I tried to use
aggregator=ds.by(cat_color, ds.any())
instead of
aggregator=ds.by(cat_color)
in the datashade function.
The result is strange:
Without spreading, the result has the same strange olive background color, only more transparent.
Interestingly, this background color is not always the same.
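As I understand the two aggregators, ds.by(column) counts the points per pixel and category, while ds.by(column, ds.any()) only records whether a category is present in a pixel. The following NumPy-only sketch illustrates that difference on made-up pixel coordinates; it does not call datashader, and the toy arrays are hypothetical.

```python
import numpy as np

# Hypothetical toy data: pixel coordinates and a category index per point.
px = np.array([0, 0, 0, 2, 2])
py = np.array([1, 1, 1, 0, 0])
cat = np.array([0, 0, 1, 1, 1])  # two categories, 0 and 1

height, width, n_cats = 3, 3, 2

# Per-pixel, per-category counts (comparable in spirit to a categorical count).
counts = np.zeros((height, width, n_cats), dtype=np.uint32)
np.add.at(counts, (py, px, cat), 1)  # unbuffered add handles repeated indices

# Per-pixel, per-category presence (comparable in spirit to a categorical "any").
present = counts > 0

print(counts[1, 0])   # counts of both categories in pixel (y=1, x=0)
print(present[1, 0])  # the same pixel, reduced to presence flags
```

The count version keeps the information that one category dominates a pixel; the presence version treats all categories in that pixel equally, so the two aggregates can be colored quite differently.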
Reproducible code example
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
import datashader as ds
from datashader.colors import Sets1to3
from holoviews.operation.datashader import datashade,dynspread
raw_data = [('Alice', 60, 'London', 5),
            ('Bob', 14, 'Delhi', 7),
            ('Charlie', 66, np.nan, 11),
            ('Dave', np.nan, 'Delhi', 15),
            ('Eveline', 33, 'Delhi', 4),
            ('Fred', 32, 'New York', np.nan),
            ('George', 95, 'Paris', 11)
            ]
# Create a DataFrame object
df = pd.DataFrame(raw_data, columns=['Name', 'Age', 'City', 'Experience'])
df['City']=pd.Categorical(df['City'])
x='Age'
y='Experience'
color='City'
cats=df[color].cat.categories
# Make dummy-points (currently the only way to make a legend: https://holoviews.org/user_guide/Large_Data.html)
for cat in cats:
    # Just to make clear how many points of a given category we have
    print(cat, ((df[color] == cat) & (df[x].notnull()) & (df[y].notnull())).sum())
color_key=[(name,color) for name, color in zip(cats,Sets1to3)]
color_points = hv.NdOverlay({n: hv.Points([0,0], label=str(n)).opts(color=c,size=0) for n,c in color_key})
# Create the plot with datashader
points=hv.Points(df, [x, y],label="%s vs %s" % (x, y),)#.redim.range(Age=(0,90), Experience=(0,14))
datashaded1=datashade(points,aggregator=ds.by(color)).opts(width=550, height=480)
datashaded2=datashade(points,aggregator=ds.by(color,ds.any())).opts(width=550, height=480)
dynspread(datashaded1)*color_points+dynspread(datashaded2)*color_points
When you remove ds.any(), everything works more or less (there are some minor problems, as discussed in https://github.com/holoviz/holoviews/issues/5070), but with ds.any() the dynspread does not work at all. This problem is also present in my actual data, but I will probably just use spread, which works better. Is there a reason for this?
Is there something that I am missing?