I am looking for some help/advice on using datashader to plot a large 2D data array as a series of points, colored by amplitude.

The data I deal with is housed in several 2D HDF5 datasets, with a time index stored in a separate dataset. The second dimension of the data is a spatial dimension (distance in m), which is a non-uniformly stepped series of floats. The datasets are typically very large (~1000 x >1000000), so I would like to use dask to handle the construction of an out-of-core dataframe, where the y-location of the data is stored as the column header, the x-location is the frame index, and the points are color-mapped to the data value.

The problem comes when I want to plot this in datashader from the dask dataframe. Currently, the only way I've found is to flatten the dataframe and create two corresponding 'x' and 'y' columns to house the index and y-locations. Can anyone help me understand whether this plotting is possible without the step of flattening the data?
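To make the flattening step concrete, here is a minimal sketch on a tiny made-up array. The names `values`, `time_index` and `distance` are hypothetical stand-ins rather than the real HDF5 datasets, and the sketch uses plain NumPy in memory, so it only illustrates the target layout and not an out-of-core solution.

```python
import numpy as np

# Hypothetical stand-in for one dataset: 3 time steps x 4 spatial positions.
values = np.arange(12, dtype=float).reshape(3, 4)
time_index = np.array([0, 1, 2])             # frame index (x-location)
distance = np.array([0.0, 0.5, 1.5, 4.0])    # non-uniform column labels (y-location)

# Long format: one (x, y, z) triple per cell, in row-major order.
x = np.repeat(time_index, distance.size)     # each time step repeated once per column
y = np.tile(distance, time_index.size)       # column labels cycled once per row
z = values.ravel()                           # row-major flatten lines up with x and y

# One row per point, with columns x, y, z.
points = np.column_stack([x, y, z])
```

Every cell of the 2D array becomes one row, so the long-format table has three columns and as many rows as the original array has cells. That growth is why the flattening is costly for the full-size data.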
This is an example of what I have done thus far:
import datashader as ds
import datashader.transfer_functions as tf
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
import bokeh.plotting as bk
from bokeh.palettes import viridis
from datashader.bokeh_ext import InteractiveImage
bk.output_notebook()
# ------------------------
# This is a proxy for a function, which creates a delayed frame from
# a series of delayed pandas dataframes, each reading from a separate
# h5 dataset.
random_data = da.random.random((10000, 1000), chunks = (1000, 100))
frame = dd.from_array(random_data)
# ------------------------
# ------------------------
# Flatten the dataframe and create two additional arrays holding the x and y
# locations.
a = frame.compute() # I want to avoid this call on the whole dataframe
index = [a.index] * len(a.columns)
index = np.vstack(index).reshape((-1), order = 'F')
columns = [a.columns] * len(a.index)
columns = [item for sublist in columns for item in sublist]
data = a.values.flatten()
# ------------------------
# Now creating an in-memory frame for the data
plot_frame = pd.DataFrame({
    'x': index,
    'y': columns[::-1],  # Reverse column order to plot
    'z': data,
})
# ------------------------
x_range = [a.index[0], a.index[-1]]
y_range = [a.columns[0], a.columns[-1]]
def create_image(x_range = x_range, y_range = y_range[::-1], w=500, h=500):
    cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
    agg = cvs.points(plot_frame, 'x', 'y', ds.mean('z'))
    return tf.shade(agg, cmap = viridis(256))

def base_plot(tools='pan,wheel_zoom,reset, box_zoom, save'):
    p = bk.figure(x_range = x_range, y_range = y_range, tools=tools,
                  plot_width=900, plot_height=500, outline_line_color=None,
                  min_border=0, min_border_left=0, min_border_right=0,
                  min_border_top=0, min_border_bottom=0, x_axis_type = 'datetime')
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p

p = base_plot()
InteractiveImage(p, create_image)
Can anyone recommend a method for handling this more effectively through the datashader pipeline?
Thanks in advance!