
I'm working with Gaia astrometric data from Data Release 3 (DR3), and hvplot/datashader seems to be the go-to for visualizing large datasets thanks to its fast render times and interactivity. In every example I've seen, rendering an image from hundreds of millions of data points takes a few seconds on the slow end. However, when I run the same kind of code on my data, it takes hours for any image to render at all.

For context, I'm running this code on a very large research computing cluster with hundreds of gigabytes of RAM, a hundred or so cores, and terabytes of storage at my disposal, so computing power should not be an issue here. Additionally, I've converted the data I need into a series of parquet files that are read into a Dask DataFrame with glob. My code is as follows:

...

import dask.dataframe as dd
import hvplot.dask   # registers the .hvplot accessor on Dask DataFrames
import colorcet as cc  # provides the cc.fire colormap used below
import glob

# read every parquet file under myfiles/ into one Dask DataFrame
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')
df = df.astype('float32')
df = df[['col1', 'col2']]

# rasterize=True hands the aggregation off to Datashader, so only an image is rendered
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)

...

Does anybody have any idea what the issue could be here? Any help would be appreciated.

Edit: I've gotten the rendering times below an hour by consolidating the data into a smaller number of larger files (3386 -> 175).
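
For anyone doing the same consolidation, this is roughly how it can be done with Dask itself (just a sketch; the myfiles_consolidated output path and the target of 175 partitions are placeholders matching the numbers above):

import glob
import dask.dataframe as dd

# read the original collection of many small parquet files
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')

# combine into fewer, larger partitions and write them back out
df = df.repartition(npartitions=175)
df.to_parquet('myfiles_consolidated', engine='fastparquet')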

Rukonian
  • If you have hundreds of cores available in a cluster, you'll want to configure dask to use them. See the docs for dask.distributed. Without setting up workers on other nodes, you'll just run on a single node, ignoring all the processing power available. – James A. Bednar Jan 16 '23 at 05:50
  • @JamesA.Bednar That was exactly the problem; fixed and now rendering 1.5 billion points in a few seconds. Thank you very much. – Rukonian Jan 16 '23 at 22:21
  • Excellent! If you get a chance, please update your example to show the working version, as a reference for others. – James A. Bednar Jan 19 '23 at 03:37
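
For reference, a minimal sketch of the kind of dask.distributed setup suggested in the comments above (the scheduler address is a placeholder; on an HPC cluster the scheduler and workers would typically be launched through the site's job system, e.g. with dask-jobqueue):

from dask.distributed import Client
import dask.dataframe as dd
import hvplot.dask
import colorcet as cc
import glob

# connect to an already-running scheduler so the computation is spread across
# the cluster's worker nodes rather than running on a single machine
client = Client('tcp://scheduler-address:8786')

df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet',
                     columns=['col1', 'col2'])
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)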

1 Answer


This is hard to debug without access to the data, but one quick optimization you can implement is to avoid loading all of the data by selecting only the specific columns of interest at read time:

df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1', 'col2'])

Unless it's crucial, I'd also avoid doing .astype. It shouldn't be a bottleneck, but the gains from casting to float32 might not be relevant if memory isn't a constraint.

SultanOrazbayev
  • Does filtering when reading in offer an advantage over filtering the dataframe in later processing steps? – Rukonian Jan 15 '23 at 16:58
  • AFAIU, in some edge cases it might have a big impact. In most cases I would expect the impact to be minimal. Take a look at column pruning section here: https://www.coiled.io/blog/parquet-file-column-pruning-predicate-pushdown – SultanOrazbayev Jan 15 '23 at 17:04
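
To illustrate the difference being discussed, here is a small sketch of the two approaches, plus the related predicate pushdown via the filters argument (the column names and the 0.5 threshold are made up):

import glob
import dask.dataframe as dd

# column pruning at read time: only col1 and col2 are read from the parquet files
df_pruned = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet',
                            columns=['col1', 'col2'])

# selecting after the read: expressed as a separate step, although Dask can
# often push the selection down during optimization, so the difference is small
df_later = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')[['col1', 'col2']]

# predicate pushdown: row groups whose statistics rule out the filter are skipped
df_filtered = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet',
                              columns=['col1', 'col2'],
                              filters=[('col1', '>', 0.5)])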