
I want to run Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html, which says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried it by connecting to a distributed cluster and doing:

import dask.array as da
import dask.dataframe as dd

x = da.from_array(i, chunks=100000)  # i is the input numpy array
t = model.predict(x)
t = client.persist(t)
df = dd.from_array(t)
df.to_parquet("xy.parquet")

However, this does not trigger any computation on the cluster (observed with the dashboard), and when to_parquet computes it runs my 1TB RAM machine into a memory error, even for a test where the numpy size of x and t is 7GB. Anything else I submit to the cluster is computed there. So how do I save the results of the prediction?
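
For completeness, this is roughly how I would check that the prediction is still a lazy dask array and how it is chunked before writing anything (just a sanity check, not something from the linked page):

# Sanity check: the prediction should still be a lazy dask array whose
# blocks map to tasks on the cluster, not a concrete numpy array.
print(type(t))      # expect dask.array.core.Array, not numpy.ndarray
print(x.chunks)     # row chunking of the input
print(t.numblocks)  # number of blocks that should become cluster tasks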

EDIT: This seems to be an issue with the size of the input x, which has the shape (24507731, 8). If I instead just throw in random data with the shape (24507, 8), the computation finishes. This is quite surprising, as ParallelPostFit is supposed to make prediction on large data possible in the first place.
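
For reference, the pattern I am trying to follow from the linked page looks roughly like this (a sketch based on my reading of the dask-ml docs; the ParallelPostFit wrapper and the column name "prediction" are my own additions, and model/i are the same objects as above):

# Sketch of the ParallelPostFit pattern from the linked dask-ml example.
# Assumes `model` is an already-fitted scikit-learn estimator and `i` is
# the (24507731, 8) input array; all names are illustrative.
import dask.array as da
import dask.dataframe as dd
from dask_ml.wrappers import ParallelPostFit

clf = ParallelPostFit(estimator=model)        # block-wise prediction wrapper
x = da.from_array(i, chunks=(100000, 8))      # chunk along rows only
t = clf.predict(x)                            # lazy dask array, one task per block
ser = dd.from_dask_array(t, columns="prediction")
ser.to_frame().to_parquet("xy.parquet")       # each worker writes its own partitions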

