
I'm trying to train an MLlib random forest regression model using the RandomForest.trainRegressor API.

After training, when I try to save the model, the resulting model folder is 6.5 MB on disk, but the data folder contains 1,120 small Parquet files that seem unnecessary and are slow to upload to / download from S3.

Is this the expected behavior? I'm already repartitioning the labeled points into a single partition, but this happens regardless.
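For concreteness, a minimal sketch of the kind of job described above, assuming the Spark 1.x Scala MLlib API; the synthetic data, tree parameters, and S3 path are placeholders, not the original code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    val sc = new SparkContext(new SparkConf().setAppName("rf-train"))

    // Tiny synthetic training set; a real job would load its own data.
    val labeledPoints = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0)),
      LabeledPoint(3.0, Vectors.dense(1.5, 0.5))
    ))

    val model = RandomForest.trainRegressor(
      labeledPoints.repartition(1), // a single input partition...
      Map[Int, Int](),              // no categorical features
      numTrees = 50,
      featureSubsetStrategy = "auto",
      impurity = "variance",
      maxDepth = 5,
      maxBins = 32)

    // ...yet the saved data/ folder can still contain many small Parquet part files.
    model.save(sc, "s3://bucket/models/rf")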


1 Answer


Repartitioning with rdd.repartition(1) before training does not help much. It can make training slower, because every parallel operation becomes effectively sequential; Spark's parallelism is based on partitions.

Instead, I came up with a simple hack: set spark.default.parallelism to 1, since the save procedure uses the sc.parallelize method to create the data it writes out.
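A minimal sketch of that hack; the app name is a placeholder, and training proceeds as in the question:

    import org.apache.spark.{SparkConf, SparkContext}

    // Force default parallelism to 1 before creating the context, so that
    // the sc.parallelize call inside model.save yields a single partition,
    // and hence a single part file per saved dataset.
    val conf = new SparkConf()
      .setAppName("rf-train-save")
      .set("spark.default.parallelism", "1")
    val sc = new SparkContext(conf)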

Keep in mind that this will affect many other places in your application, such as groupBy and join. My suggestion is to extract the train-and-save step into a separate application and run it in isolation, as sketched below.
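One way to follow that suggestion is a dedicated job whose context is the only one running with parallelism 1; the object name, data, parameters, and path below are all placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    // A standalone train-and-save job: parallelism 1 applies only to this
    // application, leaving groupBy/join in the main app untouched.
    object TrainAndSaveModel {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("train-and-save-model")
          .set("spark.default.parallelism", "1")
        val sc = new SparkContext(conf)

        // Placeholder data; a real job would load its training set here.
        val labeledPoints = sc.parallelize(Seq(
          LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
          LabeledPoint(0.0, Vectors.dense(2.0, 1.0))
        ))

        val model = RandomForest.trainRegressor(labeledPoints,
          Map[Int, Int](), numTrees = 50, featureSubsetStrategy = "auto",
          impurity = "variance", maxDepth = 5, maxBins = 32)

        model.save(sc, "s3://bucket/models/rf")
        sc.stop()
      }
    }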
