
I am running RowMatrix.computeSVD in Scala. In the UI it appears that only one stage, the "treeAggregate", runs on the cluster; after that the application master's UI shows nothing while the application continues to execute computeSVD. So I am assuming that only the "treeAggregate" runs on the cluster and the rest runs on the driver.

Is there a way to make all of computeSVD run on the cluster? The driver normally has limited resources, and computeSVD takes a long time for a 9446 x 9446 matrix.
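
For reference, a minimal sketch of the setup (placeholder data and variable names, not my actual code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val sc: SparkContext = ??? // an existing SparkContext

// Placeholder data: a dense 9446 x 9446 matrix of random values.
val n = 9446
val rows = sc.parallelize(1 to n).map(_ => Vectors.dense(Array.fill(n)(scala.util.Random.nextDouble())))
val mat = new RowMatrix(rows)

// Only the Gramian computation (a treeAggregate) appears as a cluster stage;
// with k = n, the decomposition that follows runs locally on the driver.
val svd = mat.computeSVD(n, computeU = true)
```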

Francois Saab

1 Answer


Unfortunately, it looks like modifying the computation strategy is not possible without tinkering with private APIs.

Depending on the number of columns and on k, Spark automatically adjusts the computation strategy, and the fully distributed mode with multiple passes is used only when the matrix is not tiny and k is no more than about half the number of columns (or the number of columns is too large for the local path).
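
That selection happens in the "auto" branch of RowMatrix.computeSVD. Here is a paraphrase of that logic from the mllib source of this era (thresholds may differ between Spark versions; n is the number of columns):

```scala
// Only "dist-eigs" runs the decomposition itself on the cluster; the two
// "local" modes compute the Gramian via a distributed treeAggregate and
// then decompose it on the driver.
def chooseMode(n: Int, k: Int): String =
  if (n < 100 || (k > n / 2 && n <= 15000)) {
    if (k < n / 3) "local-eigs" // Gramian + ARPACK on the driver
    else "local-svd"            // Gramian + LAPACK on the driver
  } else {
    "dist-eigs"                 // distributed ARPACK, multiple passes
  }

chooseMode(9446, 9446) // "local-svd": only the treeAggregate is distributed
```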

At first glance, you could trigger the distributed computation by keeping k between nCols / 3 and nCols / 2; in fact, anything at or below nCols / 2 falls into the distributed branch for a matrix of this width.
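
Continuing the sketch from the question (4723 = 9446 / 2 is just an illustrative value):

```scala
// With n = 9446 columns, any k <= n / 2 selects "dist-eigs", so the
// ARPACK iterations run as cluster jobs instead of on the driver.
val partialSvd = mat.computeSVD(4723, computeU = true)
```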

zero323
  • OK zero323, I need to conserve the dimensions, so I am using k = 9446. I am using SVD to later perform the inverse of the original matrix. I think this is a large number, but I still do not see the job distributed on the cluster; it runs on the driver – Francois Saab Aug 25 '16 at 19:46
  • zero323, what if I want to use the full rank? – Francois Saab Aug 25 '16 at 19:50
  • I don't have a decent solution here. Since all the required classes up the chain are `private` to `mllib`, there is not much choice: either you re-implement them or you intentionally break the access limitations. – zero323 Aug 25 '16 at 19:58
  • OK, I am convinced this is how it is implemented. Thank you, zero323 – Francois Saab Aug 25 '16 at 20:20
  • @zero323, can you kindly tell me how I can "break access limitations"? – Francois Saab Oct 04 '16 at 06:01
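
Regarding the last comment: `private[mllib]` only restricts access from outside the `org.apache.spark.mllib` package, so the usual way to break it is to declare your own file inside that package. A rough and deliberately fragile sketch of the idea (the object and method names below are made up; which `private[mllib]` helpers actually exist depends on your Spark version's source):

```scala
// A file declared in Spark's own namespace can see private[mllib] members.
// This breaks encapsulation on purpose and may stop compiling on any Spark
// upgrade, so pin the exact Spark version you build against.
package org.apache.spark.mllib.linalg.distributed

object SvdInternalsHack {
  // From this package you can, in principle, reach the private[mllib]
  // machinery computeSVD uses internally (e.g. its ARPACK wrapper).
  // Check the matching Spark source for the real names and signatures;
  // this placeholder only marks where such a call would go.
  def distributedFullSvd(mat: RowMatrix, k: Int): Unit = ???
}
```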