Most algorithms that use matrix operations in Spark have to use either Vectors or store their data in some other form. Is there support for building matrices directly in Spark?
- This question has some good info on the topic: http://stackoverflow.com/questions/24147186/how-to-build-a-large-distributed-sparse-matrix-in-apache-spark-1-0 – maasg Jun 12 '14 at 08:24
- When working with big data I try to avoid algorithms that use matrix operations, as they often don't scale well. Moreover, linear algebra techniques in machine learning often stem from linearity, Euclidean, and Gaussian assumptions. When working with big data, it's time to broaden your horizons and learn some new techniques :) – samthebest Jun 13 '14 at 09:51
2 Answers
Apache recently released Spark 1.0, which has support for creating matrices in Spark, and that is a really appealing idea. The feature is still experimental and only a limited set of operations can be performed on the matrices you create, but this is sure to grow in future releases. The idea of matrix operations running at the speed of Spark is amazing.
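As a rough illustration, here is a minimal sketch of the matrix API. Note that the Python bindings for RowMatrix shown here only arrived in a later release (the Scala API shipped first), so treat the exact imports as an assumption rather than as part of Spark 1.0 itself:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext(appName="matrix-sketch")

# A local (non-distributed) dense matrix; values are column-major.
local = Matrices.dense(2, 3, [1.0, 4.0, 2.0, 5.0, 3.0, 6.0])

# A distributed matrix: each RDD element becomes one row.
rows = sc.parallelize([np.array([1.0, 2.0, 3.0]),
                       np.array([4.0, 5.0, 6.0])])
mat = RowMatrix(rows)
print(mat.numRows(), mat.numCols())  # 2 3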

The way I use matrices in Spark is through Python with NumPy and SciPy: pull the data into matrices from a CSV file and use them as needed. I treat the matrices the same as I would in normal Python/SciPy; it is how you parallelize the data that makes it slightly different.
Something like this:
from pyspark.mllib.regression import LabeledPoint

data = []  # b holds the labels, A the NumPy feature matrix
for i in range(na + 2):
    data.append(LabeledPoint(b[i], A[i, :]))
# WhatYouDo stands in for whichever MLlib model class you train
model = WhatYouDo.train(sc.parallelize(data), iterations=40, step=0.01, initialWeights=wa)
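For the CSV-loading step mentioned above, a minimal sketch might look like this (the filename data.csv and the column layout, label first and features after, are assumptions for illustration):

import numpy as np

# Load the whole CSV into a NumPy array, then split labels from features.
raw = np.loadtxt('data.csv', delimiter=',')
b = raw[:, 0]   # first column: labels
A = raw[:, 1:]  # remaining columns: features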
The pain was getting NumPy and SciPy onto the Spark nodes. I found the best way to make sure all the other libraries and files needed were included was to use:
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose

- Good, but the performance would never match what you'd get by using Spark's own matrices; those would be optimised far more. – Pravesh Jain Feb 04 '15 at 05:29
- Yeah, you are right. I use them because I'm importing from normal Python programs and using NumPy for specific calculations. I'm working on making the process Spark-based to achieve the efficiency you point out. Thanks! – Jesse Feb 04 '15 at 20:23