
I want to implement an equation similar to the one in the PageRank algorithm using PySpark.

With a traditional approach it is simple to implement, but I got stuck when trying to port the implementation to PySpark.

Let's say we have a matrix W of dimension n × n and a vector x initialized to (1/n, ..., 1/n), where n is the number of rows in W.

Suppose W is given as a PySpark DataFrame, for example:

src dst weight
a    b    0.5
a    c    0.2
etc

where each row of the DataFrame corresponds to one entry of W. For example, the entry in row a and column b has the value 0.5. I want to implement the equation:

x1 = Wx
x = x1

Then repeat the above two actions m times, where m is given as input.
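
To make the intended computation concrete, here is a minimal plain-Python sketch of the update x1 = Wx (taking the matrix in the equation to be W) on the (src, dst, weight) rows. It is a reference for the arithmetic only, not a PySpark implementation. The function name `iterate` is hypothetical, and it assumes that n is the number of distinct labels appearing in src or dst, and that entries absent from the table are zero.

```python
from collections import defaultdict

def iterate(edges, m):
    """Apply x <- W x a total of m times, with W given as (src, dst, weight) rows."""
    # Assumption: every row/column label of W appears in src or dst.
    nodes = sorted({s for s, _, _ in edges} | {d for _, d, _ in edges})
    n = len(nodes)
    # Initial vector x = (1/n, ..., 1/n), keyed by node label.
    x = {node: 1.0 / n for node in nodes}
    for _ in range(m):
        x1 = defaultdict(float)
        # (W x)[src] = sum over dst of W[src, dst] * x[dst]
        for src, dst, weight in edges:
            x1[src] += weight * x[dst]
        # Labels with no row entries in the table correspond to an all-zero row.
        x = {node: x1.get(node, 0.0) for node in nodes}
    return x

# The two example rows from the table above.
edges = [("a", "b", 0.5), ("a", "c", 0.2)]
result = iterate(edges, 1)
```

In PySpark the inner loop would presumably become a join of the W DataFrame with a DataFrame holding x on dst, a multiplication of weight by the joined value, and a `groupBy("src")` with a sum aggregate. That mapping is a suggestion and has not been tested here.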

Any hint on how to implement the above action will be greatly appreciated.

moudi
  • @jgp your help please :) – moudi Apr 05 '19 at 07:15
  • A matrix and a dataframe are not the same. The order of lines is important in matrices but not in dataframes. A dataframe can store matrices (one per line, for example) and you can then work on each matrix, but you cannot consider a dataframe as a matrix. – Steven Apr 05 '19 at 13:06
  • Spark (and pyspark) is a stream-processing tool. You are trying to run a matrix-processing algorithm on it. Maybe it's not the best tool for the job. Try to reformulate your algorithm so that it would work on a stream of data of fixed width (and likely smaller than your projected `n`), ideally allowing parallel processing of parts of the stream. If it's hard to do, likely spark is not a good choice. – 9000 Apr 05 '19 at 18:24

0 Answers