3

Is there a good SQL PageRank implementation out there? I've looked at http://www.databasedevelop.com/197517/, but it is lacking in legibility and correct (T-SQL) syntax.

While we're at it, does anyone know what kind of SQL the above link is using? What SQL uses 'is' at random places, 'where' with nothing after, that weird AT keyword, and the like?

Bondolin
  • 2,793
  • 7
  • 34
  • 62
  • The linked T-SQL is horrible. Using cursors is not the right tool. PageRank maps perfectly to a map-reduce style SQL query. – usr Jul 22 '13 at 13:05
  • What is map-reduce style, please? – Bondolin Jul 22 '13 at 13:14
  • 2
    Map-reduce is the modern buzzword for a `GROUP BY` query. Google for "page rank map reduce" to find out more. – usr Jul 22 '13 at 13:23
  • Thanks, that was helpful. But is there a SQL implementation? – Bondolin Jul 22 '13 at 13:52
  • And I just found out I can't do map-reduce on our system... – Bondolin Jul 22 '13 at 14:05
  • I have done pagerank in T-SQL. I have no time to clean up the code an post it alas. But it is easy enough that you can convert a generic page rank tutorial into SQL code. (Also, every SQL Server can do map reduce. It just is not called that way). – usr Jul 22 '13 at 14:10

2 Answers2

6

I just implemented the PageRank Algorithm in SQL. The algorithm work in following progress.

1.Calculate PageRank initial rank values;

2.Joining PageRank table with Edge table to emit rank values to adjacent nodes, and using aggregate function sum to "collect" received values. And then, saving the result to a temple table TmpRank;

  1. Swapping the content of PageRank and TmpRank, and go to step 2 until the convergent condition is satisfied or maximum iterated time is reached.

Here is the code:

-- The graph data and algorithm source from the book "Mining of Massive Datasets", P175, http://infolab.stanford.edu/~ullman/mmds/book.pdf
-- This script has been verified the correctness in SQL Server 2017 Linux Version.
DROP TABLE Node;
DROP TABLE Edge;
DROP TABLE OutDegree;
DROP TABLE PageRank;
CREATE TABLE Node(id int PRIMARY KEY);
CREATE TABLE Edge(src int,dst int, PRIMARY KEY (src, dst));
CREATE TABLE OutDegree(id int PRIMARY KEY, degree int);
CREATE TABLE PageRank(id int PRIMARY KEY, rank float);
CREATE TABLE TmpRank(id int PRIMARY KEY, rank float);

--delete all records
DELETE FROM Node;
DELETE FROM Edge;
DELETE FROM OutDegree;
DELETE FROM PageRank;
DELETE FROM TmpRank;

--init basic tables
INSERT INTO Node VALUES (0);
INSERT INTO Node VALUES (1);
INSERT INTO Node VALUES (2);
INSERT INTO Node VALUES (3);

INSERT INTO Edge VALUES (0, 1);
INSERT INTO Edge VALUES (0, 2);
INSERT INTO Edge VALUES (0, 3);
INSERT INTO Edge VALUES (1, 0);
INSERT INTO Edge VALUES (1, 3);
INSERT INTO Edge VALUES (2, 2);
INSERT INTO Edge VALUES (3, 1);
INSERT INTO Edge VALUES (3, 2);

--compute out-degree
INSERT INTO OutDegree
SELECT Node.id, COUNT(Edge.src) --Count(Edge.src) instead of Count(*) for count no out-degree edge
FROM Node LEFT OUTER JOIN Edge
ON Node.id = Edge.src
GROUP BY Node.id;

--WARN: There's no special process for node with out-degree, This may cause wrong result
--      Please to make sure every node in graph has out-degree

DECLARE @ALPHA float = 0.8;
DECLARE @Node_Num int;
SELECT @Node_Num = COUNT(*) FROM Node;

--PageRank Init Value
INSERT INTO PageRank
SELECT Node.id, rank = (1 - @ALPHA) / @Node_Num
FROM Node INNER JOIN OutDegree
ON Node.id = OutDegree.id

/*
--For Debugging
SELECT * FROM Node;
SELECT * FROM Edge;
SELECT * FROM OutDegree;
SELECT * FROM PageRank;
SELECT * FROM TmpRank;
*/

DECLARE @Iteration int = 0;

WHILE @Iteration < 50
BEGIN
--Iteration Style
    SET @Iteration = @Iteration + 1

    INSERT INTO TmpRank
    SELECT Edge.dst, rank = SUM(@ALPHA * PageRank.rank / OutDegree.degree) + (1 - @ALPHA) / @Node_Num
    FROM PageRank
    INNER JOIN Edge ON PageRank.id = Edge.src
    INNER JOIN OutDegree ON PageRank.id = OutDegree.id
    GROUP BY Edge.dst

    DELETE FROM PageRank;
    INSERT INTO PageRank
    SELECT * FROM TmpRank;
    DELETE FROM TmpRank;
END

SELECT * FROM PageRank;
geng liang
  • 90
  • 1
  • 6
0

Depending on your SQL Server Version have a look at the OFFSET_FETCH window function. There is a lot of page ranking application for this. Of course this would require 2012.

I've also used SSIS and an NTILE() divided temporary table to accomplish paging in the absense of the ability to use OFFSET_FETCH. Usually use something like the record count divided by some maximum number I want to see in the page as the seed for NTILE call.

For whatever reason I can't even open your link so hopefully this is what your asking for.

MSDN - OFFSET_FETCH

MSDN - NTILE

Matt
  • 1,441
  • 1
  • 15
  • 29