1

I've implemented Warshall's algorithm in a MySQL Stored Procedure. Unfortunately the procedure takes a long time to complete. I'm a beginner in writing Stored Procedures, do you know what I can do, to make it faster?

Brief explanation: I'm trying to calculate the transitive closure of a adjacency list. I want to know, which nodes are connected (directly over one edge, or indirectly over more edges). For example:

Input:  (1, 2), (2, 3), (3, 4)
Output: (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)

Following Image illustrates the graph:


Image from Wikimedia Commons: https://en.wikipedia.org/wiki/File:Transitive-closure.svg

You can view the code on SQL Fiddle or here:

# "Warshall's algorithm" to calculate the transitive closure
# (1) For k = 1 to n
# (2)   For i = 1 to n
# (3)     If d[i,k] = 1
# (4)       For j = 1 to n
# (5)         If d[k,j] = 1 : d[i,j] = 1
create procedure closure()
begin
    drop table if exists adjMatrix;
    drop table if exists idArray;
    create temporary table adjMatrix (idFrom int not null, idTo int not null,
                                      primary key (idFrom, idTo));
    create temporary table idArray (id int);
    insert into adjMatrix select parent_id, id
                          from article where parent_id is not null;
    insert into idArray select id from article;
    blockk: begin
        declare k, fink int;
        declare ck cursor for select id from idArray;
        declare continue handler for not found set fink = 1;
        open ck;
        loopk: loop
            fetch ck into k;
            if fink = 1 then
                leave loopk;
            end if;
            blocki: begin
                declare i, fini int;
                declare ci cursor for select id from idArray;
                declare continue handler for not found set fini = 1;
                -- select k from dual;
                open ci;
                loopi: loop
                    fetch ci into i;
                    if fini = 1 then
                        leave loopi;
                    end if;
                    blockj: begin
                        if exists (select * from adjMatrix where idFrom=i and idTo=k)
                        then
                            blockx: begin
                                declare j, finj int;
                                declare cj cursor for select id from idArray;
                                declare continue handler for not found set finj = 1;
                                open cj;
                                loopj: loop
                                    fetch cj into j;
                                    if finj = 1 then
                                        leave loopj;
                                    end if;
                                    if exists (select * from adjMatrix
                                               where idFrom=k and idTo=j) then
                                        insert into adjMatrix values (i, j);
                                    end if;
                                end loop loopj;
                                close cj;
                            end blockx;
                        end if;
                    end blockj;
                end loop loopi;
                close ci;
                -- select idFrom, idTo from adjMatrix order by idFrom, idTo;
            end blocki;
        end loop loopk;
        close ck;
    end blockk;
    insert into adjMatrix select id, id from article where parent_id is null;
    select idFrom, idTo from adjMatrix order by idFrom, idTo;
    drop temporary table adjMatrix;
    drop temporary table idArray;
end//

Running the procedure on a table with 1466 records takes 45 seconds on my computer:

mysql> call closure;
+--------+------+
| idFrom | idTo |
+--------+------+
|      1 |    1 |
|      1 |    2 |
|      1 |    3 |
|      1 |    4 |
|      1 |    5 |
|      2 |    3 |
|      2 |    4 |
|      2 |    5 |
|      3 |    4 |
|      3 |    5 |
|      4 |    5 |
~        ~      ~
|   1587 | 1587 |
|   1588 | 1588 |
|   1589 | 1589 |
+--------+------+
1472 rows in set (45.58 sec)
Christian Ammer
  • 7,464
  • 6
  • 51
  • 108
  • I don't understand the output of your procedure. Floyds algorithm must give a matrix (n x n) of the shortest routes between i and j. And your procedure gave a n x 2 matrix. What do the numbers in it mean? Why are you recreating adjacency matrix every time you run the function? – user4035 Jun 13 '13 at 21:00
  • @user4035: Sorry I should have explained a bit more, I will edit my answer. What I'm doing is not calculating the shortest routes, I'm trying to find which nodes are connected through one ore more edges. – Christian Ammer Jun 13 '13 at 21:08

1 Answers1

1

Warning: Since I'm not familiar with mysql I've 'translated' the problem into MSSQL so you'll need to do some effort to translate it back =)

I'm guessing that the reason that things are slowish is because SQL doesn't really lends itself to this kind of operations; it doesn't "like" branching and looping and all those things. What it does like A LOT is matching data from one table to another table, preferably in big heaps. (Think about the R in RDBMS)

So, to speed up your stored procedure, you could either switch to a different programming language that is more appropriate for this style of coding; or you could translate the problem into something more befitting SQL. Off course the latter is much more fun ! =)

Thinking about the issue a bit I came up with this:

CREATE TABLE #result (idFrom int not null, idTo int not null, primary key (idFrom, idTo));

INSERT INTO #result
SELECT parent_id, id
  FROM article 
 WHERE parent_id is not null;

 WHILE @@ROWCOUNT > 0
    BEGIN
        INSERT INTO #result 
        SELECT DISTINCT f.idFrom, t.idTo
          FROM #result f
          JOIN #result t
            ON t.idFrom = f.idTo
         WHERE NOT EXISTS ( SELECT *
                              FROM #result old
                             WHERE old.idFrom = f.idFrom
                               AND old.idTo = t.idTo )
    END

SELECT * FROM #result ORDER BY idFrom, idTo

Again, this is TSQL (the SQL dialect used by MSSQL) but I'm guessing it should be pretty straightforward to convert it to mysql (??).

What it does is :

  • create a temporary table #result pretty much identical to your adjMatrix table
  • load the 'direct links' from the source table in them
  • insert all 'secondary combinations' by matching one records idTo to another records idFrom; making sure that it can't find said combination in the table yet and make sure said list only has unique combinations (distinct)
  • if new records (and thus combinations) were added, see if we can add 'the next' layer.

Once nothing new is found anymore, return the results

e.g.: Given the input (1, 2), (2, 3), (3, 4)

will first fill #result with (1, 2), (2, 3), (3, 4) and then go into the loop:

iteration1 will match the following records:

  • (1,2)-(2,3) => (1,3)
  • (2,3)-(3,4) => (2,4)

and add those to #result where we thus find (1, 2), (2, 3), (3, 4), (1, 3), (2, 4)

iteration2 will match the following records:

  • (1,2)-(2,3) => (1,3) but it will get eliminated due to the WHERE NOT EXISTS()
  • (1,2)-(2,4) => (1,4)
  • (2,3)-(3,4) => (2,4) but it will get eliminated due to the WHERE NOT EXISTS()
  • (1,3)-(3,4) => (1,4)

It will then DISTINCT said list and only one instance of (1,4) remains and will be added thus find (1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1 ,4)

iteration4 will match the following records:

  • (1,2)-(2,3) => (1,3) but it will get eliminated due to the WHERE NOT EXISTS()
  • (1,2)-(2,4) => (1,4) but it will get eliminated due to the WHERE NOT EXISTS()
  • (2,3)-(3,4) => (2,4) but it will get eliminated due to the WHERE NOT EXISTS()
  • (1,3)-(3,4) => (1,4) but it will get eliminated due to the WHERE NOT EXISTS()

As all new records get eliminated we end up with a zero-rowcount and hence jump out of the loop.

I've tried the algorithm with the SqlFiddle input too and the result is pretty much instantaneously but I ended up with an output of 15 records instead of 16. It seems that your code also includes a (1, 1) record which kind of surprised me ?!

Anyway, hope this helps. After some time working with SQL you'll automatically learn to operate on blocks of data instead of working 'value by value' which is something that can bring any RDBMS to its knees. It usually works fine for a small set of data but performance goes south once more data gets put in. Well written, set-based SQL hardly notices the difference between working on 100 or 100000 records. (so to speak =)

deroby
  • 5,902
  • 2
  • 19
  • 33
  • +1: Thank you for your effort and this alternative approach. I think I understand your idea. If I find some time I will try to translate it to MySQL and test it with my dataset. – Christian Ammer Jan 28 '14 at 07:38