
I have a graph in my PostgreSQL database, for the sake of example let's define it so:

CREATE TABLE nodes (node_id INTEGER);
CREATE TABLE roads (road_id INTEGER, nodes INTEGER[]);
INSERT INTO nodes VALUES (1), (2), (3), (4), (5);
INSERT INTO roads VALUES (1, '{1,2}'), (2, '{3,4}');

I want to create an SQL query that returns the number of connected components of the graph. In this example the answer is 3: nodes 1 and 2 are connected, as are 3 and 4, while 5 is not connected to anything.

I tried searching for union-find implementations in SQL, but to no avail. I then turned to CTEs, but I can't do it on my own. I was thinking of something like this:

WITH RECURSIVE cc(iterator_id, node_id, rank, iterator) AS
(
        SELECT row_number() OVER(), n.node_id, row_number() OVER (), 1 FROM nodes AS n
    UNION ALL
        -- Something here that does the magic
)
SELECT
    COUNT(DISTINCT rank) AS no_of_cc
FROM
    cc,
    (SELECT COUNT(*) FROM nodes) AS last_iterator_id
WHERE iterator = last_iterator_id;

where in each iteration we update the ranks of the rows whose iterator_id <= iterator. We iterate until iterator equals the biggest iterator_id, but I can't come up with the recursive part.

Can you help me find the number of connected components?

Gabriel's Messanger
Ritave
  • Is there any particular reason for you to use plain SQL, or may the solution be implemented as a PL/pgSQL procedure? One more thing: I see you write in Python. In PostgreSQL you can write procedures in [PL/Python](http://www.postgresql.org/docs/9.4/static/plpython.html), which is basically Python plus some features to access DB relations. – Gabriel's Messanger Nov 01 '15 at 19:07
  • I guess simply because I don't know those technologies and didn't know they could help me. I'll try learning and doing it in PL/Python, then. – Ritave Nov 01 '15 at 19:23
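For reference, the union-find approach the question mentions is straightforward outside SQL, and is the kind of logic a PL/Python procedure could wrap. A minimal sketch in plain Python (the function name and the `(road_id, nodes)` tuple representation are mine, chosen to match the example tables):

```python
def count_components(nodes, roads):
    """Count connected components with union-find (union by size + path compression)."""
    parent = {n: n for n in nodes}
    size = {n: 1 for n in nodes}

    def find(x):
        # Path compression: point visited nodes closer to the root.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra          # attach the smaller tree under the bigger one
        size[ra] += size[rb]

    # Every road connects all the nodes listed in its array.
    for _road_id, road_nodes in roads:
        for a, b in zip(road_nodes, road_nodes[1:]):
            union(a, b)

    # One component per distinct root.
    return len({find(n) for n in nodes})

print(count_components([1, 2, 3, 4, 5], [(1, [1, 2]), (2, [3, 4])]))  # 3
```

With the example data this prints 3, matching the expected answer.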

3 Answers


You know what? Although I recommended that you write a stored procedure in PL/Python, I later decided to write that single SQL query just for fun. Here's what I did, using a recursive CTE.

WITH RECURSIVE graph_search(node_id, connected_to, path, cycle) AS (
        SELECT node_id, connected_to, ARRAY[node_id], false FROM paths
    UNION 
        SELECT p.node_id, p.connected_to, gs.path || p.node_id, p.node_id=ANY(gs.path)
        FROM graph_search gs JOIN paths p ON gs.connected_to = p.node_id AND NOT gs.cycle
 ),
 paths AS (
    SELECT node_id, connected_to
    FROM (
        SELECT n.node_id, unnest(r.nodes) AS connected_to
        FROM nodes n JOIN roads r ON n.node_id = ANY(r.nodes)
    ) sub
    WHERE node_id <> connected_to
 ) 
SELECT count(DISTINCT component)
FROM (
        SELECT node_id,
               array_agg(DISTINCT reachable_node ORDER BY reachable_node) as component
        FROM (
            SELECT node_id, unnest(path) as reachable_node from graph_search 
        ) sub
        GROUP BY node_id
    UNION ALL /*need to append lonely nodes - they are components for themselves*/
        SELECT node_id, ARRAY[node_id]
        FROM nodes
        WHERE node_id NOT IN (SELECT node_id from paths)
) sub;
  • First I need a different representation of the graph itself. The ordinary CTE named paths creates a two-column table of pairs of connected nodes.
  • Then I slightly modified the example from the PostgreSQL manual so that I get, for each node, the list of nodes reachable from it.
  • Aggregation gives me the graph's components.
  • At the end I count the distinct components.
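The steps in the bullets above can be mirrored in plain Python to see the intermediate shapes (the variable names and the dict-based `roads` representation are mine, for illustration only):

```python
from collections import defaultdict

nodes = [1, 2, 3, 4, 5]
roads = {1: [1, 2], 2: [3, 4]}   # road_id -> array of node ids

# Step 1: the `paths` CTE -- ordered pairs of directly connected, distinct nodes.
paths = {(a, b) for rn in roads.values() for a in rn for b in rn if a != b}

# Step 2: the recursive walk -- everything reachable from each node.
adjacent = defaultdict(set)
for a, b in paths:
    adjacent[a].add(b)

reachable = {}
for start in adjacent:
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacent[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    reachable[start] = frozenset(seen)

# Step 3: lonely nodes are components on their own.
for n in nodes:
    reachable.setdefault(n, frozenset({n}))

# Step 4: count the distinct reachable sets.
print(len(set(reachable.values())))  # 3
```

Here `reachable` plays the role of the aggregated `component` column: nodes 1 and 2 both map to {1, 2}, nodes 3 and 4 to {3, 4}, and node 5 to {5}.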
Gabriel's Messanger
  • Postgres eats my whole 1GB ram while trying to run this query and then drops the connection and stops the query. Are you sure this isn't an infinite loop? Or maybe I can force Postgres to work on disk instead of blowing up ram? The whole data set is 485mb large if that helps – Ritave Nov 02 '15 at 11:23
  • Wait, here each node has an array of friends. In worst case (all nodes as one connected component) that's 235gb of data that Postgres is trying to load. Not gonna happen :P – Ritave Nov 02 '15 at 11:36
  • I am sure there's no infinite loop (this query can handle even cyclic graphs), but the size of the dataset is an important factor, as is machine performance. The algorithm here is rather naive, because this kind of problem isn't SQL's favourite :) I haven't tested it on such a big dataset, though. I have some ideas for optimization; if I come up with something useful I'll write it here later. BTW, each Postgres connection has a limited RAM sandbox for itself; if it's too small for an operation, the server starts using disk, which really increases execution time. – Gabriel's Messanger Nov 02 '15 at 12:23

The solution above won't work if the number of nodes is too big.

The most efficient solution (as long as you have enough RAM to hold all the data) is to read the data into memory in a language such as C or C++ and perform the calculations there.

But if the data size is too big and you have no choice, then you can probably do it this way:

(PL/pgSQL implementation, assuming we have a table roads(node1, node2))

CREATE TABLE node AS
  SELECT DISTINCT node1 AS id, node1 AS color
  FROM roads;

CREATE OR REPLACE FUNCTION merge_node()
RETURNS VOID
AS
$$
DECLARE
    left_to_do INT := 1;
    counter INT := 1;
    row record;
BEGIN
    DROP TABLE IF EXISTS t;
    CREATE TEMP TABLE t (
        node1 INT,
        prev INT,
        next INT
    );

    WHILE left_to_do > 0
    LOOP
        WITH joined_table AS (
            SELECT roads.node1,
                   MAX(v1.color) AS prev,
                   MAX(v2.color) AS next
            FROM roads
            JOIN node v1 ON roads.node1 = v1.id
            JOIN node v2 ON roads.node2 = v2.id
            GROUP BY roads.node1
        )
        INSERT INTO t (node1, prev, next)
        SELECT node1,
               prev,
               next
        FROM joined_table
        WHERE prev < next;
        SELECT COUNT(*) INTO left_to_do FROM t;
        UPDATE node color
        SET color = t.next
        FROM t
        WHERE color.id = t.node1;
        DELETE FROM t;
        counter := counter + 1;
    END LOOP;
END;
$$
LANGUAGE plpgsql;

This should work better if node degrees are low compared to the number of nodes. I tested it on a graph with 2.4 million nodes and 24 million edges, and it takes about 30-60 minutes with indices. (In comparison, C++ takes 2.5 minutes, most of which is spent reading the data from CSV and writing it back.)
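The loop above is a color-propagation scheme: every node starts colored with its own id, and each iteration raises a node's color to the maximum color among itself and its neighbors until nothing changes. A small sketch of the same idea in plain Python (the edge-list form and function name are assumptions; the PL/pgSQL version performs the same max-color updates set-at-a-time):

```python
def count_components_by_color(edges, nodes=None):
    """Propagate the maximum color along edges until stable, then count colors."""
    if nodes is None:
        nodes = {n for edge in edges for n in edge}
    color = {n: n for n in nodes}     # initial color = own id

    changed = True
    while changed:                    # mirrors WHILE left_to_do > 0
        changed = False
        for a, b in edges:
            hi = max(color[a], color[b])
            if color[a] != hi:
                color[a] = hi
                changed = True
            if color[b] != hi:
                color[b] = hi
                changed = True

    # Each connected component converges to one color: its maximum node id.
    return len(set(color.values()))

print(count_components_by_color([(1, 2), (3, 4)], nodes={1, 2, 3, 4, 5}))  # 3
```

Termination follows from the fact that colors only increase and are bounded by the maximum node id; the number of sweeps grows with the graph's diameter, which is why low-degree, well-indexed graphs suit the SQL version.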

Blomex

The paper In-database connected component analysis describes a SQL-based algorithm. It uses quite a few tables to store the intermediate results. The paper evaluates the algorithm in the Apache HAWQ DBMS, but it seems to be portable to PostgreSQL.

Gabor Szarnyas