9

I have the following three tables in SQL:

select * from movie limit 2;

  id  |           title            | year | content_rating | duration |    lang    |       country        |  gross   |  budget  | director_id 
------+----------------------------+------+----------------+----------+------------+----------------------+----------+----------+-------------
  407 | 102 Dalmatians             | 2000 | G              |      100 | English    | USA                  | 66941559 | 85000000 |        2174
 3699 | 10 Cloverfield Lane        | 2016 | PG-13          |      104 | English    | USA                  | 71897215 | 15000000 |        1327
(2 rows)

select * from actor limit 3;

  id  |         name         | facebook_likes 
------+----------------------+----------------
  408 | Christian Bale       |          23000
 1430 | Donna Murphy         |            553
   66 | Robert Downey Jr.    |          21000
(3 rows)

select * from acting limit 3;

 movie_id | actor_id 
----------+----------
      407 |     2024
     3699 |     1841
     3016 |       11
(3 rows)

Given two actors a1 and a2, I want to find the shortest path between a1 and a2.

For example, let's say a1 = 'Tom Cruise' and a2 = 'Robert Downey Jr'.

The output should be

Tom Cruise was in Days of Thunder with Robert Duvall -> Robert Duvall was in Lucky You with Robert Downey Jr.

In this case, Tom Cruise was 2 degrees away from Robert Downey Jr, with Robert Durvall connecting them. At most, I'd like to output up to 10 degrees, and after that ignore any connections.

I tried implementing the solution SQL query 6 degrees of separation for network analysis using recursive CTE but I don't think I've applied it properly. Help is appreciated, thanks in advance :)

Attempted query:

with recursive cte as (
select actor.name, movie.title, 1 as level from movie
left join acting on acting.movie_id = movie.id 
left join actor on actor.id = acting.actor_id
where actor.name = 'Tom Cruise'
union  
select actor.name, movie.title, level+1 from movie
left join acting on acting.movie_id = movie.id 
left join actor on actor.id = acting.actor_id
inner join cte on cte.name = actor.name
where cte.name = actor.name and cte.level < 10
)
select * from cte
Programmer
  • 1,266
  • 5
  • 23
  • 44

2 Answers2

5

I'm not sure what your second select in the query would return but here's a way to get the degrees of separation between actors:

Let's say we have a table of actor ids, Origin. In order to get all the actors that have played in the same movie as one of the actors in our table, we need to start with Origin, join with Acting and then Movie in order to get all the movies that our origin actors have played in, and then join with Acting again and the Actor table to get what we want. Notice that the Acting table appears two times. If we apply this to the recursive CTE and your question, noting that the Origin table would be Cte in your example, we get the following:

WITH RECURSIVE cte(id, distance) AS (
    SELECT actor.id, 0 
    FROM actor
    WHERE actor.name = 'Tom Cruise'

    UNION

    SELECT DISTINCT actor.id, cte.distance + 1
    FROM cte
    JOIN acting AS acting1 ON (cte.id = acting1.actor_id)
    JOIN movie ON (acting1.movie_id = movie.id)
    JOIN acting AS acting2 ON (movie.id = acting2.movie_id)
    JOIN actor ON (acting2.actor_id = actor.id)
    WHERE cte.id <> actor.id AND cte.distance + 1 <= 10
)

After this, the cte table will contain tuples of the type (id, dist), meaning that there exists a path from Tom Cruise to the actor with this id and with a distance of dist.

The DISTINCT is for efficiency reasons. There will be a lot of bad pairs in our Cte table (with the second value being larger than the true distance), especially if the actor graph is dense, but the correct tuple will be in the Cte table. By correct tuple I mean a tuple (actor, distance), such that distance is the shortest path between the starting actor (Tom Cruise, for instance) and this actor.

Edit: My bad, UNION does this already, so DISTINCT isn't needed for duplicates.

In order to get that distance, we add a select with a group by clause:

WITH RECURSIVE cte(id, distance) AS (
    SELECT actor.id, 0 
    FROM actor
    WHERE actor.name = 'Tom Cruise'

    UNION

    SELECT actor.id, cte.distance + 1
    FROM cte
    JOIN acting AS acting1 ON (cte.id = acting1.actor_id)
    JOIN movie ON (acting1.movie_id = movie.id)
    JOIN acting AS acting2 ON (movie.id = acting2.movie_id)
    JOIN actor ON (acting2.actor_id = actor.id)
    WHERE cte.id <> actor.id AND cte.distance + 1 <= 10
)
SELECT id, MIN(distance) AS distance
FROM cte
GROUP BY id
ORDER BY 2 ASC;

If you'd want to see the result for a given second actor, say Robert Downey Jr, then this would give you the answer regarding the degrees of separation:

WITH RECURSIVE cte(id, distance) AS (
    SELECT actor.id, 0 
    FROM actor
    WHERE actor.name = 'Tom Cruise'

    UNION

    SELECT actor.id, cte.distance + 1
    FROM cte
    JOIN acting AS acting1 ON (cte.id = acting1.actor_id)
    JOIN movie ON (acting1.movie_id = movie.id)
    JOIN acting AS acting2 ON (movie.id = acting2.movie_id)
    JOIN actor ON (acting2.actor_id = actor.id)
    WHERE cte.id <> actor.id AND cte.distance + 1 <= 10
), distance_table (id, distance) AS (
    SELECT id, MIN(distance) AS distance
    FROM cte
    GROUP BY id
)
SELECT 'Tom Cruise and ' || actor.name || ' are separated by ' ||
       COALESCE(TO_CHAR(distance_table.distance, '999999'), 'more than 10') || ' degrees of separation'
FROM actor
LEFT JOIN distance_table ON (actor.id = distance_table.id)
WHERE actor.name = 'Robert Downey Jr';

Although I don't think you'd generally want to compute this kind of information directly from the database, if you wanted to have a message telling the path between the actors, like the one you provided (Tom Cruise was in Days of Thunder with Robert Duvall -> Robert Duvall was in Lucky You with Robert Downey Jr.), then something like this could return that:

WITH RECURSIVE cte(id, name, distance, message) AS (
    SELECT actor.id, actor.name, 0, ''
    FROM actor
    WHERE actor.name = 'Tom Cruise'

    UNION

    SELECT actor.id, actor.name, cte.distance + 1, 
           cte.message || '> ' || cte.name || ' was in ' ||
           movie.title || ' with ' || actor.name || ' '
    FROM cte
    JOIN acting AS acting1 ON (cte.id = acting1.actor_id)
    JOIN movie ON (acting1.movie_id = movie.id)
    JOIN acting AS acting2 ON (movie.id = acting2.movie_id)
    JOIN actor ON (acting2.actor_id = actor.id)
    WHERE cte.id <> actor.id AND cte.distance + 1 <= 10
), distance_table (id, distance) AS (
    SELECT id, MIN(distance) AS distance
    FROM cte
    GROUP BY id
)
SELECT id, name, message, distance
FROM cte
WHERE (id, distance) IN (SELECT * FROM distance_table)
ORDER BY distance;
  • Hi Burcea, thanks for your answer! Regarding your last solution, is there any query to retrieve the appropriate information, and then I can just use php to print it as required? – Programmer Apr 20 '19 at 17:39
  • I think a good way to do it would be to retrieve from the database a set of (actor1_id, actor2_id, movie_id) tuples meaning that the actors have played in that movie and also (movie_id, movie_name) pairs. The first ones could be obtained with a query similar to the second subquery in the cte of my examples. Also be sure that if you have the tuple (actor1, actor2, movie), then you also have (actor2, actor1, movie). With this information, in your php code, apply a Breadth First Search algorithm to this data to get the path between two given actors. With this, you can print what you need. – Burcea Bogdan Madalin Apr 20 '19 at 18:22
  • My knowledge of PHP is limited, so I can't give an example of an implementation of the BFS algorithm in PHP. I could however provide one in a different programming language, like C++, Java, Python or Swift. Are you familiar with any of these languages? – Burcea Bogdan Madalin Apr 20 '19 at 18:28
4

Here's a stab at it (without CTEs). I happened to have a list of 4175 US city-state pairs. (Think state==movie and city==actor.)

Here's the setup from the table us:

SET NAMES utf8 COLLATE utf8_unicode_ci;

DROP TABLE IF EXISTS p_mapping;  -- state-city pairs (movie-actor)
CREATE TABLE p_mapping (
    state char(2) CHARACTER SET ascii NOT NULL,
    city varchar(255)  CHARACTER SET utf8 COLLATE utf8_unicode_ci  NOT NULL,
    PRIMARY KEY(state, city),
    INDEX(city, state)
) ENGINE=InnoDB;
INSERT INTO p_mapping (state, city)
    SELECT state, city  FROM us;

DROP TABLE IF EXISTS p_cities;  -- city ~= actor
CREATE TABLE p_cities (
    depth TINYINT UNSIGNED NOT NULL DEFAULT 0,
    from_state  char(2)  CHARACTER SET ascii  NOT NULL DEFAULT '',
    city  VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
    PRIMARY KEY(city)
) ENGINE=InnoDB COMMENT 'SO 55717636';
INSERT INTO p_cities (city)
    SELECT DISTINCT city  FROM p_mapping;

DROP TABLE IF EXISTS p_states;  -- state ~= movie
CREATE TABLE p_states (
    depth TINYINT UNSIGNED NOT NULL DEFAULT 0,
    from_city  VARCHAR(255)  CHARACTER SET utf8 COLLATE utf8_unicode_ci  NOT NULL  DEFAULT '',
    state char(2)  CHARACTER SET ascii  NOT NULL,
    PRIMARY KEY(state)
) ENGINE=InnoDB COMMENT 'SO 55717636';
INSERT INTO p_states (state)
    SELECT DISTINCT state  FROM p_mapping;

-- I picked the goal of linking Omaha (only in Nebraska) to Birmingham (in multiple states.) First some initialization:

SET @city := 'Omaha';   -- starting here

UPDATE p_cities
    SET depth = 1
    WHERE city = @city;
UPDATE p_states  AS s
  JOIN p_mapping AS m  USING(state)
    SET s.from_city = @city,
        s.depth = 1
    WHERE m.city = @city;

SET @depth := 1;

-- Then repeat the following 10 times or until rows_affected drops to 0. It stopped after 3 iterations.

UPDATE  p_cities AS c
   JOIN p_mapping AS m  USING(city)
   JOIN p_states  AS s  USING(state)
    SET c.from_state = m.state,
        c.depth = s.depth + 1
    WHERE s.depth = @depth
      AND c.depth = 0;

SET @depth := @depth + 1;

UPDATE  p_states AS s
   JOIN p_mapping AS m  USING(state)
   JOIN p_cities AS c   USING(city)
    SET s.from_city = m.city,
        s.depth = c.depth
    WHERE c.depth = @depth
      AND s.depth = 0;

-- end loop (and end of algorithm)

-- A correct path: Omaha -> NE -> Columbus -> OH -> Athens -> AL -> Birmingham -- Notice how this lists the answer (but vertically):

SELECT * FROM p_cities
    WHERE city in ('Omaha', 'Columbus', 'Athens', 'Birmingham')
    ORDER BY depth;        

    +-------+------------+------------+
    | depth | from_state | city       |
    +-------+------------+------------+
    |     1 |            | Omaha      |
    |     2 | NE         | Columbus   |
    |     3 | OH         | Athens     |
    |     4 | AL         | Birmingham |
    +-------+------------+------------+
    4 rows in set (0.00 sec)

-- 'Proof' that the links work for the answer below:

SELECT * FROM p_mapping
    WHERE city IN ('Omaha', 'Columbus', 'Athens', 'Birmingham')
      AND state IN ('NE', 'OH', 'TN', 'AL');

    +-------+------------+
    | state | city       |
    +-------+------------+
    | AL    | Athens     |
    | OH    | Athens     |
    | TN    | Athens     |
    | AL    | Birmingham |
    | NE    | Columbus   |
    | OH    | Columbus   |
    | NE    | Omaha      |
    +-------+------------+
    7 rows in set (0.00 sec)

-- (The other table)

SELECT * FROM p_states
    WHERE from_city IN ('Omaha', 'Columbus', 'Athens', 'Birmingham')
       OR state IN ('NE', 'OH', 'TN', 'AL')
    ORDER BY depth;

    +-------+-----------+-------+
    | depth | from_city | state |
    +-------+-----------+-------+
    |     1 | Omaha     | NE    |
    |     2 | Columbus  | GA    |
    |     2 | Columbus  | IN    |
    |     2 | Columbus  | MS    |
    |     2 | Columbus  | OH    |
    |     3 | Athens    | AL    |
    |     3 | Athens    | TN    |
    +-------+-----------+-------+
    7 rows in set (0.00 sec)
Rick James
  • 135,179
  • 13
  • 127
  • 222