2

I have three tables:

create table genres
(
genre_id serial primary key,
genre_name varchar NOT NULL UNIQUE
);

create table movies
(
movie_id serial primary key,
movie_name varchar NOT NULL           
);

create table movie_genres
(
movie_id integer references movies NOT NULL,
genre_id integer references genres NOT NULL, 
PRIMARY KEY(movie_id, genre_id)
);

Tables genres and movies are full of data and I want to generate some random data for table movie_genres, so that every movie has at least one genre. I tried it this way, but then it is possible for a movie to be without any genre. Can anyone help me with that, please?

insert into movie_genres 
select movie_id, genre_id 
from genres cross join movies 
where random() < 0.15;
abc
  • 39
  • 4

1 Answers1

1

Hmm, you can try to join a derived table in which you first select one random genre and then UNION some more randomly.

INSERT INTO movie_genres 
        (movie_id,
         genre_id)
SELECT m.movie_id,
       rg.genre_id
       FROM movies m
            CROSS JOIN ((SELECT g.genre_id
                                FROM genres g
                                ORDER BY random()
                                LIMIT 1)
                        UNION
                        (SELECT g.genre_id
                                FROM genres g
                                WHERE random() < 0.15)) rg;

That however means that every movie has that one genre selected first. To overcome this and have the first genre be random per movie, a lateral join can be used. (Remark: You need to use some column from the outer table in the derived table as otherwise the optimizer seems to optimize the LATERAL away.)

INSERT INTO movie_genres 
            (movie_id,
             genre_id)
SELECT rg.movie_id,
       rg.genre_id
       FROM movies m
            CROSS JOIN LATERAL ((SELECT g.genre_id,
                                        m.movie_id -- that's just here to force the optimizer to keep the join lateral
                                        FROM genres g
                                        ORDER BY random()
                                        LIMIT 1)
                                UNION
                                (SELECT g.genre_id,
                                        m.movie_id
                                        FROM genres g
                                        WHERE random() < 0.15)) rg;

db<>fiddle

sticky bit
  • 36,626
  • 12
  • 31
  • 42
  • `SELECT rg.* FROM …` to make the `movie_id` selection meaningful – Bergi Mar 27 '21 at 14:03
  • @Bergi: I don't understand your comment? We only need the `genre_id` from the derived table for the `INSERT`. See my remark as to why `movie_id` is in there. – sticky bit Mar 27 '21 at 14:05
  • Yes, I understand that, but if `rg.movie_id` is not used anywhere what keeps the optimiser from stripping that away as well? I was suggesting to use that column in the actual `INSERT` statement, instead of the `m.movie_id`. – Bergi Mar 27 '21 at 14:10
  • @Bergi: Fair point! I changed it to use the `movie_id` from the derived table. – sticky bit Mar 27 '21 at 14:14