1

I work in HDP 2.6.5 platformon using Hive (1.2.1000.2.6.5.0-292) on a simple database based on data from: https://grouplens.org/datasets/movielens/100k/
I have 4 tables named: genre, movies, ratings, users as below:

CREATE TABLE genre(genre string, genre_id int);
CREATE TABLE movies (movie_id INT, title STRING, rel_date DATE, video_rel_date STRING, 
imdb_url STRING, unknown INT, action INT, adventure INT, animation INT, childrens INT, 
comedy INT, crime INT, documentary INT, drama INT, fantasy INT, noir INT, horror INT, 
musical INT, mystery INT, romance INT, sci_fi INT, thriller INT, war INT, western INT) 
CLUSTERED BY (movie_id) INTO 12 BUCKETS STORED AS ORC;
CREATE TABLE ratings(user_id int, movie_id int, rating int, rating_time int);
CREATE TABLE users(user_id int, age int, gender char(1), occupation string, zip int);

I would like to write a query returning which genre of movies was watched most often by women and which by men? But the problem for me is the structure of the movies table where the movie genre is located:

1|Toy Story (1995)|1995-01-01||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

The last 19 fields are the genres, a '1' indicates the movie is of that genre, a '0' indicates it is not. Additionally movies can be in several genres at once. The gender is represented in 'users' table as 'M' or 'F' char. The required tables can be easily joined, but how to return and group the genres which are the columns names?

SELECT m.title, r.rating, u.gender
FROM movies m INNER JOIN ratings r ON (m.movie_id = r.movie_id) 
INNER JOIN users u ON (u.user_id = r.user_id);
leftjoin
  • 36,950
  • 8
  • 57
  • 116
uhlik
  • 105
  • 9
  • Why someone voted to close this question? To me it seems quite valid question for SO. There are not too many ways how to do it and nothing opinion based, check the answers please. The answers are different ways and OP can choose the best. OP clearly described the data structure, the task and initial attempt. – leftjoin Dec 30 '20 at 19:31

2 Answers2

0

Awful data model. You should have a table with one row per movie and genre.

To solve this problem, I would suggest unpivoting to aggregate:

select mg.*
from (select m.movie_id, u.gender, count(*) as cnt,
             rank() over (partition by gender order by count(*) desc) as seqnum
      from ((select movie_id, 'action' from movies where action = 1) union all
            (select movie_id, 'adventure' from movies where adventure = 1) union all
            . . .
           ) m join
           ratings r
           on r.movie_id = m.movie_id join
           users u
           on r.user_id = u.user_id
       group by m.movie_id, u.gender
      ) mg
where seqnum = 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Make an array of genre columns placed in the order corresponding to genre_id, explode array and join by position in array with genre table. Like this(not tested):

   select s.title, s.genre, s.gender, s.rating, s.cnt
    from
    (select s.title, s.gender, s.rating, s.cnt, s.genre,
           rank() over (partition by s.gender order by s.cnt desc) as rnk
     from
       (
        select m.title, u.gender, r.rating, g.genre, count(*) over(partition by u.gender) cnt
          from    
            (select m.movie_id, m.title, e.id+1 as genre_id
               from movies m
                  lateral view 
                  posexplode (array(--place columns in a positions corresponding their genre_id
                              unknown, action, adventure, animation, childrens, 
                              comedy, crime, documentary, drama, fantasy, 
                              noir, horror, musical, mystery, romance, 
                              sci_fi, thriller, war, western
                                   )
                             )e as id, val
              where e.val=1
             ) m
             INNER JOIN ratings r ON (m.movie_id = r.movie_id) 
             INNER JOIN users   u ON (u.user_id = r.user_id)
             INNER JOIN genre   g ON (g.genre_id = m.genre_id)
         ) s
    ) s where rnk = 1

          
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thank you very much! Apart from the missing comma before rank(), the query works fine, I am only concerned with the duplicate value in the column 's.cnt' it is always '53332'. By the way, learning HQL on VirtualBox is a bit of a pain, those queries get back forever, although I know it is MapReduce under the hood is there any other faster way to test such queries? – uhlik Dec 30 '20 at 17:53
  • @uhlik if it is Hive, then use Tez as execution engine. `set hive.execution.engine=tez;` It also uses map-reduce primitives but builds more efficient DAG and eliminates intermediate writes into hdfs.But better try something else like databricks instead of virtualbox. – leftjoin Dec 30 '20 at 19:09
  • @ulik s.cnt is counted for the gender partition. You can use row_number if you want single row with highest count. Or you may want some aggregation on top of this query... I have no your data and can not experiment with. I just provided an idea how to do this – leftjoin Dec 30 '20 at 19:14
  • @uhlik Also look at Ignite https://medium.com/datadriveninvestor/hadoop-integration-with-apache-ignite-and-using-hive-with-ignitemr-db8b8c09110e. There are many tools in bigData. You can try and decide which are better or use all of them – leftjoin Dec 30 '20 at 19:48