1

I am trying to convert following MySQL query to Hive

MySQL Query

SELECT
    departments.dept_name,
    dept_emp.dept_no,
    gender,
    (count(*)/(select count(*) from employees)) AS Sex
FROM 
    employees,
    dept_emp,departments
WHERE 
    dept_emp.dept_no = departments.dept_no
    AND dept_emp.emp_no =  employees.emp_no
GROUP BY 
    dept_emp.dept_no, 
    departments.dept_name,
    gender
ORDER BY 
    dept_emp.dept_no;

Hive Query

WITH 
    q1 as (SELECT COUNT(*) AS TOTAL_COUNT FROM employees),
    q2 as (SELECT gender,COUNT(*) as gender_count FROM employees GROUP BY gender)
SELECT 
    departments.dept_name,
    dept_emp.dept_no,
    gender,
    gender_count/TOTAL_COUNT As Sex 
FROM 
    q1,
    q2,
    dept_emp,
    departments
WHERE 
    dept_emp.dept_no = departments.dept_no
    AND dept_emp.emp_no = dept_emp.emp_no
GROUP BY 
    dept_emp.dept_no, 
    departments.dept_name,
    q2.gender
ORDER BY 
    dept_emp.dept_no;

But I am getting error

SemanticException [Error 10025]: Line 3:53 Expression not in GROUP BY key : TOTAL_COUNT

Thank you in Advance!

GMB
  • 216,147
  • 25
  • 84
  • 135

2 Answers2

1

Apart from the error on missing non-aggregated columns in the GROUP BY clause, the logic in your new query seems different than the one in the legacy query (for example : subquery q2 computes something new... and has no join condition with other tables).

Hive does not support subqueries in the SELECT clause, but it does allow them in FROM and WHERE clauses. I would simply move the inline subquery to the FROM clause. Since it returns only one record, that would be a CROSS JOIN:

SELECT
    d.dept_name,
    de.dept_no,
    e.gender,
    (count(*)/x.cnt) AS Sex
FROM 
    employees e
    INNER JOIN dept_emp de ON de.emp_no =  e.emp_no
    INNER JOIN departments d ON de.dept_no = d.dept_no
    CROSS JOIN (SELECT COUNT(*) cnt FROM employees) x
GROUP BY 
    de.dept_no, 
    d.dept_name,
    e.gender
ORDER BY 
    de.dept_no;

NB1: always use explicit, standard JOINs instead of old-shool, implicit JOINs; I modified the query accordingly (and also added table aliases).

GMB
  • 216,147
  • 25
  • 84
  • 135
0

Actually you can easily fix exception in your query by using MAX() aggregation on columns which are not in group by or adding them to the group by. I totally agree with @GMB about explicit joins and also would like to add that you can eliminate cross join and employee table extra scan using analytic count():

SELECT
    d.dept_name,
    de.dept_no,
    e.gender,
    count(*)/max(e.total_cnt)  as Sex
FROM 
    (select emp_no, gender, 
            count(*) over() as total_cnt
       from employees e ) e
    INNER JOIN dept_emp de ON de.emp_no =  e.emp_no
    INNER JOIN departments d ON de.dept_no = d.dept_no
GROUP BY 
    de.dept_no, 
    d.dept_name,
    e.gender
ORDER BY 
    de.dept_no;
leftjoin
  • 36,950
  • 8
  • 57
  • 116