13

I try to write Hive Sql like that

SELECT count(1), substr(date, 1, 4) as year
FROM ***
GROUP BY year

But Hive cannot recognize the alias name 'year', it complains that: FAILED: SemanticException [Error 10004]: Line 1:79 Invalid table alias or column reference 'year'

One solution(Hive: SELECT AS and GROUP BY) suggest to use 'GROUP BY substr(date, 1, 4)'.

It works! However in some cases the value I want to group by may be generated from multiple lines of hive function code, it's very ugly to write code like

SELECT count(1), func1(func2(..........................)) AS something
FROM ***
GROUP BY func1(func2(..........................))

Is there any clean way in Hive to do that? Any suggestions?

Community
  • 1
  • 1
twds
  • 333
  • 1
  • 4
  • 15

3 Answers3

25

Specifying the position in Group By will solve your issue. This position number in Group By works even when SET hive.groupby.orderby.position.alias=true; (Hive 0.12)

SELECT count(1), substr(date, 1, 4) as year  
FROM ***
GROUP BY 2;
Derek Farren
  • 127
  • 1
  • 6
Partha Kaushik
  • 690
  • 4
  • 8
9

In Hive 0.11.0 and later, columns can be specified by position if hive.groupby.orderby.position.alias is set to true (the default is false). So setting set hive.groupby.orderby.position.alias=true; in your .hql (or .hiverc for a permanent solution) will do the trick and then you can type group by 2 for the above example. Source: hive language manual

Angelo Di Donato
  • 1,093
  • 9
  • 11
0

One solution that comes to mind is put GROUP BY to outer query:

SELECT count(*) , year FROM 
(
   SELECT substr(date, 1, 4) as year FORM ***
) inner
GROUP BY year

GL!

www
  • 4,365
  • 1
  • 23
  • 24
  • 1
    How about the efficiency? It seems that Hive will first generate a temporary table with the same size, and then make the aggregation operation? – twds Apr 05 '15 at 08:46
  • You are 100% correct it might influence adversely performance. But it seems to only way to omit function use in `Group by`. Question here is if hive evaluates function value twice in your regular query? My guess is that is evaluating it only one time. In this case your original query seems to be optimal. – www Apr 05 '15 at 09:13