1

I'm looking for a way to optimize my query.

We have a table with events called lea, with a column app_properties, which are tags, stored as a comma separated string.

I would like to select all the events that match the result of a query that select the desired tags.

My first try:

SELECT uuid, app_properties, tag
FROM events
LATERAL VIEW explode(split(app_properties, '(, |,)')) tag_table AS tag
WHERE tag IN (SELECT source_value FROM mapping WHERE indicator = 'Bandwidth Usage')

But Hive will not allow this...

FAILED: SemanticException [Error 10249]: Line 4:6 Unsupported SubQuery Expression 'tag': Correlating expression cannot contain unqualified column references.

Gave it another try by replacing WHERE tag IN by WHERE tag_table.tag IN but not luck...

FAILED: SemanticException Line 4:6 Invalid table alias tag_table' in definition of SubQuery sq_1 [tag_table.tag IN (SELECT source_value FROM mapping WHERE indicator = 'Bandwidth Usage')] used as sq_1 at Line 4:20.

In the end... The query below gives the desired result, but I've a feeling that this is not the most optimized way of solving this use case. Has anyone ran into the same use case where you need the select from a LATERAL VIEW using a Sub query?

SELECT to_date(substring(events.time, 0, 10)) as date, t2.code, t2.indicator, count(1) as total
FROM events
LEFT JOIN (
    SELECT distinct t.uuid, im.code, im.indicator
    FROM mapping im
    RIGHT JOIN (
      SELECT tag, uuid
      FROM events
      LATERAL VIEW explode(split(app_properties, '(, |,)')) tag_table AS tag
      ) t
    ON im.source_value = t.tag AND im.indicator = 'Bandwidth Usage'
    WHERE im.source_value IS NOT NULL
) t2 ON (events.uuid = t2.uuid)
WHERE t2.code IS NOT NULL
GROUP BY to_date(substring(events.time, 0, 10)), t2.code, t2.indicator;
Phani
  • 5,319
  • 6
  • 35
  • 43
Bas
  • 597
  • 5
  • 10
  • 22

1 Answers1

1

The Hive subquery in the WHERE clause can be used with IN, NOT IN, EXIST, or NOT EXIST as follows. If the alias (see the following example for the employee table) is not specified before columns (name) in the WHERE condition, Hive will report the error Correlating expression cannot contain unqualified column references. This is a limitation of the Hive subquery.

From Apache Hive Essentials.

I guess this problem is also caused by subquery. events should have an alias