4

I have a Hive table that tracks the status of an object moving through stages of a process. The table looks like this:

hive> desc journeys;
object_id           string                                      
journey_statuses    array<string>

Here's a typical example of a record:

12345678    ["A","A","A","B","B","B","C","C","C","C","D"]

The records in the table were generated using Hive 0.13's collect_list and the statuses have an order (if order wasn't important, I would have used collect_set). For each object_id, I'd like to abbreviate the journey to return the journey statuses in the order that they appear.

I wrote a quick Python script that reads from stdin:

#!/usr/bin/env python
import sys
import itertools

for line in sys.stdin:
    inputList = eval(line.strip())
    readahead = iter(inputList)
    next(readahead)
    result = []
    for id, (a, b) in enumerate(itertools.izip(inputList, readahead)):
        if id == 0:
          result.append(a)
        if a != b:
          result.append(b)
    print result

I planned to use this in a Hive transform call. It seems to work when run locally:

$ echo '["A","A","A","B","B","B","C","C","C","C","D"]' | python abbreviate_list.py
['A', 'B', 'C', 'D']

However, when I add the file and try to execute within Hive, an error is returned:

hive> add file abbreviateList.py;                                                                           
Added resource: abbreviateList.py

hive> select
    >   object_id,
    >   transform(journey_statuses) using 'python abbreviateList.py' as journey_statuses_abbreviated
    > from journeys;
NoViableAltException( ... wall of Java error messages ... )
FAILED: ParseException line 3:2 cannot recognize input near 'transform' '(' 'journey_statuses' in select expression

Can you see what I'm doing wrong?

Alex Woolford
  • 4,433
  • 11
  • 47
  • 80

1 Answers1

5

Apparently you can't SELECT other fields that are not in the transform (in your example, object_id). This other SO question seems to indirectly address that:

How can select a column and do a TRANSFORM in Hive?

You could in theory modify your Python to accept the object_id as an input parameter and make it a passthrough to another output field if you need it to be included in the output.

Community
  • 1
  • 1
rchang
  • 5,150
  • 1
  • 15
  • 25