I have a Hive table that tracks the status of an object moving through stages of a process. The table looks like this:
hive> desc journeys;
object_id string
journey_statuses array<string>
Here's a typical example of a record:
12345678 ["A","A","A","B","B","B","C","C","C","C","D"]
The records in the table were generated using Hive 0.13's collect_list
and the statuses have an order (if order wasn't important, I would have used collect_set
). For each object_id, I'd like to abbreviate the journey to return the journey statuses in the order that they appear.
I wrote a quick Python script that reads from stdin:
#!/usr/bin/env python
import sys
import itertools
for line in sys.stdin:
inputList = eval(line.strip())
readahead = iter(inputList)
next(readahead)
result = []
for id, (a, b) in enumerate(itertools.izip(inputList, readahead)):
if id == 0:
result.append(a)
if a != b:
result.append(b)
print result
I planned to use this in a Hive transform
call. It seems to work when run locally:
$ echo '["A","A","A","B","B","B","C","C","C","C","D"]' | python abbreviate_list.py
['A', 'B', 'C', 'D']
However, when I add the file and try to execute within Hive, an error is returned:
hive> add file abbreviateList.py;
Added resource: abbreviateList.py
hive> select
> object_id,
> transform(journey_statuses) using 'python abbreviateList.py' as journey_statuses_abbreviated
> from journeys;
NoViableAltException( ... wall of Java error messages ... )
FAILED: ParseException line 3:2 cannot recognize input near 'transform' '(' 'journey_statuses' in select expression
Can you see what I'm doing wrong?