
I have a hive table in the format :

    col1   col2   col3
    a1     b1     c1
    a1     b1     c2
    a1     b2     c2
    a1     b2     c3
    a2     b3     c1
    a2     b4     c1
    a2     b4     c2
    a2     b4     c3
    .
    .

Every value in col1 can have multiple values in col2 and every such pair of (col1, col2) can have multiple values of col3.

I am running the query [Q]:

select col1, col2, collect_list(col3) from {table} group by col1, col2;

to get:

a1   b1   [c1, c2]
a1   b2   [c2, c3]
a2   b3   [c1]
a2   b4   [c1, c2, c3] 
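For reference, the grouping that `collect_list` with `group by col1, col2` performs can be emulated in plain Python over the sample rows above (a sketch, not Hive itself):

```python
from itertools import groupby

# the sample rows from the table, already sorted on (col1, col2),
# which is what itertools.groupby requires
rows = [
    ("a1", "b1", "c1"), ("a1", "b1", "c2"),
    ("a1", "b2", "c2"), ("a1", "b2", "c3"),
    ("a2", "b3", "c1"), ("a2", "b4", "c1"),
    ("a2", "b4", "c2"), ("a2", "b4", "c3"),
]

# group by (col1, col2) and collect the col3 values of each group
grouped = [
    (col1, col2, [r[2] for r in grp])
    for (col1, col2), grp in groupby(rows, key=lambda r: (r[0], r[1]))
]
print(grouped)
```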

I want to do some transformations using a Python UDF, so I'm passing all these columns to the UDF using the TRANSFORM clause:

select TRANSFORM ( * ) using 'python udf.py' FROM
(
  select col1, col2, concat_ws('\t', collect_list(col3)) from {table} group by col1, col2
) t;

I'm using `concat_ws` to convert the array output of `collect_list` to a string joined by a separator. I get col1 and col2 in the result, but col3 is missing:

+---------+---------+
|      key|    value|
+---------+---------+
|a1       | b1      |
|         |     null|
|a1       | b2      |
|         |     null|
|a2       | b3      |
|         |     null|
|a2       | b4      |
|         |     null|
+---------+---------+
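One thing worth noting about the setup above: TRANSFORM feeds the subquery's columns to the script joined by tabs, so a col3 string built with `concat_ws('\t', ...)` contains tabs that are indistinguishable from the column separators. A small sketch of what the script would see on stdin (the line below is a hypothetical example, not captured Hive output):

```python
# Hive sends each row to the TRANSFORM script as columns joined by '\t',
# terminated by '\n'. If col3 was itself built with concat_ws('\t', ...),
# its internal tabs look like extra column separators.
line = "a1\tb1\tc1\tc2\n"  # intended columns: a1, b1, "c1\tc2"

fields = line.rstrip("\n").split("\t")
print(fields)       # four fields instead of the intended three
print(len(fields))
```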

In my UDF, I just have a print statement that prints the line received from stdin.

import sys

for line in sys.stdin:
    try:
        print(line)
    except Exception:
        continue

Can someone help me figure out why I'm not getting col3 in my UDF?

  • Can you post the code of your UDF? it's hard to say what's wrong without the code. Plus `\t` in `concat_ws` is not the best separator as tab is used to separate columns. As alternative you may use comma or semicolon. – serge_k Aug 21 '20 at 08:28
  • @serge_k I have added the UDF code. Also, I have already tried comma, semicolon etc with `concat_ws`. Still getting the same result. Is there any other alternative function to use in place of `concat_ws` ? – martian Aug 22 '20 at 12:24

1 Answer


First, you need to parse the line in the Python UDF, e.g.:

import sys

for line in sys.stdin:
    try:
        # strip the trailing newline, then split the row into its columns
        line = line.strip('\n')
        col1, col2, col3 = line.split('\t')
        print('\t'.join([col1, col2, col3]))
    except Exception:
        continue

Then it's better to use a separator other than `\t` in `concat_ws`, since tab already delimits the columns:

select TRANSFORM ( * ) using 'python udf.py' as (col1, col2, col3)
FROM
(
  select col1, col2, concat_ws(',', collect_list(col3)) from {table} group by col1, col2
) t;
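The fixed script can be exercised locally by piping sample rows into it. A sketch of the same parsing logic run against an in-memory stream, assuming comma-joined col3 values as in the query above:

```python
import io

# simulate the rows Hive would stream to the script after the fix:
# three tab-separated columns, with col3 comma-joined
sample = "a1\tb1\tc1,c2\na1\tb2\tc2,c3\n"

out = []
for line in io.StringIO(sample):
    col1, col2, col3 = line.rstrip("\n").split("\t")
    out.append("\t".join([col1, col2, col3]))

print("\n".join(out))
```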
serge_k
  • Thank you. This worked. I had tried split('\t') before, not sure what was problem then. – martian Aug 23 '20 at 08:33
  • Anyway, in my UDF, I'm trying to aggregate all info for each unique col1 in a dict and dump to json. something like: `d = { "col1": "a1", "col2": [ "b1", "b2"], col3: ["c1", "c2", "c3"]}` So in my UDF, I do a `print json.dumps(d)` after i've created the JSON by agg. of rows. But I end up getting: `{"col1": "a1", "col2" : "b1", "col3": ["c1", "c2"] }`, `{"col1": "a1", "col2" : "b2", "col3": ["c2", "c3"] }` `{"col1": "a2", "col2" : "b3", "col3": ["c1"] } ... ` I would expect `group by` to send rows in UDF grouped by the given columns, so the agg on col1 would work. Not sure whats wrong. – martian Aug 23 '20 at 08:48
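Regarding the follow-up comment: `group by` on the outer rows does not by itself guarantee that all rows for a given col1 arrive at the same TRANSFORM invocation in order; that usually requires clustering the subquery on col1 (e.g. `cluster by col1`, or `distribute by col1 sort by col1`). Assuming the input rows do arrive sorted on col1, the per-col1 aggregation into one JSON document could be sketched like this (the row values are illustrative):

```python
import json
from itertools import groupby

# rows as the script would receive them after parsing,
# already clustered and sorted on col1
rows = [
    ("a1", "b1", ["c1", "c2"]),
    ("a1", "b2", ["c2", "c3"]),
    ("a2", "b3", ["c1"]),
    ("a2", "b4", ["c1", "c2", "c3"]),
]

docs = []
for col1, grp in groupby(rows, key=lambda r: r[0]):
    grp = list(grp)
    col3 = []
    for _, _, vals in grp:
        for v in vals:
            if v not in col3:  # keep unique values, preserving order
                col3.append(v)
    docs.append({"col1": col1,
                 "col2": [r[1] for r in grp],
                 "col3": col3})

print(json.dumps(docs[0]))
```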