
I want to put some constants in one Python file and import it into another. I created two files, one with constants and one that imports it, and everything runs fine locally:

constants.py:

CONST = "hi guy"

test_constants.py:

from constants import CONST
import sys

for line in sys.stdin:
    print(CONST)

Local test:

$ echo "dummy" | python test_constants.py
hi guy

Test using Hive (beeline):

hive> add file hdfs://path/.../test_constants.py;
No rows affected (0.191 seconds)
hive> add file hdfs://path/.../constants.py;
No rows affected (0.049 seconds)
hive> list files;
resource
/tmp/bb09f878-7e36-4aa2-8566-a30950072bcb_resources/test_constants.py
/tmp/bb09f878-7e36-4aa2-8566-a30950072bcb_resources/constants.py
2 rows selected (0.179 seconds)
hive> with t as (select 1 as dummy) 
  select transform (dummy) 
  using 'python test_constants.py' 
  as dummy_out 
  from t;
Error: org.apache.hive.service.cli.HiveSQLException: 
Error while processing statement: FAILED: 
Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. 
Vertex failed, vertexName=Map 1, vertexId=vertex_1535407036047_170618_1_00, diagnostics=[Task failed, taskId=task_1535407036047_170618_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1535407036047_170618_1_00_000000_0:
java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators

The logs look like this:

Log Type: stderr
Log Upload Time: Mon Oct 29 15:50:42 -0700 2018
Log Length: 251

2018-10-29 15:45:16 Starting to run new task attempt: attempt_1535407036047_170618_1_00_000000_3
Traceback (most recent call last):
  File "test_constants.py", line 1, in <module>
    from constants import CONST
ImportError: No module named constants

According to list files, both scripts end up in the same resources folder, so the import seems like it should work, but on the cluster it doesn't.
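
To see what the task containers actually receive, one option is a throwaway transform script that dumps its environment to stderr, which ends up in the same Tez stderr log quoted above. This is only a debugging sketch; the file name debug_env.py and its contents are not from the original post.

debug_env.py:

import os
import sys

# Write the container-side view to stderr; it shows up in the task's stderr log.
sys.stderr.write("cwd: %s\n" % os.getcwd())
sys.stderr.write("files in cwd: %r\n" % os.listdir("."))
sys.stderr.write("sys.path: %r\n" % sys.path)

# Echo the input so the transform still emits one row per input row.
for line in sys.stdin:
    print(line.strip())

Add it with the same add file pattern and call it with using 'python debug_env.py' in place of the real script.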

Added 2018-10-30:

The answer by @serge_k works. I initially had trouble, though, because the HDFS path where I kept my Python UDFs was not accessible to Hive; after moving all of the files into /tmp on HDFS, everything worked as expected.

hive> add file hdfs://dev/tmp/transforms;
No rows affected (0.108 seconds)
hive> list files;
resource
/tmp/61ecb363-ead6-4679-8f58-3611db9487b2_resources/transforms
1 row selected (0.202 seconds)
hive> select transform (col) using 'python transforms/test_constants.py' as dummy_out from dummy.test;
dummy_out
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
hi guy
10 rows selected (63.734 seconds)

1 Answer


Place your Python scripts in one folder, e.g. files, add the whole folder to the distributed cache, and call the script as python files/script_name.py:

hive> add file ./files;
Added resources: [./files]
hive> with t as (select 1 as dummy) select transform (dummy) 
      using 'python files/test_constants.py' as dummy_out from t;

OK
hi guy
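
If you prefer to keep adding the two files individually instead of a folder, a defensive variant of test_constants.py (a sketch, not part of the original answer; it assumes the ImportError comes from sys.path not containing the directory the files were localized into) is to put the script's own directory on the import path before importing:

import os
import sys

# Search the directory containing this script first, so the sibling
# constants.py is found regardless of where Hive launches the interpreter.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from constants import CONST

for line in sys.stdin:
    print(CONST)

Adding the whole folder, as shown above, is still the simpler route, since the relative path files/test_constants.py fixes the lookup on the Hive side rather than inside the script.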
  • I initially got an error around permissions, since the beeline user did not have execute access to the entire folder. Copying the folder of transforms to `/tmp` on HDFS made it work. Thanks! – Michael K Oct 30 '18 at 21:52