
We are trying to write Hive UDFs in Python to clean the data. The UDF that uses Pandas is throwing the error below.

When we run another Python script that does not use Pandas, it works fine. Kindly help us understand the problem. The Pandas code is provided below:

We have already tried various approaches with Pandas but unfortunately had no luck. Since the other Python script without Pandas works fine, we are confused about why this one is failing.

import sys
import pandas as pd
import numpy as np
for line in sys.stdin:
    # split the tab-delimited record streamed in by Hive into fields
    df = line.split('\t')
    df1 = pd.DataFrame(df)
    # transpose so each field becomes a column of a one-row frame
    df2 = df1.T
    # replace invalid values with NaN: column 0 must be alphabetic,
    # column 1 numeric, column 2 exactly 10 characters long
    df2[0] = np.where(df2[0].str.isalpha(), df2[0], np.nan)
    df2[1] = np.where(df2[1].astype(str).str.isdigit(), df2[1], np.nan)
    df2[2] = np.where(df2[2].astype(str).str.len() != 10, np.nan, df2[2].astype(str))
    #df2[3] = np.where(df2[3].astype(str).str.isdigit(), df2[3], np.nan)
    # drop the row if any field failed validation, then print the DataFrame
    df2 = df2.dropna()
    print(df2)

I get this error:

FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. An error occurred when trying to close the Operator running your custom script.
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
  • You cannot return a pandas.DataFrame object as the output of your Python UDF. To make it work properly, you are supposed to return a string with tab as the field delimiter and `\n` as the line separator if you need multi-line output, e.g., `1\t2\n3\t4`. So you need to convert your `df2` to a string. – serge_k Apr 18 '19 at 07:05
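
Building on that comment, here is a minimal sketch of the same script that serializes the cleaned row as a tab-delimited line instead of printing the DataFrame; the column indices and validation rules are simply copied from the question, so the exact field layout is an assumption:

import sys
import pandas as pd
import numpy as np

for line in sys.stdin:
    # same cleaning logic as in the question
    fields = line.rstrip('\n').split('\t')
    df2 = pd.DataFrame(fields).T
    df2[0] = np.where(df2[0].str.isalpha(), df2[0], np.nan)
    df2[1] = np.where(df2[1].astype(str).str.isdigit(), df2[1], np.nan)
    df2[2] = np.where(df2[2].astype(str).str.len() != 10, np.nan, df2[2].astype(str))
    df2 = df2.dropna()
    # emit each surviving row as '\t'-joined values so Hive can parse the columns
    for row in df2.itertuples(index=False):
        print('\t'.join(str(v) for v in row))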

1 Answer


I think you'll need to look at the detailed job logs for more information. My first guess is that Pandas is not installed on a data node.

This answer looks appropriate for you if you intend to bundle dependencies with your job: https://stackoverflow.com/a/2869974/7379644
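
If the suspicion is a missing Pandas install, one way to confirm it (a sketch, not part of the original answer) is to guard the import inside the streaming script and write the failure to stderr, which shows up in the failed task's logs:

import sys

try:
    import pandas as pd
    import numpy as np
except ImportError as exc:
    # this message ends up in the task's stderr log on the data node
    sys.stderr.write('Missing Python dependency on this node: {}\n'.format(exc))
    sys.exit(1)

If Pandas imports cleanly on every node, the failure is more likely the DataFrame-printing issue noted in the comment above.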

Douglas M