
I am trying to write a UDF in Python that will be called from a Pig script. The UDF needs to accept a date as a string in DD-MMM-YYYY format and return it in DD-MM-YYYY format. Here MMM is one of JAN, FEB, ..., DEC, and the returned MM is 01, 02, ..., 12.

Below is my Python UDF:

#!/usr/bin/python

@outputSchema("newdate:chararray")
def GetMonthMM(inputString):
    print inputString
    #monthstring = inputString[3:6]
    sl = slice(3,6)
    monthstring = inputString[sl]
    monthdigit = ""

    if ( monthstring == "JAN" ):
        monthdigit = "01"
    elif ( monthstring == "FEB"):
        monthdigit = "02"
    elif(monthstring == "MAR"):
        monthdigit = "03"
    elif(monthstring == "APR"):
        monthdigit = "04"
    elif(monthstring == "MAY"):
        monthdigit = "05"
    elif (monthstring == "JUN"):
        monthdigit = "06"
    elif (monthstring == "JUL"):
        monthdigit = "07"
    elif (monthstring == "AUG"):
        monthdigit = "08"
    elif (monthstring == "SEP"):
        monthdigit = "09"
    elif (monthstring == "OCT"):
        monthdigit = "10"
    elif (monthstring == "NOV"):
        monthdigit = "11"
    elif (monthstring == "DEC"):
        monthdigit = "12"

    sl1 = slice(0,3)
    sl2 = slice(6,11)
    str1 = inputString[sl1]
    str2 = inputString[sl2]

    newdate = str1 + monthdigit + str2
    return monthstring;

I did some debugging, and the issue seems to be that after slicing, the strings are treated as arrays. I get the following error message:

TypeError: unsupported operand type(s) for +: 'array.array' and 'str'

The same happens when the string is compared to another string, as in if (monthstring == "DEC"):. Even when monthstring has DEC as its value, the condition is never satisfied.

Has anybody faced the same issue before? Any ideas on how to fix this?

S L
Kiran Vajja
  • Side note: why not use a `dict` containing pairs like `"Jan":"01"` instead of this `elif` forest? Creating this `dict` might be easy using the `calendar` module. – quapka Apr 29 '16 at 17:30
  • Can't reproduce - your code works fine for me on Python 2.7.10, once I replace `return monthstring` with `return newdate`. **Also,** what line gives the error? Please edit your question and mark `#####` or something next to the error line. Thanks! – cxw Apr 29 '16 at 17:34
  • seems to work under python 2.7 – S L Apr 29 '16 at 17:43
  • I've tested this code with @cxw change mentioned above in IPython Notebook, with Python 3 and it works. – quapka Apr 29 '16 at 17:55
  • The code works fine when executed standalone in Python. I am getting the error when I register the function as UDF in a pig script and pass dates from the pig script. – Kiran Vajja Apr 29 '16 at 18:00
  • Add `print type(inputString)` at the top and please tell us what you get. I'm guessing it's not actually a `str`, so `str1` and `str2` sliced from it also aren't `str`s. If that is the case, you could try `str1 = str(inputString[sl1])` and same for `str2`. – cxw Apr 29 '16 at 18:11
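
The conversion suggested in the last comment could be sketched roughly as follows. This is only a sketch under the assumption that the value reaches the UDF as an `array.array` of ASCII bytes, which the error message suggests but does not prove; `to_text` is a hypothetical helper name, not part of Pig or Jython.

```python
import array


def to_text(value):
    """Return value as a text string, converting array.array input if needed.

    Assumption: an array.array input holds ASCII-encoded bytes.
    """
    if isinstance(value, array.array):
        # tobytes() exists on newer Pythons; tostring() is the older spelling.
        to_bytes = getattr(value, 'tobytes', None) or value.tostring
        return to_bytes().decode('ascii')
    # Anything else (for example a normal string) is returned unchanged.
    return value
```

Calling `to_text(inputString)` at the top of the UDF, before any slicing or comparison, would then let the rest of the function work on a real string.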

2 Answers


I would write this function like this:

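#!/usr/bin/python
@outputSchema("newdate:chararray")
def GetMonthMM(inputString):
    monthArray = {'JAN':'01','FEB':'02','MAR':'03','APR':'04','MAY':'05','JUN':'06','JUL':'07','AUG':'08','SEP':'09','OCT':'10','NOV':'11','DEC':'12'}
    print inputString
    # Assumption: Pig/Jython may hand the field over as an array.array instead of a string;
    # if so, convert it to a string before splitting.
    if hasattr(inputString, 'tostring'):
        inputString = inputString.tostring()
    dateparts = inputString.split('-') # assuming the date is always separated by -
    dateparts[1] = monthArray[dateparts[1]] # replace the month abbreviation with its two-digit number
    return '-'.join(dateparts)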
Walter_Ritzel
  • The code fails at the split function with error message: AttributeError: 'array.array' object has no attribute 'split' – Kiran Vajja Apr 29 '16 at 18:18
  • The expected input should be a string, not an array. If the input is an array, then that line needs to be changed. – Walter_Ritzel Apr 29 '16 at 18:22
  • I have fixed the code to consider that your inputString is an array. – Walter_Ritzel Apr 29 '16 at 18:26
  • Thank you Walter. I guess that is where the problem is. Your previous code works fine when executed standalone in python shell. But when I call it from a pig script the Jython interpreter for some reason is considering the string as an array. – Kiran Vajja Apr 29 '16 at 18:41

I recently used the calendar module for something similar. It might be more useful in other cases, but here it is anyway.

import calendar
m_dict = {}
for i, month in enumerate(calendar.month_abbr[1:]): # month_abbr[0] is '', so omit it
    m_dict[month.lower()] = '{:02}'.format(i+1)

def GetMonthMM(inputStr):
    day, month, year = inputStr.split('-')
    return '-'.join([day, m_dict[month.lower()], year])

print(GetMonthMM('01-JAN-2015'))
# prints 01-01-2015
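
One caveat: `calendar.month_abbr` follows the current locale, so the abbreviations are not guaranteed to be the English ones. A minimal locale-independent sketch, which also reports unknown months instead of failing with a bare `KeyError`, might look like this. The names `MONTHS`, `MONTH_TO_MM` and `convert_date` are hypothetical, chosen only for illustration.

```python
# Fixed English abbreviations, so the mapping does not depend on the locale.
MONTHS = ('JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC')
MONTH_TO_MM = dict((name, '%02d' % (i + 1)) for i, name in enumerate(MONTHS))


def convert_date(date_str):
    """Convert DD-MMM-YYYY to DD-MM-YYYY, assuming '-' separators."""
    day, month, year = date_str.strip().split('-')
    try:
        mm = MONTH_TO_MM[month.upper()]
    except KeyError:
        raise ValueError('unknown month abbreviation: %r' % month)
    return '-'.join([day, mm, year])


print(convert_date('01-JAN-2015'))
# prints 01-01-2015
```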
quapka