
I have data in the form:

  id,      movieid, date,       time
  3710100, 13502,   2012-09-10, 12:39:38.000

What I want to do is this:

I want to find out how many times a particular movie is watched between 7 am and 11 am, in 30-minute intervals (a small bucket-index sketch follows the list below).

That is, how many times the movie has been watched between

  7:00 and 7:30
  7:30 and 8:00
  ...
  10:30 and 11:00
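
One way to map a timestamp to one of those buckets is plain arithmetic rather than comparing against a list of boundaries; a minimal sketch, assuming half-hour buckets starting at 7:00:

import datetime

def bucket_index(ts, start_hour=7, n_buckets=8):
    # minutes elapsed since the start of the counting window
    elapsed = (ts.hour - start_hour) * 60 + ts.minute
    idx = elapsed // 30
    # None means the timestamp falls outside 7:00-11:00
    return idx if 0 <= idx < n_buckets else None

print bucket_index(datetime.datetime(2012, 9, 10, 7, 45))  # -> 1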

So I wrote a mapper and a reducer to achieve this.

mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the CSV record into its fields: id, movieid, date, time
    fields = [field.strip() for field in line.split(",")]
    # emit the movie id as the key and "date,time" as the value
    print '%s\t%s,%s' % (fields[1], fields[2], fields[3])
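
For the sample record above, the mapper emits the movie id as the key, then the date and time as the value (tab-separated):

  13502   2012-09-10,12:39:38.000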

reducer.py

#!/usr/bin/env python

import sys
import datetime
from collections import defaultdict


def convert_str_to_date(time_str):
    try:
        # e.g. "2012-09-10:12:39:38.000"
        return datetime.datetime.strptime(time_str, '%Y-%m-%d:%H:%M:%S.%f')
    except ValueError:
        # malformed record; the caller skips it
        return None


def is_between(time, time1, time2):
    return time1 <= time < time2


def increment_dict(data_dict, se10, date_time):
    # build the half-hour boundaries from 7:00 to 11:00 (8 buckets)
    start_time = datetime.datetime(date_time.year, date_time.month,
                                   date_time.day, 7, 0, 0)
    times = [start_time]
    for i in range(8):
        start_time += datetime.timedelta(minutes=30)
        times.append(start_time)
    for i in range(len(times) - 1):
        if is_between(date_time, times[i], times[i + 1]):
            data_dict[se10][i] += 1


keys = [0, 1, 2, 3, 4, 5, 6, 7]

data_dict = defaultdict(dict)


def initialize_entry(se10):
    for key in keys:
        data_dict[se10][key] = 0


# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py: "movieid\tdate,time"
    se10, orig_data = line.split('\t')

    # initialize only once per movie id, otherwise every new
    # record would wipe the counts accumulated so far
    if se10 not in data_dict:
        initialize_entry(se10)

    datestr, timestr = orig_data.split(",")
    time_stamp = convert_str_to_date(datestr + ":" + timestr)
    if time_stamp is not None:
        increment_dict(data_dict, se10, time_stamp)


for key, secondary_key in data_dict.items():
    for skey, freq in secondary_key.items():
        print '%s,%s,%s' % (key, skey, freq)

The above code runs just fine if I do

   cat input.txt | python mapper.py | sort | python reducer.py

But when I deploy it on the cluster, it fails, saying only that the job was killed; the reason is unknown.

Please help.

Thanks.

frazman

2 Answers


OK, I figured this out.

The main issue was that my local machine is Windows-based, whereas the cluster is Linux-based.

So I had to convert the files from DOS (CRLF) to Unix (LF) line endings before submitting the job, e.g. with the dos2unix tool.
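
If dos2unix is not at hand, the same conversion is easy to do in Python; a minimal sketch, assuming the scripts are small enough to read into memory whole:

# rewrite the streaming scripts in place with Unix line endings
for name in ('mapper.py', 'reducer.py'):
    with open(name, 'rb') as f:
        data = f.read()
    with open(name, 'wb') as f:
        f.write(data.replace('\r\n', '\n'))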

frazman

It is usually a good idea to read through the logs in the JobHistory server, as described in https://stackoverflow.com/a/24509826/1237813. They should give you more details on why the job failed.
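
If the cluster runs YARN, the aggregated logs can also be pulled from the command line; the application id is printed when the job is submitted:

   yarn logs -applicationId <application_id>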

Regarding line endings: the class Hadoop Streaming uses by default to split input into lines is TextInputFormat. It used to break on Windows newlines, but since 2006 it handles them just fine.

That leaves your mapper and reducer scripts as the likely source of problems. Python 3 uses universal newlines and should work out of the box with both Unix and Windows newlines. In Python 2.7, you need to switch it on explicitly.

On Linux and Mac OS X you can reopen stdin with universal newlines enabled like this: sys.stdin = open('/dev/stdin', 'U'). I do not have a Windows computer at hand to try, but the following should work on all three systems:

import os
import sys

# reopen sys.stdin with universal newlines enabled,
# keeping the reassigned file object
sys.stdin = os.fdopen(sys.stdin.fileno(), 'U')

for line in sys.stdin:
    …
user7610