Program fails on AWS EMR with hadoop (OK on local machine)

Question

I am trying to use Python's fuzzywuzzy package in a mapper program to compute edit distance. The program runs fine on my local machine but fails on an AWS EMR cluster. I tried the two approaches below (on both the local machine and the AWS EMR cluster):

1. By installing fuzzywuzzy:

I installed fuzzywuzzy using pip on both the master and slave nodes. If I comment out the last 4 lines of the code below, I do not get any error. But I want to use fuzzywuzzy in my program.

#!/usr/bin/python
import re
import sys
import os
import csv

desc_dict = {}
with open('Keys.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
            query_set = row

for line in sys.stdin:
  line = line.strip() 
  row = line.split(',')
  if(len(row)>2):
      desc_dict[(int(row[0]), row[1])] = (row[2].lower()).encode('utf-8')
from fuzzywuzzy import *
import fuzzywuzzy.fuzz
import fuzzywuzzy.utils
print fuzzywuzzy.fuzz.partial_ratio("this is a test", "this is a test!")

I get the error below:

 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

2. Without installing fuzzywuzzy

I could run the above map-reduce program on my local machine without installing fuzzywuzzy. When I tried the same on AWS EMR, it failed.

I zipped the fuzzywuzzy package ("temp.zip") and imported it in my map program. I also copied temp.zip to the slave nodes.

#!/usr/bin/python
import re
import sys
import os
import csv

desc_dict = {}
with open('Keys.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
            query_set = row

for line in sys.stdin:
  line = line.strip() 
  row = line.split(',')
  if(len(row)>2):
      desc_dict[(int(row[0]), row[1])] = (row[2].lower()).encode('utf-8')

sys.path.insert(0,'temp.zip')
from fuzzywuzzy import *
import fuzzywuzzy.fuzz
print fuzzywuzzy.fuzz.partial_ratio("this is a test", "this is a test!")

I get the error below:

 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
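For the zip approach, `sys.path.insert(0, 'temp.zip')` can only find the package if the relative path resolves from the mapper's working directory and the archive has the `fuzzywuzzy/` directory at its top level. The sketch below illustrates both points with a throwaway package. The names `demo_pkg`, `build_demo_zip`, and `import_from_zip` are hypothetical, and whether either condition was the actual cause of the failure on EMR is not established here.

```python
import os
import sys
import tempfile
import zipfile


def build_demo_zip(zip_path, package_name="demo_pkg"):
    """Write a zip whose root contains <package_name>/__init__.py."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        # The package directory must sit at the top level of the archive,
        # not nested under an extra folder such as temp/<package_name>/.
        zf.writestr(package_name + "/__init__.py", "ANSWER = 42\n")


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path and import module_name from it."""
    # An absolute path avoids depending on the process's working directory.
    sys.path.insert(0, os.path.abspath(zip_path))
    return __import__(module_name)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    archive = os.path.join(workdir, "temp.zip")
    build_demo_zip(archive)
    module = import_from_zip(archive, "demo_pkg")
    print(module.ANSWER)
```

Listing the archive with `zipfile.ZipFile('temp.zip').namelist()` on a node is a quick way to confirm that the entries start with `fuzzywuzzy/` rather than with an extra parent folder.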

Can someone point out what is wrong with my code, or explain how to run fuzzywuzzy on Hadoop?

score 0 · Answer 1 · answered Jan 05 '15 at 05:54

I got fuzzywuzzy working by copying the fuzzywuzzy install files to the master and slave nodes and then installing it manually:

python setup.py install

pip install did not actually install fuzzywuzzy, even though it reported success.
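One way to check whether a node's interpreter can actually see the package is a small probe like the one below. It is only a sketch: the helper name `probe_import` is made up for illustration, and the idea that pip may have installed into a different Python than the `/usr/bin/python` named in the mapper's shebang is an assumption, not something confirmed above. The probe writes the interpreter path and any import traceback to stderr, so the details end up in the task's stderr log instead of being hidden behind "subprocess failed with code 1".

```python
from __future__ import print_function

import importlib
import sys
import traceback


def probe_import(module_name):
    """Try to import module_name; return (ok, interpreter_path).

    On failure, write the traceback to stderr rather than stdout, so it
    does not get mixed into the mapper's output records.
    """
    try:
        importlib.import_module(module_name)
        return True, sys.executable
    except ImportError:
        print("import of %r failed under %s" % (module_name, sys.executable),
              file=sys.stderr)
        traceback.print_exc(file=sys.stderr)
        return False, sys.executable


if __name__ == "__main__":
    ok, interpreter = probe_import("fuzzywuzzy")
    print("%s: fuzzywuzzy importable = %s" % (interpreter, ok),
          file=sys.stderr)
```

Running this with the same interpreter as the mapper's shebang line, on the master and on each slave node, shows whether the package is visible to that specific Python.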

Chandra
