I am trying to use Python's fuzzywuzzy package in a mapper program to compute edit distance. The program runs fine on my local machine but fails on an AWS EMR cluster. I tried the two approaches below (on both the local machine and the AWS EMR cluster):
1. By installing fuzzywuzzy:
I installed fuzzywuzzy using pip on both the master and slave nodes. If I comment out the last 4 lines of the code below, I do not get any error. But I want to use fuzzywuzzy in my program.
#!/usr/bin/python
import re
import sys
import os
import csv

desc_dict = {}

# Load the query keys; query_set ends up holding the last row of Keys.csv.
with open('Keys.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        query_set = row

# Read comma-separated records from stdin and index the description field.
for line in sys.stdin:
    line = line.strip()
    row = line.split(',')
    if len(row) > 2:
        desc_dict[(int(row[0]), row[1])] = row[2].lower().encode('utf-8')

from fuzzywuzzy import *
import fuzzywuzzy.fuzz
import fuzzywuzzy.utils
print fuzzywuzzy.fuzz.partial_ratio("this is a test", "this is a test!")
I get the error below:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
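This message only says that the mapper process exited with a non-zero status; the Python traceback itself is not shown. Below is a minimal sketch of one way to surface it, assuming the mapper's stderr is captured in the Hadoop task attempt logs. `import_or_report` is a hypothetical helper written for illustration, not part of fuzzywuzzy or Hadoop.

```python
import importlib
import sys
import traceback


def import_or_report(module_name, stream=sys.stderr):
    """Try to import module_name; on failure, write the traceback to stream.

    Returns the imported module, or None if the import raised an exception.
    """
    try:
        return importlib.import_module(module_name)
    except Exception:
        # Assumption: whatever is written to stderr ends up in the task's
        # stderr log, where the real cause of "code 1" can be read.
        stream.write("import of %s failed:\n" % module_name)
        stream.write(traceback.format_exc())
        stream.flush()
        return None
```

In the mapper this could be called as `fuzz = import_or_report("fuzzywuzzy.fuzz")`, exiting only if the result is `None`. The same try/except pattern could be wrapped around the stdin loop, since an exception raised there would also produce exit code 1.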
2. Without installing fuzzywuzzy:
I could run the above map-reduce program on my local machine without installing fuzzywuzzy. When I tried the same on AWS EMR, it failed.
I zipped the fuzzywuzzy package ("temp.zip") and imported it in my map program. I also copied the temp.zip file to the slave nodes.
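#!/usr/bin/python
import re
import sys
import os
import csv

desc_dict = {}

# Load the query keys; query_set ends up holding the last row of Keys.csv.
with open('Keys.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        query_set = row

# Read comma-separated records from stdin and index the description field.
for line in sys.stdin:
    line = line.strip()
    row = line.split(',')
    if len(row) > 2:
        desc_dict[(int(row[0]), row[1])] = row[2].lower().encode('utf-8')

# Make the zipped package importable; the path is relative to the working directory.
sys.path.insert(0, 'temp.zip')
from fuzzywuzzy import *
import fuzzywuzzy.fuzz
print fuzzywuzzy.fuzz.partial_ratio("this is a test", "this is a test!")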
I get the error below:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
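The zip approach depends on the archive layout: Python can import a package from a zip on `sys.path` only if the package directory sits at the top level of the archive (for example `fuzzywuzzy/__init__.py`, not `temp/fuzzywuzzy/__init__.py`), and only if the zip path resolves from the process's working directory. Below is a minimal sketch of that mechanism using a hypothetical stand-in package named `demo_pkg`. The helper names are also made up for illustration, and the sketch does not cover how the zip reaches the task nodes.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_package_zip(zip_path, package_name, source):
    """Write a zip whose top-level entry is package_name/__init__.py."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr(package_name + "/__init__.py", source)


def import_from_zip(zip_path, package_name):
    """Put zip_path at the front of sys.path and import package_name from it."""
    # An absolute path avoids depending on the current working directory.
    sys.path.insert(0, os.path.abspath(zip_path))
    return importlib.import_module(package_name)


# Hypothetical stand-in for the real package, used only to show the layout.
tmp_dir = tempfile.mkdtemp()
demo_zip = os.path.join(tmp_dir, "temp.zip")
build_package_zip(demo_zip, "demo_pkg", "def greet():\n    return 'hello'\n")

demo_pkg = import_from_zip(demo_zip, "demo_pkg")
print(demo_pkg.greet())
```

Listing the real archive with `zipfile.ZipFile('temp.zip').namelist()` would show whether `fuzzywuzzy/__init__.py` is a top-level entry.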
Can someone explain what is wrong with my code, or how to run fuzzywuzzy on Hadoop?