Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1166 questions
0
votes
1 answer

Error in executing Customised WordCount jar in AWS EMR

Hi, I am trying to execute a customised WordCount jar on AWS EMR. My word count jar works properly: I added it as a step without job arguments and it ran successfully. My problem is when I run it with job arguments. In my S3 I…
sa_nyc
  • 971
  • 1
  • 13
  • 23
0
votes
2 answers

How should data files be included to mrjob on EMR?

I am trying to run an mrjob job on Amazon's EMR. I've tested the job locally using the inline runner, but it fails when running on Amazon. I've narrowed the failure down to my dependency on an external data file, zip_codes.txt. If I run without that…
fixedpoint
  • 1,575
  • 1
  • 17
  • 24
0
votes
1 answer

Script to unpack Python 2.7 at bootstrap on Amazon EMR node

I've got Python scripts that require version 2.7. Installing Python 2.7 at bootstrap time on EMR using a bash script is easy enough but takes too long. AWS support suggested I compile Python 2.7 locally, tar the installation and unpack it at…
0
votes
1 answer

How to change an EMR job configuration using the C# awssdk API

I want the output of my reducer to be compressed (preferably gzip). I can successfully launch an EMR job using the C# awssdk, but I do not know how to change the job configuration to get the desired result. I understand I need to set the following…
user2330278
  • 67
  • 10
0
votes
1 answer

Bootstrap action for EMR

While bootstrapping on AWS EMR, I am getting the following error. Any clues how to resolve it? /mnt/var/lib/bootstrap-actions/1/STAR: /lib/libc.so.6: version 'GLIBC_2.14' not found (required by /mnt/var/lib/bootstrap-actions/1/STAR)
0
votes
2 answers

EMR custom logging from mapper and reducer

Is it possible to have custom logs from mappers and reducers in EMR? Let's say I have a mapper which goes through data and filters based on certain conditions. Mapper code (streaming): look at the input line; if the useragent is bad, LOG into a custom…
user2330278
  • 67
  • 10
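For the streaming mapper described in the excerpt above, one common approach is to write diagnostics to stderr, which Hadoop Streaming keeps in the task attempt logs, and to emit `reporter:counter:` lines so the job counters reflect how many records were filtered. The following is a minimal sketch only: the `BAD_AGENT_MARKERS` tuple, the `is_bad_useragent` helper and the counter names `Filter` and `BadUserAgent` are hypothetical and stand in for whatever condition the real mapper applies.

```python
import sys

# Hypothetical filter condition; the real rule is not given in the question.
BAD_AGENT_MARKERS = ("bot", "crawler")


def is_bad_useragent(line):
    """Return True when the line contains one of the assumed bad markers."""
    lowered = line.lower()
    return any(marker in lowered for marker in BAD_AGENT_MARKERS)


def run_mapper(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Pass good lines through on stdout and log filtered lines on stderr."""
    for line in stdin:
        line = line.rstrip("\n")
        if is_bad_useragent(line):
            # Free-form text on stderr ends up in the task attempt's stderr log.
            stderr.write("filtered bad useragent: %s\n" % line)
            # Hadoop Streaming treats this line format as a counter update.
            stderr.write("reporter:counter:Filter,BadUserAgent,1\n")
            continue
        stdout.write(line + "\n")
```

To use this as an actual streaming mapper, the script would call `run_mapper()` at the bottom of the file. Keeping the streams as parameters makes the filter easy to check locally before submitting a step.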
0
votes
1 answer

Elastic MapReduce "keep alive" specification in the Java API

How do I set the jobflow to "keep alive" in the Java API, as I do on the command line like this: elastic-mapreduce --create --alive ... I have tried to add withKeepJobFlowAliveWhenNoSteps(true) but this still makes the jobflow shut down when a step…
Julian
  • 483
  • 1
  • 6
  • 17
0
votes
0 answers

MultiThreadedMapper refuses to find Jar

For some reason every time I run this program (both in Eclipse and on EMR) I get the message 13/07/18 13:22:23 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). A few print…
Chenab
  • 93
  • 8
0
votes
0 answers

Am I spawning more threads than I think I am in my mapper?

I'm attempting to make a web parser using and since by nature there is downtime while the program retrieves the document from I made it multithreaded. The idea being that my Threads retrieve the URLS from a url pile. This tripled the speed of the…
Chenab
  • 93
  • 8
0
votes
1 answer

How can I read and write binary files in Cascading?

I want to load some files in binary format (for example JPEGs, but it could be any binary format), manipulate them somehow and write them back. I want to do that on Hadoop, and I would like to write it on top of the Cascading framework. Are there binary sinks /…
polo
  • 1,352
  • 2
  • 16
  • 35
0
votes
1 answer

pig aws emr jython serialization error

I am trying to run a trivial Python UDF in Pig on Amazon EMR and it throws a Java serialization error: java.io.IOException: Deserialization error: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction' with arguments…
n2ygk
  • 449
  • 1
  • 5
  • 11
0
votes
1 answer

Slow Hive Query Performance under AWS Elastic MapReduce

There's a strange problem I'm experiencing, and I assure you I've googled a lot. I'm running a set of AWS Elastic MapReduce Clusters, and I have a Hive Table with about 16 partitions. They're created from emr-s3distcp (since there are about 216K…
aldrinleal
  • 3,559
  • 26
  • 33
0
votes
1 answer

Data set join using EMR

I have 2 tab-delimited datasets stored in AWS S3. I am trying to write an EMR job that will join these 2 datasets based on a common key (a set of field values). My current version populates 2 lists and compares them line by line, outputting the rows…
Zihs
  • 347
  • 2
  • 4
  • 17
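As an alternative to the line-by-line comparison of two lists that the excerpt above describes, a hash join indexes one dataset by its key and then looks up each row of the other. The sketch below is illustrative only: the `join_on_key` name and the `key_fields` parameter are hypothetical, and it assumes the key columns sit at the same positions in both files and that the indexed dataset fits in memory on one node. For larger inputs a reduce-side join keyed on the same fields would be the usual MapReduce pattern.

```python
from collections import defaultdict


def join_on_key(left_lines, right_lines, key_fields=(0,)):
    """Inner-join two tab-delimited datasets on the given column indexes."""
    # Index the right-hand rows by key so each left row needs one lookup.
    index = defaultdict(list)
    for line in right_lines:
        fields = line.rstrip("\n").split("\t")
        key = tuple(fields[i] for i in key_fields)
        index[key].append(fields)

    joined = []
    for line in left_lines:
        fields = line.rstrip("\n").split("\t")
        key = tuple(fields[i] for i in key_fields)
        for match in index.get(key, []):
            # Append only the non-key columns of the matching right-hand row.
            extra = [v for i, v in enumerate(match) if i not in key_fields]
            joined.append("\t".join(fields + extra))
    return joined
```

For example, `join_on_key(["1\ta", "2\tb"], ["1\tx", "3\ty"])` keeps only the row whose key appears in both inputs.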
0
votes
1 answer

Splitting a file using Map Reduce

I would like to split the content of a text file into 2 different files using EMR. The input file, as well as the mapper and reducer scripts, are all stored in AWS S3. Currently, my mapper reformats the inputs of stdin by tab-delimiting each field…
Zihs
  • 347
  • 2
  • 4
  • 17
0
votes
1 answer

How to merge the small files on S3 generated by EMR with thousands of reducers

My Cascalog EMR job generated thousands of small files in S3 buckets. It generates the same number of files as the number of reducers I used. Dumping all these tiny files takes minutes. I wonder if there is a way to concatenate them on S3 so that I can…
rninja
  • 540
  • 1
  • 4
  • 12