Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
4
votes
1 answer

How to configure an Amazon EMR streaming job to use EC2 spot instances (Ruby CLI)?

When I create a streaming job with Amazon Elastic MapReduce (Amazon EMR) using the Ruby command line interface, how can I specify that only EC2 spot instances should be used (except for the master)? The command below is working, but it "forces" me to use at least 1…
Renaud
  • 16,073
  • 6
  • 81
  • 79
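
One possible approach, sketched here with boto (the Python SDK) rather than the Ruby CLI the question asks about: keep the master instance group on demand and mark the core/task groups as SPOT with a bid price. The bucket names, instance types and prices below are placeholders, not values from the question.

```python
# Minimal sketch: streaming job flow whose core/task nodes are spot
# instances while the master stays on-demand. All names, paths and
# prices are hypothetical.
import boto
from boto.emr.instance_group import InstanceGroup
from boto.emr.step import StreamingStep

conn = boto.connect_emr()  # credentials come from the environment/boto config

instance_groups = [
    InstanceGroup(1, 'MASTER', 'm1.small', 'ON_DEMAND', 'master'),
    InstanceGroup(4, 'CORE', 'm1.small', 'SPOT', 'core', bidprice='0.05'),
    InstanceGroup(4, 'TASK', 'm1.small', 'SPOT', 'task', bidprice='0.05'),
]

step = StreamingStep(
    name='example streaming step',
    mapper='s3n://example-bucket/mapper.rb',
    reducer='s3n://example-bucket/reducer.rb',
    input='s3n://example-bucket/input/',
    output='s3n://example-bucket/output/',
)

jobflow_id = conn.run_jobflow(
    name='streaming job on spot instances',
    log_uri='s3n://example-bucket/logs/',
    instance_groups=instance_groups,
    steps=[step],
)
print(jobflow_id)
```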
4
votes
1 answer

Hadoop converting \r\n to \n and breaking ARC format

I am trying to parse data from commoncrawl.org using Hadoop Streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code myself like cat 1262876244253_18.arc.gz |…
Ben Nagy
  • 163
  • 1
  • 6
4
votes
1 answer

How can correct data types be enforced in Apache Pig?

I am having trouble SUMming a bag of values due to a data type error. When I load a CSV file whose lines look like this: 6 574 false 10.1.72.23 2010-05-16 13:56:19 +0930 fbcdn.net static.ak.fbcdn.net 304 text/css 1 …
mindonaut
  • 43
  • 1
  • 5
4
votes
4 answers

Using Amazon MapReduce/Hadoop for Image Processing

I have a project that requires me to process a lot (1000-10000) of big (100MB to 500MB) images. The processing I am doing can be done via ImageMagick, but I was hoping to actually do this processing on Amazon's Elastic MapReduce platform (which I…
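
Not an answer from the listing, but a rough sketch of one way this is commonly done: a Hadoop Streaming mapper that receives S3 keys (one per input line), downloads each image, shells out to ImageMagick's convert, and uploads the result. It assumes ImageMagick is installed on the nodes (e.g. via a bootstrap action) and uses hypothetical bucket names.

```python
#!/usr/bin/env python
# Sketch of a streaming mapper for image processing on EMR.
# Input lines are S3 keys; bucket names and convert arguments are examples.
import os
import subprocess
import sys

import boto

conn = boto.connect_s3()
src = conn.get_bucket('example-input-bucket')
dst = conn.get_bucket('example-output-bucket')

for line in sys.stdin:
    key_name = line.strip()
    if not key_name:
        continue

    local_in = os.path.join('/tmp', os.path.basename(key_name))
    local_out = local_in + '.resized.jpg'

    # Download, convert with ImageMagick, upload the result.
    src.get_key(key_name).get_contents_to_filename(local_in)
    subprocess.check_call(['convert', local_in, '-resize', '50%', local_out])
    dst.new_key(key_name + '.resized.jpg').set_contents_from_filename(local_out)

    # Emit one record per image so Hadoop tracks progress.
    print('%s\tdone' % key_name)
```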
4
votes
1 answer

boto ElasticMapReduce throttling and rate limiting

I've run into rate limiting from Amazon EMR a few times via the boto API with the following: boto.exception.EmrResponseError: EmrResponseError: 400 Bad Request
poiuy
  • 500
  • 5
  • 12
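
One common workaround, sketched here under assumptions: wrap boto's EMR calls in an exponential-backoff retry so throttled requests are retried instead of raising. The retry count, delays and the job flow id are arbitrary placeholders.

```python
# Sketch: retry boto EMR calls with exponential backoff on throttling errors.
import time

import boto
from boto.exception import EmrResponseError


def with_backoff(call, retries=5, base_delay=1.0):
    """Run call(), retrying with exponential backoff on EMR API errors."""
    for attempt in range(retries):
        try:
            return call()
        except EmrResponseError as err:
            # Re-raise anything that is clearly not a rate-limit problem.
            if err.error_code not in ('Throttling', 'RequestLimitExceeded'):
                raise
            time.sleep(base_delay * (2 ** attempt))
    return call()  # final attempt; let any exception propagate


conn = boto.connect_emr()
# 'j-XXXXXXXXXXXXX' is a placeholder job flow id.
flow = with_backoff(lambda: conn.describe_jobflow('j-XXXXXXXXXXXXX'))
print(flow.state)
```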
4
votes
1 answer

Specifying additional jars in AWS EMR custom jar application

I am trying to run a Hadoop job on an EMR cluster. It is being run as a Java command for which I use a jar-with-dependencies. The job pulls data from Teradata, and I am assuming the Teradata-related jars are also packed within the jar-with-dependencies.…
Nik
  • 5,515
  • 14
  • 49
  • 75
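
Not from the question itself, but one commonly suggested route: pass the extra jars through Hadoop's generic -libjars option, which only takes effect if the job's main class parses generic options (e.g. via ToolRunner). A rough boto sketch; the S3 paths, jar names and job flow id are hypothetical.

```python
# Sketch: add a custom-jar step that passes extra jars via -libjars.
import boto
from boto.emr.step import JarStep

conn = boto.connect_emr()

step = JarStep(
    name='custom jar with extra jars',
    jar='s3n://example-bucket/jars/my-job-with-dependencies.jar',
    step_args=[
        '-libjars',
        's3n://example-bucket/jars/extra-dep-1.jar,s3n://example-bucket/jars/extra-dep-2.jar',
        's3n://example-bucket/input/',
        's3n://example-bucket/output/',
    ],
)

# 'j-XXXXXXXXXXXXX' is a placeholder for the running cluster's id.
conn.add_jobflow_steps('j-XXXXXXXXXXXXX', [step])
```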
4
votes
1 answer

How can I remove files from /usr/lib/hadoop/lib before running an EMR job on AMI 4.x?

I have a Hadoop job which uses version 1.5 of the commons-codec library. In order to make this job run on EMR AMI 3.x, I had to create a bootstrap action which deleted all earlier versions of the jar from the cluster to prevent them from being…
fblundun
  • 987
  • 7
  • 19
4
votes
1 answer

Error: undefined method "each" for String when running elastic-mapreduce specifying distributed cache file

I've got the following error: Error: undefined method `each' for "s3n://dico-count-words/Cache/dicoClazz.p#dicoClazzCache.p":String When I run the following command line to launch a mapreduce algorithm on Amazon EMR cluster via elastic-mapreduce,…
Garnieje
  • 286
  • 3
  • 7
4
votes
5 answers

How to run/install Oozie on an EMR cluster

I want to orchestrate my EMR jobs, so I thought Oozie would be a good fit. I have done some POCs on Oozie workflows, but only in local mode; it's fairly simple and great. But I don't understand how to use Oozie on an EMR cluster. Based on some searching I got to know…
sunil
  • 1,259
  • 1
  • 14
  • 27
4
votes
3 answers

Writing to a file in S3 from jar on EMR on AWS

Is there any way in which I can write to a file from my Java jar to an S3 folder where my reduce files would be written? I have tried something like: FileSystem fs = FileSystem.get(conf); FSDataOutputStream FS = fs.create(new Path("S3…
4
votes
1 answer

Elastic MapReduce timing out: java.io.IOException: Unexpected end of stream

I am running a MapReduce job on the Elastic MapReduce (EMR) service. The job works fine for a small data set but gives the following exception for a large data set (file size 400MB). Running another job with the same big input file works fine, though. Why so? Error:…
user93796
  • 18,749
  • 31
  • 94
  • 150
4
votes
1 answer

s3distcp srcPattern not working?

I have files like this in S3: 1-2013-08-22-22-something 2-2013-08-22-22-something etc. Without srcPattern I can get all of the files from the bucket easily, but I want to get a specific prefix, for example all of the 1's. I've tried using srcPattern…
Julian
  • 483
  • 1
  • 6
  • 17
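
A frequent gotcha here: --srcPattern is matched against the full path, not just the file name, so a pattern like 1-.* matches nothing while .*/1-.* does. A rough boto sketch of such an S3DistCp step; the bucket names, the s3distcp jar location and the job flow id are assumptions that may differ per region/AMI.

```python
# Sketch: S3DistCp step copying only keys whose name starts with "1-".
import boto
from boto.emr.step import JarStep

conn = boto.connect_emr()

step = JarStep(
    name='s3distcp only the 1- files',
    jar='s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar',
    step_args=[
        '--src', 's3://example-bucket/logs/',
        '--dest', 'hdfs:///input/',
        # srcPattern is applied to the whole path, hence the leading ".*/".
        '--srcPattern', '.*/1-.*',
    ],
)

conn.add_jobflow_steps('j-XXXXXXXXXXXXX', [step])
```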
4
votes
1 answer

ElasticMapReduce: Specified Availability Zone is not supported

I tried to use EMR in the Oregon region, so I used "us-west-2" as the availability zone in run_job_flow, and I got the following error: Error response for action RunJobFlow: Sender/ValidationError; Specified Availability Zone is not supported
kee
  • 10,969
  • 24
  • 107
  • 168
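
The likely cause: "us-west-2" names the Oregon region, while the availability zone parameter expects a zone such as "us-west-2a". A minimal boto sketch, assuming the Oregon endpoint and example instance settings; the exact zone letters available vary by account.

```python
# Sketch: connect to the Oregon (us-west-2) endpoint and pass an
# availability zone, not the region name.
import boto.emr

conn = boto.emr.connect_to_region('us-west-2')

jobflow_id = conn.run_jobflow(
    name='oregon job flow',
    log_uri='s3n://example-bucket/logs/',
    availability_zone='us-west-2a',  # a zone like "us-west-2a", not "us-west-2"
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
    num_instances=3,
    keep_alive=True,
)
print(jobflow_id)
```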
4
votes
1 answer

Specifying other user owned S3 buckets in EMR job flows

I am trying to use an S3 bucket as input data for my Elastic MapReduce job flow. The S3 bucket does not belong to the same account as the EMR job flow. How and where should I specify the credentials to access that S3 bucket? I…
4
votes
4 answers

Amazon Elastic MapReduce: issue listing job flows with the command line tools?

I'm new to Amazon Web Services. I'm trying to run job flows on Amazon Elastic MapReduce using the command line interface tools. I followed the steps in the AWS developer guide, but things are not getting clear to…
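
For comparison, the same listing can be done from Python with boto instead of the Ruby CLI; the set of states filtered on below is just an example.

```python
# Sketch: list job flows and their states via boto rather than the CLI.
import boto

conn = boto.connect_emr()

for flow in conn.describe_jobflows(states=['STARTING', 'RUNNING', 'WAITING']):
    print(flow.jobflowid, flow.state, flow.name)
```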