Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
3
votes
2 answers

Amazon Elastic MapReduce Bootstrap Actions not working

I have tried the following combinations of bootstrap actions to increase the heap size of my job but none of them seem to work: --mapred-key-value mapred.child.java.opts=-Xmx1024m --mapred-key-value mapred.child.ulimit=unlimited --mapred-key-value…
2
votes
2 answers

How to parse freebase quad dump using Amazon mapreduce

I'm trying to extract movie information from Freebase. I just need the name of the movie and the names and IDs of the director and the actors. I found it hard to do so using Freebase's topic dumps, because there is no reference to the director ID, just…
Jaroušek Puchlivec
2
votes
0 answers

OutOfMemory error when running full-scale hadoop job

I'm running a hadoop job on Amazon Elastic MapReduce and I keep getting an OutOfMemory error. The values are admittedly a little bit larger than most MapReduce values, but it seems even when I decrease the size dramatically it still happens. Here's…
dspyz
2
votes
1 answer

getting data out of hive and into mysql @ AWS?

I'd love to use Sqoop but don't think it is worth running the Cloudera stack @ AWS over ElasticMapReduce (which I really like) just for this. My current thought is just to write the data I need moved to an external table housed @ S3 and then write…
2
votes
1 answer

EC2 Job Flow Failure

I have a jar file MapReduce that I'd like to run on s3. It takes two args, an input dir and an output file. So I tried the following command using the elastic-mapreduce ruby cmd line tool: elastic-mapreduce -j j-JOBFLOW --jar…
user592419
2
votes
1 answer

Has anybody created a job with multiple inputs using the ruby client for Amazon's Elastic Map Reduce?

Through the UI Amazon's framework allows me to create jobs with multiple inputs by specifying multiple --input lines. e.g.: -input s3n://something -input s3n://something-else Similarly the Ruby EMR client has been very helpful to me so…
henry
2
votes
1 answer

Streaming Command Failed! error when using Elastic Map Reduce/S3 and R

I'm following this example here hoping to successfully run something using EC2/S3/EMR/R. https://gist.github.com/406824 The job fails on the Streaming Step. Here are the error logs: controller: 2011-07-21T19:14:27.711Z INFO Fetching jar…
tcash21
2
votes
1 answer

Amazon Elastic MapReduce - Format or Examples for python map and reduce code

Maybe it is the same as Hadoop, but I just couldn't find the format or an example for writing the map and reduce Python code, besides the map example here: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/ but I couldn't…
Alon Gutman
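
For readers landing on the question above, here is a minimal sketch of the usual shape of Python map and reduce code, assuming the job runs through Hadoop Streaming (records pass between stages as tab-separated key/value lines, and the reducer receives its input sorted by key). The word-count task and the function names `map_lines` and `reduce_lines` are illustrative assumptions, not taken from the question or from the EMR guide.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each key.

    Assumes the input is already sorted by key, so all records
    for one word arrive consecutively.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In a real streaming job, each function would live in its own script that reads `sys.stdin` and prints every yielded record to standard output. Locally, `sorted(map_lines(...))` stands in for the shuffle-and-sort step between the two stages.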
2
votes
1 answer

How is input data distributed across nodes for EMR [using MRJob]?

I'm looking into using Yelp's MRJob to compute using Amazon's Elastic Map Reduce. I will need to read and write a large amount of data during the computationally intensive job. Each node should only get a part of the data, and I'm confused about…
2
votes
2 answers

Spark possible race condition in driver

I have a Spark job that processes several folders on S3 per run and stores its state on DynamoDB. In other words, we're running the job once per day, it looks for new folders added by another job, transforms them one-by-one and writes state to…
2
votes
1 answer

IllegalAccessError when running spark job in EMR

I am attempting to run a Spark job that accesses DynamoDB. The old way of instantiating a DynamoDB client has been deprecated, and it is now recommended to use the client builder. This works fine locally, but when I deploy to EMR I'm…
2
votes
1 answer

Unable to read sequence file from distributed cache in EMR

I am trying to read a sequence file from the distributed cache in EMR, but it is unable to read the file. My code works fine locally, but it fails on EMR. Here is my code snippet: Putting sequence file to distributed…
2
votes
1 answer

Hadoop process WARC files

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set up for optimal performance. My project is currently processing WARC files which are gzipped. Using the current…
2
votes
1 answer

How to create an EMR cluster using AWS SDK for Go

I want to create EMR clusters using AWS SDK for Go, but I can't find a way in the official documentation (Package: emr, AWS SDK for Go). Could you please help me with detailed code?
NSR
2
votes
0 answers

What is causing "org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: null"?

I have an Elastic MapReduce job which uses elasticsearch-hadoop via scalding-taps to transfer data from Amazon S3 to Amazon Elasticsearch Service. For a long time this job ran successfully. However, it has recently started failing with the following…