Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
2
votes
1 answer

Anyone using DynamoDB and Hive without using EMR?

I was reading the post below about using Hive to query data in DynamoDB: http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html According to that link, Hive needs to be set up on top of EMR. But I wanted…
Arvind
  • 697
  • 1
  • 9
  • 20
1
vote
2 answers

How can we pass arguments for Hadoop Streaming from AWS SDK for PHP?

I'm trying to add some jobs via the AWS SDK for PHP. I'm able to successfully start a cluster and start a new job flow via the API, but I'm getting an error while trying to create a Hadoop Streaming step. Here is my code: // add some jobflow steps $response =…
1
vote
3 answers

How to run a MapReduce job on Amazon's Elastic MapReduce (EMR) cluster from Windows?

I'm trying to learn how to run a Java Map/Reduce (M/R) job on Amazon's EMR. The documentation that I am following is here: http://aws.amazon.com/articles/3938. I am on a Windows 7 computer. When I try to run this command, I am shown the help…
Jane Wayne
  • 535
  • 9
  • 21
1
vote
1 answer

API calls inside mapreduce job

I would like to ask about the drawbacks of calling an external API while running a MapReduce job. Some examples: if inside the mapper we need to geocode an address and we call a Google Maps API, or calling an…
Fgblanch
  • 5,195
  • 8
  • 37
  • 51
1
vote
1 answer

Calling a compiled binary on Amazon MapReduce

I'm trying to do some data analysis on Amazon Elastic MapReduce. The mapper step is a python script which includes a call to a compiled C++ binary called "./formatData". For example: # myMapper.py from subprocess import * inputData =…
tba
  • 6,229
  • 8
  • 43
  • 63
1
vote
3 answers

Does Amazon Elastic Map Reduce run one or several mapper processes per instance?

My question is: should I handle multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process, and output to stdout), or will Hadoop take care of it automatically? I…
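
For readers unfamiliar with the stdin-to-stdout contract this question refers to, here is a minimal single-process sketch of a Hadoop Streaming style mapper in Python. The word-count logic and the names `map_lines` and `run` are illustrative assumptions, not taken from the question, and the sketch does not address how many mapper processes run per instance.

```python
import io
import sys


def map_lines(lines):
    """Yield one 'word<TAB>1' record per whitespace-delimited token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read input lines from stdin and write one mapped record per line to stdout."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")


# Demo with in-memory streams (hypothetical sample input).
demo_out = io.StringIO()
run(io.StringIO("a b\na\n"), demo_out)
```

Used as an actual streaming mapper, the script would call `run()` with no arguments so that it reads the real standard input and writes to the real standard output.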
1
vote
1 answer

How to set the number of MapReduce tasks equal to 1 in Hive

I tried the following in Hive: set hive.exec.reducers.max = 1; set mapred.reduce.tasks = 1; from flat_json insert overwrite table aggr_pgm_measure PARTITION(dt='${START_TIME}') reduce log_time, req_id, ac_id, client_key, rulename, categoryname, bsid,…
Anurag Saxena
  • 11
  • 1
  • 3
1
vote
1 answer

Amazon MapReduce input splitting and downloading

I'm new to EMR and have a few questions I have been struggling with for the past few days. The first is that the logs I want to process are already compressed as .gz, and I was wondering whether these types of files can be split by EMR so…
Brian
  • 45
  • 7
1
vote
2 answers

Exploring Hadoop code

I want to understand Hadoop as more than a black box, so I want to explore the Hadoop code itself. How can I download the bundle not from the trunk, and where should I start? Any help would be appreciated. Thanks, Shujaat
shujaat
  • 279
  • 6
  • 17
1
vote
0 answers

What are some good measurement comparisons to be done using Ganglia metrics for Amazon Elastic Mapreduce programs?

I have seen Ganglia monitoring being implemented and analyzed on grid computing projects, but haven't read about any procedure for Amazon Elastic Mapreduce programs. Ganglia has a lot of metrics, but what are the important ones to focus on if we…
1
vote
3 answers

Error SSHing to Elastic MapReduce JobFlow on AWS

When following the tutorial instructions for connecting to my JobFlow in EMR, I type the following: ./elastic-mapreduce --jobflow j-3FLVMX9CYE5L6 --ssh and get this error: Permission denied (publickey) I'm already able to run other elastic-mapreduce…
Trindaz
  • 17,029
  • 21
  • 82
  • 111
1
vote
3 answers

POST Hadoop Pig output to a URL as JSON data?

I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL. Some notes: This job is running on Amazon Elastic MapReduce. I can use a…
emk
  • 60,150
  • 6
  • 45
  • 50
1
vote
0 answers

EMR cluster running slow

I was running a MapReduce Hadoop job on Amazon EMR 5.5.2, which uses Hadoop 2.7.3. I recently upgraded EMR to 5.12.1, which uses Hadoop 2.8.0. For the same input load, my new cluster is running much more slowly by comparison. I am not able to find out the…
1
vote
1 answer

How to configure AWS EMR to use S3 as HDFS storage

I am trying to create an EMR cluster with the configuration below, but it is failing in the bootstrap stage. The EMR release I am using is EMR 5.13.0. [ { "Classification": "core-site", "Properties": { "fs.defaultFS": "s3://my-s3-bucket", …
Utk787
  • 81
  • 8
1
vote
1 answer

AWS EMR: Is it possible to re-use a terminated cluster?

I created a cluster, finished my job, and then terminated the cluster. Is it possible to re-use this terminated cluster in the future? If not, is there any way to delete terminated clusters?