Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
1 answer

Elasticsearch-Hadoop get Non-indexed data

I have an Elasticsearch cluster that holds a large amount of data. I want to extract all of the data from Elasticsearch into Hadoop (Hive). I used the Elasticsearch-Hadoop driver to extract data from Elasticsearch through a Hive external table, but it is too…
1
vote
1 answer

Run an action in a bootstrap script after ResourceManager has started

I am starting an AWS EMR cluster using the AWS CLI tools. I have a bootstrap action that runs on the master and executes the following command: hdfs dfs -put /home/hadoop/X.tar.gz / However, I get the following error: put: Call From…
Sapsi
  • 711
  • 5
  • 16
1
vote
1 answer

If the first reduce attempt fails (network connection issues), subsequent reduce attempts (retries) fail because the output file already exists

I have MapReduce jobs failing badly on Amazon EMR: if the first attempt fails to copy results to S3, a (probably partial) file is still created, and subsequent reduce attempts refuse to write to a file that already exists. The first attempt…
SQL.injection
  • 2,607
  • 5
  • 20
  • 37
1
vote
1 answer

Where to access EMR counters for a terminated or running cluster

I'm running a job flow on Elastic MapReduce that terminates after completing all steps. How can I access the custom counters of each mapper or reducer after the cluster is killed? (Maybe somewhere on S3 with the logs, if at all.) How can I access…
eran
  • 14,496
  • 34
  • 98
  • 144
1
vote
2 answers

Problems while creating a Hadoop client on my local machine

I have a NameNode and DataNodes running on AWS. I configured FoxyProxy and verified that the following are working: Ganglia Metrics Reports master-public-dns/ganglia/ Hadoop ResourceManager master-public-dns-name:9026 Hadoop NameNode …
1
vote
1 answer

Amazon Web Service EMR FileSystem

I am trying to run a job on an AWS EMR cluster. The problem I'm getting is the following: aws java.io.IOException: No FileSystem for scheme: hdfs. I don't know where exactly the problem lies (in my Java job JAR or in the job's configuration). In…
1
vote
1 answer

AWS - How can I add an EMR step from the current step

I have an EMR cluster that runs a single step, a custom JAR. I need to create a second step from the first step at runtime. How can I do it? I know I can do it using the CLI, but how can I accomplish it using Java? Thanks
Eitan Illuz
  • 323
  • 2
  • 7
1
vote
1 answer

Number of concurrently running mappers per node drops precipitously on Elastic MapReduce w/ AMI 3.1.0 and Hadoop 2.4.0 as cluster size increases

In a related question (How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce), I ask for formulas relating the number of concurrently running mappers/reducers to YARN and MR2 memory parameters.…
1
vote
2 answers

Running a simple Hadoop command using Java code

I would like to list files using the Hadoop command "hadoop fs -ls filepath". I want to write Java code to achieve this. Can I write a small piece of Java code, package it as a JAR, and supply it to a MapReduce job (Amazon EMR) to achieve this? Can you…
1
vote
2 answers

How to import local Python package in Amazon Elastic MapReduce (EMR)?

I have two Python scripts that are intended to run on Amazon Elastic MapReduce - one as a mapper and one as a reducer. I've just recently expanded the mapper script to require a couple more local models that I've created that both live in a package…
1
vote
1 answer

Region error when launching EMR cluster

I'm following this tutorial, https://aws.amazon.com/articles/4926593393724923, to create and launch a simple Spark cluster. I'm interested in using Spark Streaming and Kinesis, so I created a role with the following policy: { "Version": "2012-10-17", …
franklynd
  • 1,850
  • 3
  • 13
  • 11
1
vote
2 answers

Hadoop on EMR - Map Tasks Not Parallel

I've set up an EMR job through Data Pipeline in AWS. The job transfers CSV data from S3 to DynamoDB. My data size is 400 MB. I set mapred.max.split.size = 134217728 (i.e. 128 MB). With that, I'm able to see in the monitoring graph that there are 3…
Mouli
  • 1,621
  • 15
  • 20
1
vote
1 answer

How is data distributed among datanodes in MapReduce?

I'm new to MapReduce, and my task is to process a large data set (lines of records). One thing I need in my mapper is the line number of each specific record; the reducer then processes the line-number information emitted by the mapper. For instance,…
i3wangyi
  • 2,279
  • 3
  • 15
  • 12
1
vote
1 answer

Copying a large file (~6 GB) from S3 to every node of an Elastic MapReduce cluster

Turns out that copying a large file (~6 GB) from S3 to every node in an Elastic MapReduce cluster in a bootstrap action doesn't scale well; the pipe is only so big, and downloads to the nodes get throttled as the number of nodes gets large. I'm running a job…
1
vote
1 answer

"Access Denied" error using segue package in R

I suspect this is a very basic fix but I don't know what it is. setCredentials(awsAccessKeyText = 'myaccesskey', awsSecretKeyText = 'mysecretkey') myCluster <- createCluster(numInstances = 2) Error in .jcall("RJavaTools",…
Nan
  • 446
  • 4
  • 14