Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
1 answer

EMR & Spark: adding dependencies after cluster creation

Is it possible to install additional libs/dependencies after the cluster is already up and running? Things I've done that relate to this: I've already used the pre-creation bootstrapping process (this is a different solution…
Kristian
  • 21,204
  • 19
  • 101
  • 176
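One hedged answer to the question above is to submit the install command as a step to the already-running cluster. The sketch below assumes boto3, a placeholder cluster ID, and placeholder package names; note that a command-runner step like this runs only on the master node, so core/task nodes would still need the same install (for example via SSH or a custom AMI).

```python
# Hypothetical sketch: pip-install extra Python libraries on the master node of
# a cluster that is already running, by submitting a shell command as an EMR step.
# Cluster ID and package names are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",            # the running cluster's ID
    Steps=[
        {
            "Name": "install-extra-python-deps",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # available on EMR 4.x+ release labels
                "Args": ["sudo", "pip", "install", "requests", "numpy"],
            },
        }
    ],
)
print(response["StepIds"])
```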
1
vote
1 answer

Elastic MapReduce and Amazon S3: error regarding access keys

I am new to Amazon EMR and Hadoop in general. I am currently trying to set up a Pig job on an EMR cluster and to import and export data from S3. I have set up a bucket in S3 named "datastackexchange" containing my data. In an attempt to begin to copy the…
Maeve90
  • 345
  • 1
  • 6
  • 14
1
vote
0 answers

AWS EMR - Python path, git repo and scripts

I am running MapReduce jobs on Hive and most of the code already resides in a git repo. I know I can include instructions in the bootstrap script when spinning up clusters, but is it possible to do all of these things: adjust the Python path in…
intl
  • 2,753
  • 9
  • 45
  • 71
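For the question above, the usual place to clone a repo and adjust the Python path on every node is a bootstrap action supplied at cluster launch. The sketch below is a minimal boto3 example under that assumption; the S3 script path, bucket, key name, and release label are placeholders, and the referenced setup_repo.sh would contain the actual git clone and PYTHONPATH export.

```python
# Hypothetical sketch: launch an EMR cluster whose bootstrap action (a shell
# script staged in S3) clones a git repo and adjusts PYTHONPATH on every node.
# All names/paths below are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="hive-jobs-with-repo",
    ReleaseLabel="emr-4.7.0",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key",
    },
    BootstrapActions=[
        {
            "Name": "clone-repo-and-set-pythonpath",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/setup_repo.sh",
            },
        }
    ],
    Applications=[{"Name": "Hive"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```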
1
vote
0 answers

Elasticsearch match_phrase not giving deterministic results

I have defined mapping in following way. PUT _template/name "mappings": { "_default_": { "name": { "type": "string", "analyzer" : "synonyms_expand", "index" : "analyzed", …
1
vote
1 answer

How to prevent a Hadoop job from failing due to a failed reduce task

I am running an s3distcp job on AWS EMR (Hadoop 2.2.0), and the job keeps failing with a failed reducer task after 3 attempts. I also tried setting both mapred.max.reduce.failures.percent and mapreduce.reduce.failures.maxpercent to 50 in the oozie…
user3285517
  • 11
  • 1
  • 4
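One hedged way to apply the two properties from the question above is to pass them as generic -D options on the s3distcp step itself, assuming s3distcp honors Hadoop's generic options. The jar path, cluster ID, and source/destination locations below are placeholders for an AMI 3.x / Hadoop 2.2.0 cluster.

```python
# Hypothetical sketch: resubmit the s3distcp copy as an EMR step with the
# reduce-failure tolerance properties passed as -D options. All paths and the
# cluster ID are placeholders; this assumes s3distcp accepts generic -D flags.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "s3distcp-with-failure-tolerance",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "/home/hadoop/lib/emr-s3distcp-1.0.jar",
                "Args": [
                    "-D", "mapreduce.reduce.failures.maxpercent=50",
                    "-D", "mapred.max.reduce.failures.percent=50",
                    "--src", "hdfs:///output/",
                    "--dest", "s3://my-bucket/output/",
                ],
            },
        }
    ],
)
```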
1
vote
0 answers

Running the temperature example on EMR hadoop cluster using Hadoop Development tools on eclipse

I'm still a newbie in Hadoop. I'm trying to run the common temperature example on an Amazon EMR Hadoop 2.6.0 cluster using Hadoop Development Tools in Eclipse. I'm connecting through an SSH tunnel and don't have connection problems so far since I…
Learner
  • 60
  • 12
1
vote
2 answers

R: replacing double-escaped text

I'm gluing together a number of system calls using the Amazon Elastic Map Reduce command line tools. These commands return JSON text which has already been (partially?) escaped. Then when the system call turns it into an R text object (intern=T) it…
JD Long
  • 59,675
  • 58
  • 202
  • 294
1
vote
1 answer

fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey are not set for EMR default IAM roles

One of my EMR jobs relies on getting the AWS access key ID and secret access key from the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties, respectively. However, when I run an EMR cluster using the default EC2 and EMR roles, those…
Kiet Tran
  • 1,458
  • 2
  • 13
  • 22
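For the question above, a likely explanation is that with the default EC2 instance profile the cluster uses temporary role credentials rather than populating those two properties. The sketch below is not the questioner's code; it only shows, under that assumption, how code on a node can let boto3 resolve the same instance-profile credentials, and why copying them into the fs.s3.* properties is fragile.

```python
# Hypothetical sketch: on an EMR node launched with the default EC2 instance
# profile, let boto3 resolve the temporary role credentials from instance
# metadata instead of expecting fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey
# to be set.
import boto3

session = boto3.Session()                    # picks up instance-profile credentials
creds = session.get_credentials().get_frozen_credentials()

print("access key:", creds.access_key)
print("has session token:", bool(creds.token))  # temporary role creds include a token

# Note: these credentials rotate, so hard-coding them into the fs.s3.* properties
# is brittle; letting EMRFS use the role directly (the default) is usually safer.
```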
1
vote
0 answers

Move data from HDFS to RDS directly

Background: I am working on a web project to expose analytical data stored in a local MSSQL database. The database is updated regularly. An EMR cluster is responsible for using custom Hive scripts to process raw data from S3 and save the analytical…
Tzu
  • 235
  • 1
  • 9
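One common route for the question above is a Sqoop export submitted as an EMR step. The sketch below is only illustrative: the JDBC URL, credentials, table name, and HDFS export directory are placeholders, and it assumes Sqoop plus the matching JDBC driver are already installed on the cluster.

```python
# Hypothetical sketch: export a Hive result directory from HDFS into an RDS table
# via a Sqoop export step. Connection string, credentials, table and paths are
# placeholders; Sqoop must already be available on the cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "sqoop-export-to-rds",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "sqoop", "export",
                    "--connect", "jdbc:mysql://my-rds-endpoint:3306/analytics",
                    "--username", "dbuser", "--password", "dbpass",
                    "--table", "daily_metrics",
                    "--export-dir", "/user/hive/warehouse/daily_metrics",
                ],
            },
        }
    ],
)
```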
1
vote
2 answers

Processing a HUGE number of small files independently

The task is to process a HUGE number (around 10,000,000) of small files (each around 1 MB) independently (i.e. the result of processing file F1 is independent of the result of processing F2). Someone suggested Map-Reduce (on Amazon EMR Hadoop) for my…
Daniel
  • 5,839
  • 9
  • 46
  • 85
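A frequent answer to the small-files problem above is to batch the tiny inputs into larger objects before running MapReduce, so each mapper gets a full-sized split rather than one 1 MB file. The sketch below is one hedged way to do that with boto3; the bucket, prefixes, and 128 MB target size are placeholders.

```python
# Hypothetical sketch: concatenate ~1 MB S3 objects into larger batch files
# before the MapReduce job. Bucket/prefix names and the target size are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET, SRC_PREFIX, DST_PREFIX = "my-bucket", "small-files/", "batched/"
TARGET_BYTES = 128 * 1024 * 1024

batch, batch_size, batch_idx = [], 0, 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SRC_PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        batch.append(body)
        batch_size += len(body)
        if batch_size >= TARGET_BYTES:
            s3.put_object(Bucket=BUCKET,
                          Key=f"{DST_PREFIX}batch-{batch_idx:05d}",
                          Body=b"\n".join(batch))
            batch, batch_size, batch_idx = [], 0, batch_idx + 1

if batch:  # flush the final partial batch
    s3.put_object(Bucket=BUCKET, Key=f"{DST_PREFIX}batch-{batch_idx:05d}",
                  Body=b"\n".join(batch))
```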
1
vote
1 answer

Small files with Map Reduce or multi threading/multi processing

I have a batch of 500 files, each around 45 KB. Each file requires around 87,840 calculations (ARIMA regression problems). Each calculation is independent in itself. Given this, what is the best approach to develop a solution for such a…
NightOwl85
  • 161
  • 1
  • 1
  • 7
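Given the modest data volume in the question above (500 files of ~45 KB), a single machine with a process pool is often a simpler fit than a cluster. The sketch below is a minimal multiprocessing example under that assumption; process_file() and its ARIMA step are placeholders for the real per-file work.

```python
# Hypothetical sketch: process ~500 small files on one machine with a process
# pool instead of MapReduce. The per-file work (the independent ARIMA
# regressions) is a placeholder.
import glob
from multiprocessing import Pool

def process_file(path):
    with open(path) as fh:
        series = [float(line.strip()) for line in fh if line.strip()]
    # ... run the independent ARIMA regressions on `series` here ...
    return path, len(series)

if __name__ == "__main__":
    files = glob.glob("data/*.csv")      # the batch of ~500 x 45 KB files
    with Pool() as pool:                 # one worker per CPU core by default
        results = pool.map(process_file, files)
    print(f"processed {len(results)} files")
```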
1
vote
1 answer

MRUnit Example for MultipleOutputs

I have written a map-only Hadoop job that uses the MultipleOutputs concept. The problem is that I want to test this code with MRUnit, and I don't see any working example of MultipleOutputs testing. My mapper code looks like: public void…
1
vote
1 answer

How to pass arguments to streaming job on Amazon EMR

I want to produce the output of my map function, filtering the data by dates. In local tests, I simply call the application passing the dates as parameters: cat access_log | ./mapper.py 20/12/2014 31/12/2014 | ./reducer.py Then the parameters…
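One hedged approach to the question above is to pass the dates inside the -mapper string of the streaming step (e.g. -mapper "mapper.py 20/12/2014 31/12/2014" together with -files mapper.py), so the mapper sees the same argv as in the local pipe. The sketch below assumes that setup; the log-line parsing is a placeholder, not the questioner's actual format.

```python
#!/usr/bin/env python
# Hypothetical mapper.py sketch: read the start/end dates from argv, exactly as
# in the local test pipeline, and emit only the stdin lines within that range.
# The log-parsing details are placeholder assumptions.
import sys
from datetime import datetime

def parse(d):
    return datetime.strptime(d, "%d/%m/%Y")

start, end = parse(sys.argv[1]), parse(sys.argv[2])

for line in sys.stdin:
    fields = line.split()
    try:
        # placeholder: assume the 4th field looks like [20/12/2014:... ]
        ts = parse(fields[3].lstrip("[").split(":")[0])
    except (IndexError, ValueError):
        continue
    if start <= ts <= end:
        print(line.rstrip("\n"))
```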
1
vote
1 answer

Elasticsearch query strategy for nested array elements

I am trying to find results by color. In the database, it is recorded in RGB format: an array of three numbers representing red, green, and blue values respectively. Here is how it is stored in the DB and Elasticsearch record (storing 4 RGB colors…
diego
  • 123
  • 2
  • 14
1
vote
0 answers

How to compare values from 2 indices of an Elasticsearch store in Kibana

I have 2 indices in my Elasticsearch store. Now in a Kibana visualisation, I have to get a count of docs in index B where "col_A" of index A is equal to "col_H" of index B. Is this possible in Kibana? If so, please help me with the queries. TIA…
h4it
  • 33
  • 1
  • 5