Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
0
votes
1 answer

Python program in AWS Elastic MapReduce fails in step execution

I'm trying to start a Python program in Elastic MapReduce Step Execution. It is a Spark application with the following parameters: Deploy-mode: Cluster; Spark-submit options: --executor-memory 1g; Application location:…
0
votes
0 answers

ElasticSearch Performance optimization

I have a single node dedicated to the ES server. We have indexed around 25 GB of data in it. I am using a bool query to fetch data, and it takes around 6-7 minutes to return the results. The RAM on the node is just 1 GB. I understand that the RAM and other…
0
votes
1 answer

How to know how many keys a map-reduce job processed?

How can a map-reduce job generate metrics about how many keys it has processed and report data such as the following: the % of keys that belonged to a particular value?
adarshhsingh
  • 61
  • 1
  • 1
  • 6
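
For illustration, a minimal local sketch of the metric described in this question: the number of distinct keys and the percentage of keys that map to each value. It is plain Python, not Hadoop code; the function name `value_shares` and the sample pairs are hypothetical, and it assumes each key carries a single value (the last one seen wins).

```python
from collections import Counter

def value_shares(pairs):
    """Return (distinct key count, {value: percent of keys}) for (key, value) pairs."""
    # Assumption: one value per key; a later pair overwrites an earlier one.
    key_to_value = dict(pairs)
    total = len(key_to_value)
    counts = Counter(key_to_value.values())
    # Percentage of distinct keys that belong to each value.
    shares = {value: 100.0 * n / total for value, n in counts.items()} if total else {}
    return total, shares

# Hypothetical sample data.
total, shares = value_shares([("k1", "a"), ("k2", "a"), ("k3", "b"), ("k4", "a")])
print(total, shares)
```

In an actual MapReduce job the same totals would have to be accumulated across tasks, for example with job counters, rather than in a single in-memory dictionary.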
0
votes
1 answer

Visitor / User profiling based on clickstream data?

We built a Rails 4 site and use ES for our travel/accommodation search engine. We created a separate ES index for clickstream data, and we store data for non-logged-in (session_id) and logged-in (user_id) users. We now use the stored data to show viewed and…
0
votes
1 answer

MapReduce with filename as Key, contents as Values, many small files

I've looked at FileInputFormat where filename is KEY and text contents are VALUE, How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?, and Getting Filename/FileData as key/value input for Map when…
kcmgrew
  • 21
  • 3
0
votes
1 answer

Take a sample of a file from AWS S3 and put it in another location in S3

It is always possible to use s3distcp to copy a file (or set of files) to another location in S3, but is it possible, using mapred or any other functionality of Hadoop/EMR, to take a random sample (or every nth line) of the file(s) to a new location…
Kuber
  • 1,023
  • 12
  • 21
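
For illustration, a minimal sketch of the two sampling strategies this question mentions (every nth line, and a random sample), written as plain Python generators over an iterable of lines. The function names are hypothetical, and reading from and writing to S3 is deliberately left out; the same per-line logic could serve as the body of a map-only step.

```python
import random

def every_nth(lines, n):
    """Yield lines 0, n, 2n, ... from an iterable of lines."""
    for index, line in enumerate(lines):
        if index % n == 0:
            yield line

def random_sample(lines, fraction, seed=None):
    """Yield each line independently with probability `fraction`."""
    rng = random.Random(seed)  # seed only makes runs repeatable
    for line in lines:
        if rng.random() < fraction:
            yield line

# Hypothetical input lines.
sample_lines = ["a", "b", "c", "d", "e"]
print(list(every_nth(sample_lines, 2)))
```

Because each line is decided independently, `random_sample` returns roughly, not exactly, the requested fraction of lines.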
0
votes
2 answers

Hadoop MapReduce Out of Memory on Small Files

I'm running a MapReduce job against about 3 million small files on Hadoop (I know, I know, but there's nothing we can do about it - it's the nature of our source system). Our code is nothing special - it uses CombineFileInputFormat to wrap a bunch…
0
votes
1 answer

What is the minimal set of outbound rules required of the master/slave security groups for an EMR cluster?

I'm trying to secure a pipeline for analyzing controlled-access genomic data with Amazon Elastic MapReduce (EMR), and it would help to know the minimal set of outbound rules required of the master and slave security groups of an EMR cluster. I'm…
verve
  • 775
  • 1
  • 9
  • 21
0
votes
1 answer

PDI jobs not seen as Mapreduce jobs in Resource Manager or Job History server

I am using Pentaho 5.4 and EMR 3.4. When I execute a transformation in Pentaho to copy data from an Oracle DB to HDFS on EMR, I don't see any MR jobs in the Resource Manager of the Hadoop (EMR) cluster. Am I supposed to see them as MR jobs or Pentaho just…
0
votes
1 answer

Can an EMR cluster be launched into a private VPC subnet with no public IPs that accesses the internet through a NAT instance in a public subnet?

Is it possible to launch an EMR cluster into the private subnet of a scenario-2 VPC (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html) where a NAT instance is in the public subnet, and where each instance in the private…
verve
  • 775
  • 1
  • 9
  • 21
0
votes
1 answer

DynamoDB schema for referral data

I want to try out DynamoDB and use it for access.logs generated by nginx, which will later be used for a reporting dashboard that will include IP, referral URL, referral domain, browser, etc. The initial setup will be EC2 instances running nginx…
dzm
  • 22,844
  • 47
  • 146
  • 226
0
votes
0 answers

HADOOP HIVE mr.MapredLocalTask (MapredLocalTask.java:execute(276)) - Execution failed with exit status: 137

I'm trying to run a job in Hive on a cluster (1 master, 4 core nodes [11.25 GB each]) in AWS EMR. I'm joining (map joining) two tables, one with 0.3 million entries (~11 MB) and another with almost 7 million entries (took care that big table should be…
jeevan sirela
  • 23
  • 1
  • 7
0
votes
0 answers

Load hive tables from multiple mappers

I am working on a problem where I have a large number of small compressed text files. Each file is approximately 10-20 KB, and there are TBs of data in total. I need to load these files into Hive. Later, Tableau will use the Hive tables for its report generation. I am…
Ajay
  • 783
  • 3
  • 16
  • 37
0
votes
1 answer

Amazon ElasticMapReduce(EMR) controlling split size / num of mappers

How can I change this configuration? For my application, a split size of 64/128 MB is too large, and I would like a split size of, for example, 16 MB. How can I do it?
member555
  • 797
  • 1
  • 13
  • 40
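
For illustration, a minimal sketch of how split size relates to the number of mappers, assuming a `FileInputFormat`-based input with splittable files. In Hadoop 2 the split size is derived as max(minSize, min(maxSize, blockSize)), where the bounds come from `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`. The Python function names and the 1 GB file are hypothetical, and the effective block size for S3 input on EMR may differ from the values used here.

```python
SPLIT_SLOP = 1.1  # Hadoop lets the last split be up to 10% larger than the split size
MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    """Mirror the FileInputFormat rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size):
    """Count the splits (roughly, map tasks) for one splittable file."""
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the final, possibly slightly oversized, split
    return splits

# Hypothetical example: a 1 GB file with a 128 MB block size.
default_split = compute_split_size(128 * MB, 1, 128 * MB)
small_split = compute_split_size(128 * MB, 1, 16 * MB)  # max split size capped at 16 MB
print(count_splits(1024 * MB, default_split), count_splits(1024 * MB, small_split))
```

Under these assumptions, lowering the maximum split size to 16 MB (16777216 bytes) raises the split count for the same file, which is the effect the question asks for.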
0
votes
0 answers

Hadoop - Directory Structure and Distributed Cache

Imagine a situation where I have multiple jobs executing concurrently in a Hadoop cluster. These jobs are using the Distributed Cache. Each of them uses different files, but with the same name. (I am using the ToolInterface to distribute these…
p.magalhaes
  • 7,595
  • 10
  • 53
  • 108