Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
2 votes
1 answer

Hadoop DistributedCache object changed during job

I'm trying to run KMeans on AWS, and I ran into the following exception when trying to read updated cluster centroids from the DistributedCache: java.io.IOException: The distributed cache object s3://mybucket/centroids_6/part-r-00009 changed during…
Magsol
2 votes
0 answers

Scalability issues with TemplateTap

I wrote a Cascading 1.2 program that does the following processing on data from a sensor network: Read CSV files with 3 columns: millisecond timestamp, event type (sensor data, battery level, or sensor power state), and event body. Round up the…
newToFlume
2 votes
1 answer

Write 100 million files to S3

My main aim is to split out records into files according to the ids of each record, and there are over 15 billion records right now which can certainly increase. I need a scalable solution using Amazon EMR. I have already got this done for a smaller…
Amar
2 votes
1 answer

EMR - Leverage using spot instances

I know that we can bid on spot instances and get them at lower prices than regular instances, but with spot instances there is the risk of your instances being taken back. I want to know whether there is any way we can ensure that they are…
user1804287
2 votes
4 answers

Best way to have a fast access key-value storage for huge dataset (5 GB)

I have a dataset of ~5 GB with one key-value pair per line. It needs to be queried for the values of keys on the order of a billion times. I have already tried the disk-based approach of MapDB, but it throws ConcurrentModification…
Amar
2 votes
1 answer

Error while connecting with the Elastic MapReduce Ruby client

I am following the steps mentioned on AWS to use an interactive Hive session over SSH. I used the following resources…
asquare
2 votes
1 answer

How to decide on the number of parallel mappers/reducers along with heap memory?

Say I have an EMR job running on an 11-node cluster: one m1.small master node and 10 m1.xlarge slave nodes. Each m1.xlarge node has 15 GB of RAM. How do I then decide on the number of parallel mappers and reducers that can be set? My jobs are memory…
Amar
2 votes
1 answer

How Can I Automate Running Pig Batch Jobs on Elastic MapReduce without Amazon GUI?

I have some Pig batch jobs in .pig files I'd love to automatically run on EMR once every hour or so. I found a tutorial for doing that here, but that requires using Amazon's GUI for every job I set up, which I'd really rather avoid. Is there a good…
Eli
2 votes
1 answer

Why does the Amazon .NET SDK not see any job flows?

My company has grown weary of constantly using the AWS console to set up our MapReduce clusters and needs more configurability than the console provides. I'm using the .NET AWS SDK to write a simple application that allows us to create and control…
Chris Phillips
2 votes
1 answer

Importing compressed (LZO) data from S3 to Hive

I export my DynamoDB tables to S3 as a means of backup (via EMR). When I export, I store the data as LZO-compressed files. My Hive query is below, but essentially I followed the "To export an Amazon DynamoDB table to an Amazon S3 bucket using data…
rynop
2 votes
3 answers

Amazon Elastic MapReduce: Output directory

I'm running through Amazon's example of running Elastic MapReduce and keep getting hit with the following error: Error launching job , Output path already exists. Here is the command to run the job that I am…
2 votes
1 answer

Join performance on AWS Elastic MapReduce running Hive

I am running a simple join query: select count(*) from t1 join t2 on t1.sno=t2.sno. Tables t1 and t2 each have 20 million records, and column sno is of string data type. The table data is imported into HDFS from Amazon S3 in RCFile format. The…
Ahmad Osama
2 votes
1 answer

AWS Elastic Map Reduce: output to SimpleDB

What is the most efficient way to get Elastic Map Reduce output into SimpleDB? I'm aware that I could just output the results to S3, download them, and have a script parse the results and insert into SimpleDB. But is there an easier/faster way…
Suman
2 votes
2 answers

File not caching on AWS Elastic MapReduce

I'm running the following MapReduce on AWS Elastic MapReduce: ./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE --mapper s3://classify.mysite.com/mapper.py --reducer s3://classify.mysite.com/reducer.py --input …
Ben G
2 votes
1 answer

Why did the Elastic MapReduce job flow fail in AWS MapReduce?

I created a Contextual Advertising (Hive Script) job flow in AWS MapReduce: I chose 'Start interactive Hive Session', selected m1.small instances, and proceeded without a VPC subnet ID and Configure Hadoop in Configure Bootstrap…
Advait