Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
0 answers

How to send JARs to EMR hosts using the AWS Java SDK?

Is there any other way to send my jar to EMR via the Java SDK? Or are the only options to scp it or to upload the jar to S3 and call it from there?
Julian
  • 483
  • 1
  • 6
  • 17
1
vote
1 answer

Running Map Reduce on a data set of around 10 GB on AWS

I want to store around 10 GB of data on AWS services and use MapReduce to process the data. Is using EC2 the best option? I want to use the free tier, but it says a maximum of 613 MB for free services on EC2, and that does not satisfy my…
user1524625
  • 271
  • 1
  • 7
  • 19
1
vote
1 answer

How to set heap size for EMR Master

I have a job which I trigger in EMR. The master triggers the mapper. Once it is done, it loads a heavyweight operation in memory and then eventually dumps it out. Right now, the job which runs on the cluster fails after a few minutes because…
user2655578
  • 11
  • 1
  • 4
1
vote
0 answers

How to execute shell commands in a Pig script on Amazon Elastic MapReduce?

Using a bootstrap action, I move some source files to the master node. While creating the job flow through elastic-mapreduce-client, I pass a Pig script that launches embedded Python from the source files present on the master node. Following…
1
vote
1 answer

Can't list the current job flow in Elastic MapReduce Command Line Tools?

I have installed the Amazon Elastic MapReduce Command Line Tools successfully. When I list the current job flows with the command below, $ ./elastic-mapreduce --list, it throws the following error. Error: Request has expired. Timestamp date:…
1
vote
2 answers

Elastic Map Reduce Error

I am getting an error when using Elastic Map Reduce and I am not sure what it means because it is not very descriptive. I want to know specifically what kind of JSONDecodeError I am getting. "12" is not descriptive. This is the output. I am using…
user1011332
  • 773
  • 12
  • 27
1
vote
2 answers

jar containing org.apache.hadoop.hive.dynamodb

I was trying to programmatically load a DynamoDB table into HDFS (via Java, not Hive). I couldn't find examples online on how to do it, so I thought I'd download the jar containing org.apache.hadoop.hive.dynamodb and reverse engineer the…
n915
  • 81
  • 1
  • 1
  • 5
1
vote
1 answer

Best way to split log files

I need help with what seems like a very common task: we have huge hourly log files containing many different events. We have been using Hive to split these events into different files, in a hard-coded way: from events insert overwrite table…
harelg
  • 61
  • 1
  • 5
1
vote
1 answer

Run a custom MapReduce Jar in Amazon Elastic Map Reduce against data from Amazon DynamoDB

I have data in DynamoDB which I want to run MapReduce jobs against. I've found a lot of tutorials which involve using Hive to run SQL against the DynamoDB data in EMR, but for the task I'm trying to perform it will be very difficult to efficiently…
David Chanin
  • 533
  • 6
  • 17
1
vote
2 answers

Sharing data between master and reduce

I need to perform aggregation using the results from all the reduce tasks. Basically the reduce task finds the sum and count and a value. I need to add all the sums and counts and find the final average. I tried using conf.setInt in reduce. But when…
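The arithmetic the question describes (adding up per-reducer sums and counts, then dividing once at the end) can be sketched in plain Python. This is only an illustration under assumptions: the function name `combine_partials` and the `(sum, count)` pair layout are hypothetical and not taken from the question, and the sketch does not show how the pairs are collected from the reducers in Hadoop.

```python
from typing import Iterable, Tuple


def combine_partials(partials: Iterable[Tuple[float, int]]) -> float:
    """Combine per-reducer (sum, count) pairs into one overall average.

    Hypothetical helper: each pair is assumed to hold the sum and the
    number of values seen by one reduce task.
    """
    total_sum = 0.0
    total_count = 0
    for partial_sum, partial_count in partials:
        total_sum += partial_sum
        total_count += partial_count
    if total_count == 0:
        raise ValueError("no values to average")
    # Divide only once, after all partial results have been added up.
    return total_sum / total_count


if __name__ == "__main__":
    # Two made-up partial results: (sum=10.0, count=2) and (sum=20.0, count=3).
    print(combine_partials([(10.0, 2), (20.0, 3)]))
```

Averaging the per-reducer averages directly would weight each reducer equally regardless of how many values it saw, which is why the sketch carries the counts through to the end.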
1
vote
2 answers

Output Folders for Amazon EMR

I want to run a custom jar whose main class runs a chain of MapReduce jobs, with the output of the first job going as the input of the second job, and so on. What do I set in FileOutputFormat.setOutputPath("what path should be here?"); If I specify…
1
vote
1 answer

Can't pipe two hadoop commands?

I want to run the following command: hadoop fs -ls hdfs:///logs/ | grep -oh "/[^/]*.gz" | grep -oh "[^/]*.gz" | hadoop fs -put - hdfs:///unzip_input/input It works when I call it from the shell after I ssh onto the master node. But it will not…
Shane
  • 2,315
  • 3
  • 21
  • 33
1
vote
1 answer

Hadoop UniqValueCount Map and Aggregate Reducer for Large Dataset (1 billion records)

I have a data set that has approximately 1 billion data points. There are about 46 million unique data points I want to extract from this. I want to use Hadoop to extract the unique values, but keep getting "Out of Memory" and Java heap size errors…
Suman
  • 9,221
  • 5
  • 49
  • 62
1
vote
1 answer

Python: Increasing timeout value in EMR using Yelp's mrjob

I am using Yelp's mrjob to write some MapReduce programs, and I am running them on EMR. My program has reducer code which takes a long time to execute. I am noticing that because of the default timeout period in EMR I am getting this error…
Read Q
  • 1,405
  • 2
  • 14
  • 26
1
vote
1 answer

Run a bootstrap action on an existing job flow

I have a job flow with keep-alive set, on which I want to run several bootstrap actions. One such action is a script that builds and installs Python 3.3. However the elastic-mapreduce CLI only allows for bootstrap actions to be run during job flow…
Matt Joiner
  • 112,946
  • 110
  • 377
  • 526