Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
6
votes
2 answers

Amazon Elastic MapReduce - SIGTERM

I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 inputs, at about 20 seconds per input), after 2.5 hours…
slavi
  • 401
  • 3
  • 10
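One frequent cause of mid-run SIGTERMs on long EMR streaming jobs is the task timeout: a mapper that emits no output or status for longer than `mapred.task.timeout` gets killed. Hadoop Streaming lets a Python job send heartbeats by writing `reporter:` lines to stderr. A minimal, hedged sketch (the per-record work here is just a placeholder):

```python
import sys

def process(records, report=sys.stderr):
    """Yield processed records while emitting Hadoop Streaming heartbeats.

    Long-running tasks that stay silent past mapred.task.timeout can be
    killed (SIGTERM); reporter lines written to stderr reset that timer.
    """
    for i, rec in enumerate(records):
        # Real work would go here; this sketch just uppercases the record.
        yield rec.strip().upper()
        if i % 100 == 0:
            report.write("reporter:status:processed %d records\n" % (i + 1))
```

In the actual streaming step the function would be driven by `sys.stdin` and the results printed to stdout; the reporter protocol itself (`reporter:status:...` and `reporter:counter:group,name,amount` on stderr) is standard Hadoop Streaming.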
6
votes
3 answers

Amazon Elastic Map Reduce for analyzing s3 logs

I am using EMR to analyze nginx web logs, but I need to process the logs so that they fall into rows and columns, making them easy to query. Thus I made two tables, rawlog and processedlog, in the following manner: create table rawlog(line…
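The raw-to-processed split usually hinges on parsing each log line into named fields first. A hedged Python sketch for nginx's combined log format (field names are illustrative; adjust the regex to your actual `log_format`):

```python
import re

# Combined log format (a common nginx default); group names are illustrative.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_line(line):
    """Split one raw nginx log line into named columns, or None if malformed."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```

The same column boundaries would then drive the processedlog table's schema.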
5
votes
1 answer

Using Distributed Cache with Pig on Elastic Map Reduce

I am trying to run my Pig script (which uses UDFs) on Amazon's Elastic Map Reduce. I need to use some static files from within my UDFs. I do something like this in my UDF: public class MyUDF extends EvalFunc { public DataBag exec(Tuple…
Vivek Pandey
  • 3,455
  • 1
  • 19
  • 25
5
votes
1 answer

Hadoop seems to modify my key object during an iteration over values of a given reduce call

Hadoop version: 0.20.2 (on Amazon EMR). Problem: I have a custom key that I write during the map phase, which I added below. During the reduce call, I do some simple aggregation on values for a given key. The issue I am facing is that during the iteration of…
Bhargava
  • 189
  • 3
  • 12
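The behavior described above is by design: Hadoop reuses a single key/value object and refills it in place on each step of the values iterator, so any stored references all end up pointing at the last value. A minimal Python sketch of the pitfall and the copy-based fix (class and helper names are illustrative):

```python
import copy

class ReusingIterator:
    """Mimics Hadoop refilling one key/value object per iteration step."""
    def __init__(self, values):
        self._values = values
        self._holder = {}  # the single reused object

    def __iter__(self):
        for v in self._values:
            self._holder["value"] = v  # refilled in place, like Hadoop does
            yield self._holder

def collect_wrong(it):
    # Keeps references to the one reused object: every entry sees the last value.
    return [item for item in it]

def collect_right(it):
    # Copy before storing, which is the standard fix in Hadoop reducers too.
    return [copy.deepcopy(item) for item in it]
```

In Java the equivalent fix is cloning the key/value (e.g. via `WritableUtils.clone`) before adding it to a collection.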
5
votes
0 answers

Spark 2.2 write partitionBy out of memory exception

I think anyone who has used Spark has run across OOM errors, and usually the source of the problem can be found easily. However, I am a bit perplexed by this one. Currently, I am trying to save by two different partitions, using the partitionBy…
Derek_M
  • 1,018
  • 10
  • 22
5
votes
1 answer

AWS EMR Cluster Streaming Step: Bad Request

I am trying to set up a trivial EMR job to perform word counting of massive text files, stored in s3://__mybucket__/input/. I am unable to correctly add the first of the two required streaming steps (the first is map input to wordSplitter.py, reduce…
Skyler
  • 2,834
  • 5
  • 22
  • 34
5
votes
1 answer

How to map fields in Hive for DynamoDb Amazon Console export?

I am trying to load into Hive a DynamoDB export file that was taken from the Amazon DynamoDB Web Console with the "Import/Export" tool. But I couldn't map the fields properly, because the DynamoDB Web Console "Export" tool uses "ETX" and "STX" characters. Below is an…
Barbaros Alp
  • 6,405
  • 8
  • 47
  • 61
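The control characters are plain byte values, so the export can also be split programmatically before (or instead of) mapping it in Hive. A hedged Python sketch, assuming ETX (0x03) separates fields and STX (0x02) separates an attribute name from its value; verify both against your actual export file:

```python
ETX = "\x03"  # field separator (assumption; check your export)
STX = "\x02"  # name/value separator within a field (assumption)

def parse_export_line(line):
    """Split one DynamoDB console-export line into {name: value} pairs."""
    fields = {}
    for field in line.rstrip("\n").split(ETX):
        if STX in field:
            name, _, value = field.partition(STX)
            fields[name] = value
    return fields
```

In Hive, the corresponding approach is declaring the same bytes as delimiters, e.g. `FIELDS TERMINATED BY '\003'` in the row format.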
5
votes
1 answer

What is the best practice to monitor AWS EMR job running progress?

I have the following code to run an EMR job, and it runs successfully. I also want to monitor the running status. I use the DescribeJobFlows API, but it says this API is deprecated according to…
coderz
  • 4,847
  • 11
  • 47
  • 70
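The replacements for the deprecated DescribeJobFlows are the cluster-scoped calls: DescribeCluster for overall cluster state and ListSteps for per-step progress. A minimal polling sketch, written with the client injected so it works with any object exposing `list_steps` (the boto3 EMR client does); the function name and delay are illustrative:

```python
import time

def wait_for_completion(emr, cluster_id, delay=30, sleep=time.sleep):
    """Poll step states via ListSteps until every step reaches a terminal
    state, then return the final list of states."""
    terminal = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}
    while True:
        steps = emr.list_steps(ClusterId=cluster_id)["Steps"]
        states = [s["Status"]["State"] for s in steps]
        if all(state in terminal for state in states):
            return states
        sleep(delay)
```

With boto3 this would be called as `wait_for_completion(boto3.client("emr"), "j-XXXX")`; boto3 also ships built-in waiters (e.g. `step_complete`) that cover the simple cases.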
5
votes
1 answer

Possibility of taking snapshot of AWS EMR cluster or namenode

I am new to AWS services and trying some use cases. I want to create EMR clusters on demand with some predefined configurations and applications/scripts installed. I was planning to create a snapshot of an existing EMR cluster, or at least the namenode…
shahsank3t
  • 252
  • 1
  • 13
5
votes
3 answers

Spark/Hadoop throws exception for large LZO files

I'm running an EMR Spark job on some LZO-compressed log-files stored in S3. There are several logfiles stored in the same folder, e.g.: ... s3://mylogfiles/2014-08-11-00111.lzo s3://mylogfiles/2014-08-11-00112.lzo ... In the spark-shell I'm running…
5
votes
1 answer

Can BigQuery's browser interface be white-labeled?

Like most people, we're pretty impressed with BigQuery. We're willing to put up with it being based on proprietary "Dremel" in exchange for not having to configure a ton of servers in our LAN, on EC2, or anywhere else. The REST API is excellent,…
pmueller
  • 313
  • 2
  • 7
5
votes
1 answer

Trouble using hbase from java on Amazon EMR

So I'm trying to query my HBase cluster on Amazon EC2 using a custom jar I launch as a MapReduce step. In my jar (inside the map function) I call HBase like so: public void map( Text key, BytesWritable value, Context contex ) throws IOException,…
5
votes
1 answer

create hive table from tab separated file in s3 using interactive mode

I've loaded tab-separated files into S3 with this folder structure under the bucket: bucket --> se --> y=2013 --> m=07 --> d=14 --> h=00. Each subfolder has one file that represents one hour of my traffic. I then created an EMR workflow to run in…
Gluz
  • 3,154
  • 5
  • 24
  • 35
5
votes
3 answers

Is it possible to run hadoop fs -getmerge in S3?

I have an Elastic Map Reduce job which is writing some files to S3, and I want to concatenate all the files to produce a single text file. Currently I'm manually copying the folder with all the files to our HDFS (hadoop fs copyFromLocal), then I'm…
yeforriak
  • 1,705
  • 2
  • 18
  • 26
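`getmerge` needs a filesystem path the client can list and read directly, so a common workaround is to pull the part files down and concatenate them locally. A minimal sketch of that local merge step (function name illustrative):

```python
import glob
import shutil

def getmerge(part_glob, dest_path):
    """Concatenate all part files matching a glob into a single file, in
    sorted order, mimicking `hadoop fs -getmerge` for local copies."""
    with open(dest_path, "wb") as out:
        for part in sorted(glob.glob(part_glob)):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return dest_path
```

Usage would look like `getmerge("/tmp/job-output/part-*", "/tmp/merged.txt")` after downloading the S3 folder; sorting keeps the parts in their `part-0000N` order.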
4
votes
4 answers

How do you use Python UDFs with Pig in Elastic MapReduce?

I really want to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get things to work properly. No matter what I try, my Pig job fails with the following exception logged: ERROR 2998: Unhandled…
Chris Phillips
  • 11,607
  • 3
  • 34
  • 45
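For reference, a Pig Python UDF is a plain function decorated with `outputSchema`; the sketch below adds a local stand-in for the decorator so the function can be unit-tested outside Pig (the UDF itself is illustrative):

```python
try:
    # When Pig runs the script under Jython, pig_util provides the decorator.
    from pig_util import outputSchema
except ImportError:
    # Local stand-in so the UDF can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorate(fn):
            fn.output_schema = schema
            return fn
        return decorate

@outputSchema("word_count:int")
def count_words(line):
    """Count whitespace-separated tokens in a chararray (None-safe)."""
    return 0 if line is None else len(line.split())
```

In the Pig script this would be registered with something like `REGISTER 'myudfs.py' USING jython AS myudfs;` (file and alias names illustrative); an ERROR 2998 at that point often traces back to a failing import inside the UDF file, so keeping the script importable locally makes the failure reproducible.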