Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

Synonymous tags: elastic-map-reduce, amazon-emr

1166 questions

votes

1 answer

Loading the map datatype column using python script as reducer using hive

In one of the columns of Hive table, I want to store key-value pairs. Hive's complex data-type map supports that construct. (This is only a toy example of what I want to be able to do, I have many more columns that I want to compress like this) So I…

python amazon-s3 hive emr

asked Mar 27 '13 at 08:34

darshan

1,230
1
11
17

votes

0 answers

AWS Elastic Mapreduce optimizing Pig job

I am using boto 2.8.0 to create EMR jobflows over large log files stored in S3. I am relatively new to Elastic MapReduce and am getting a feel for how to properly handle jobflows from this issue. The log files in question are stored in S3 with keys…

amazon-web-services apache-pig boto emr

asked Mar 19 '13 at 13:33

DeaconDesperado

9,977
9
47
77

votes

1 answer

Hive multiple subqueries and group by

I'm switching statistics from MySQL to Amazon DynamoDB and Elastic MapReduce. I have the query below that works with MySQL, and I have the same table in Hive and need the same results as in MySQL (product views for last_week, last_month and…

hive amazon-dynamodb emr hiveql

asked Mar 04 '13 at 12:09

trkich

votes

1 answer

Incorrect or incompletely read Value sent to map method in Mapper class

I have a Job that consists of 3 steps. My input is encrypted JSON objects (one per line) stored in Amazon S3. (s3e://). Job…

hadoop amazon-s3 amazon-emr emr

asked Jan 31 '13 at 13:28

Kamesh Rao Yeduvakula

1,215
2
15
27

votes

1 answer

How to do an "Order of Events" query in Hadoop Hive?

I've been learning Hive over the past 2 months, but I'm having trouble figuring out how to do certain sequence-based queries. Take this example: I have a huge log consisting of user actions. Every user action has a date field but obviously may not…

hadoop hive emr hiveql

asked Jan 26 '13 at 04:53

David

1,648
1
16
31

votes

0 answers

EMR No output for a long time

I have a MapReduce job written in python using MRJob library. The job takes around 30 mins to complete on my local machine. While running the same job on the EMR, I am seeing no output for a long time (~=1hr). I had to close down the job. Also the…

python hadoop mapreduce emr mrjob

asked Jan 18 '13 at 11:42

Read Q

1,405
2
14
26

votes

1 answer

What is the effort required for migrating from Hadoop 0.20.2 to 0.20.205 and from 0.20.2 to 1.0.1?

I was looking to migrate my EMR implementation from an older version to the latest versions because I am primarily facing a lot of issues. My current implementation uses Hadoop 0.20.2. I wanted to understand how much effort in terms of code change…

hadoop amazon-ami emr

asked Dec 26 '12 at 06:02

Kamesh Rao Yeduvakula

1,215
2
15
27

votes

3 answers

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at path 'hdfs:///logs'. Each log entry spans multiple lines but has a starting and an ending marker to demarcate between two entries. Now, not all entries in a log file are useful; the entries which are useful…

hadoop hive hadoop-streaming emr

asked Dec 21 '12 at 04:43

Deepak Garg

votes

1 answer

Hive job gets killed and query execute() remains hanging

I am using hive-jdbc-0.7.1-cdh3u5.jar. I have some memory-intensive queries running on EMR which occasionally fail. When I look at the job tracker I see that the query has been killed and I see the following error: java.io.IOException: Task process…

amazon-ec2 hive amazon-emr emr

asked Dec 19 '12 at 21:32

magicalo

votes

2 answers

Can I access zookeeper from AWS Elastic Mapreduce job

I'm new to Hadoop, and running under AWS Elastic Mapreduce. I need cluster-wide atomic counters in Hadoop and was suggested to use zookeeper for this. I believe zookeeper is part of the Hadoop stack (right?), how would I access it from an Elastic…

hadoop amazon-web-services apache-zookeeper elastic-map-reduce emr

asked Oct 27 '12 at 03:46

David Parks

30,789
47
185
328

votes

1 answer

Best practice to add time partitions to a table

I have an events table partitioned by time (year, month, day, hour). I want to join a few events in a Hive script that gets the year, month, day and hour as variables. How can you also add, for example, events from all 6 hours prior to my time without 'recover…

hive emr hiveql

asked Oct 22 '12 at 11:40

harelg

votes

1 answer

hi1.4xlarge SSD EC2 instance for EMR

I have several Hadoop jobs which I run on EMR. A few of those jobs need to process log files. The log files are huge, ~3 GB each in .gz format, and are stored on S3. Presently, I use m1.xlarge for processing; it takes around 3 hours just to…

hadoop amazon-s3 amazon-ec2 solid-state-drive emr

asked Oct 11 '12 at 09:23

Kartikeya Sinha

votes

0 answers

k-means exception on EMR: java.lang.IllegalArgumentException: This file system object does not support access to the request path

I'm trying to run the k-means algorithm from Mahout on EMR. The input vectorized data is located in S3. My command: elastic-mapreduce --jar s3://mybucket/dir/mahout-examples-0.8-SNAPSHOT-job.jar --main-class org.apache.mahout.driver.MahoutDriver --arg…

java hadoop mahout amazon-emr emr

asked Oct 10 '12 at 14:45

denys

2,437
6
31
55

votes

1 answer

when is it a good idea to increase/decrease the number of nodes interactively on a hadoop mapreduce job?

I have an intuition that increasing/decreasing the number of nodes interactively on a running job can speed up map-heavy jobs, but won't help with reduce-heavy jobs, where most of the work is done by reduce. There's an FAQ about this but it doesn't really…

hadoop mapreduce emr

asked Oct 09 '12 at 16:26

tphyahoo

votes

1 answer

DynamoDB S3 Imports

When importing from S3 to DynamoDB, does this count towards provisioned write throughput? I have a service that is only read from, except for batch updates from a multi-gigabyte file in S3. We don't want to pay for provisioned writes all month, and…

amazon-s3 amazon-web-services amazon-dynamodb amazon-emr emr

asked Sep 07 '12 at 12:01

DeejUK

12,891
19
89
169
