Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1166 questions
0
votes
1 answer

Loading the map datatype column using python script as reducer using hive

In one of the columns of a Hive table, I want to store key-value pairs. Hive's complex data type map supports that construct. (This is only a toy example of what I want to be able to do; I have many more columns that I want to compress like this.) So I…
darshan
  • 1,230
  • 1
  • 11
  • 17
0
votes
0 answers

AWS Elastic Mapreduce optimizing Pig job

I am using boto 2.8.0 to create EMR jobflows over large log files stored in S3. I am relatively new to Elastic MapReduce and am getting a feel for how to properly handle jobflows from this issue. The log files in question are stored in S3 with keys…
DeaconDesperado
  • 9,977
  • 9
  • 47
  • 77
0
votes
1 answer

Hive multiple subqueries and group by

I'm switching statistics from MySQL to Amazon DynamoDB and Elastic MapReduce. I have the query below that works with MySQL, and I have the same table in Hive and need the same results as in MySQL (product views for last_week, last_month and…
trkich
  • 79
  • 1
  • 2
  • 9
0
votes
1 answer

Incorrect or incompletely read Value sent to map method in Mapper class

I have a job that consists of 3 steps. My input is encrypted JSON objects (one per line) stored in Amazon S3 (s3e://). Job…
Kamesh Rao Yeduvakula
  • 1,215
  • 2
  • 15
  • 27
0
votes
1 answer

How to do an "Order of Events" query in Hadoop Hive?

I've been learning Hive over the past 2 months, but I'm having trouble figuring out how to do certain sequence-based queries. Take this example: I have a huge log consisting of user actions. Every user action has a date field but obviously may not…
David
  • 1,648
  • 1
  • 16
  • 31
0
votes
0 answers

EMR No output for a long time

I have a MapReduce job written in Python using the MRJob library. The job takes around 30 minutes to complete on my local machine. While running the same job on EMR, I am seeing no output for a long time (about 1 hour). I had to shut down the job. Also the…
Read Q
  • 1,405
  • 2
  • 14
  • 26
0
votes
1 answer

What is the effort required for migrating from Hadoop 0.20.2 to 0.20.205 and from 0.20.2 to 1.0.1?

I was looking to migrate my EMR implementation from an older version to the latest versions because I am primarily facing a lot of issues. My current implementation uses Hadoop 0.20.2. I wanted to understand how much effort in terms of code change…
Kamesh Rao Yeduvakula
  • 1,215
  • 2
  • 15
  • 27
0
votes
3 answers

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at path 'hdfs:///logs'. Each log entry spans multiple lines but has a starting and an ending marker to separate it from the next entry. Not all entries in a log file are useful; the entries which are useful…
Deepak Garg
  • 366
  • 3
  • 12
0
votes
1 answer

Hive job gets killed and query execute() remains hanging

I am using hive-jdbc-0.7.1-cdh3u5.jar. I have some memory-intensive queries running on EMR which occasionally fail. When I look at the job tracker I see that the query has been killed and I see the following error: java.io.IOException: Task process…
magicalo
  • 463
  • 2
  • 5
  • 12
0
votes
2 answers

Can I access ZooKeeper from an AWS Elastic MapReduce job?

I'm new to Hadoop, and running under AWS Elastic MapReduce. I need cluster-wide atomic counters in Hadoop, and it was suggested that I use ZooKeeper for this. I believe ZooKeeper is part of the Hadoop stack (right?). How would I access it from an Elastic…
David Parks
  • 30,789
  • 47
  • 185
  • 328
0
votes
1 answer

Best practice to add time partitions to a table

I have event tables partitioned by time (year, month, day, hour). I want to join a few events in a Hive script that gets the year, month, day, and hour as variables. How can I also add, for example, events from all 6 hours prior to my time without 'recover…
harelg
  • 61
  • 1
  • 5
0
votes
1 answer

hi1.4xlarge SSD EC2 instance for EMR

I have several Hadoop jobs which I run on EMR. A few of those jobs need to process log files. The log files are huge, about 3 GB each in .gz format, and are stored on S3. Presently, I use m1.xlarge for processing, and it takes around 3 hours just to…
Kartikeya Sinha
  • 508
  • 1
  • 5
  • 20
0
votes
0 answers

k-means exception on EMR: java.lang.IllegalArgumentException: This file system object does not support access to the request path

I'm trying to run the k-means algorithm from Mahout on EMR. The input vectorized data is located in S3. My command: elastic-mapreduce --jar s3://mybucket/dir/mahout-examples-0.8-SNAPSHOT-job.jar --main-class org.apache.mahout.driver.MahoutDriver --arg…
denys
  • 2,437
  • 6
  • 31
  • 55
0
votes
1 answer

When is it a good idea to increase/decrease the number of nodes interactively on a Hadoop MapReduce job?

I have an intuition that increasing or decreasing the number of nodes interactively on a running job can speed up map-heavy jobs, but won't help with reduce-heavy jobs, where most of the work is done by reduce. There's an FAQ about this but it doesn't really…
tphyahoo
  • 139
  • 1
  • 6
0
votes
1 answer

DynamoDB S3 Imports

When importing from S3 to DynamoDB, does this count towards provisioned write throughput? I have a service that is only read from, except for batch updates from a multi-gigabyte file in S3. We don't want to pay for provisioned writes all month, and…
DeejUK
  • 12,891
  • 19
  • 89
  • 169