Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
0
votes
1 answer

Running HIVE queries directly from S3 input files

I am using an interactive Hive session in Elastic MapReduce to run Hive. Previously I was loading data from S3 into Hive tables. Now I want to run some scripts on S3 input files without loading the data into Hive tables. Is this possible? If yes, then how…
asquare
  • 77
  • 1
  • 3
  • 11
0
votes
1 answer

For a large mapreduce job, with a few lingering reducers, can this job be safely downsized?

Chris Smith answered this question and said I could post it. If you have a 200-node mapreduce job, with just 3 running reduce jobs left lingering, is it safe to switch off all nodes except the master and the 3 with the running jobs? Plus maybe a…
tphyahoo
  • 139
  • 1
  • 6
0
votes
2 answers

Can I access ZooKeeper from an AWS Elastic MapReduce job?

I'm new to Hadoop, and running under AWS Elastic MapReduce. I need cluster-wide atomic counters in Hadoop and was advised to use ZooKeeper for this. I believe ZooKeeper is part of the Hadoop stack (right?). How would I access it from an Elastic…
David Parks
  • 30,789
  • 47
  • 185
  • 328
0
votes
1 answer

Sessionized web logs, get previous and next domain

We have a large pile of web log data. We need to sessionize it, and also generate the previous domain, and next domain for each session. I am testing via an interactive job flow on AWS EMR. Right now I'm able to get the data sessionized using this…
Dan
  • 5,081
  • 1
  • 18
  • 28
0
votes
1 answer

Load balancing Cascading JDBCTap for MySQL

I am considering writing a Cascading application that issues SELECT statements to MySQL databases where each query can return millions of rows. Each database exists on N slaves and one master, as shown here:…
0
votes
2 answers

Why doesn't increasing the number of instances increase Hive query speed?

I created a table using Hive in Amazon's Elastic MapReduce, imported data into it and partitioned it. Now I run a query that counts the most frequent words in one of the table's fields. I ran that query when I had 1 master and 2 core instances and it took…
keepkimi
  • 373
  • 3
  • 12
0
votes
2 answers

Can you programmatically control Elastic MapReduce jobs easily?

There is a command-line client written in Ruby that is used as the standard. However, it doesn't run on Ruby 1.9. There is also a very good aws-sdk for Ruby, but it doesn't support EMR. Is there a good alternative?
nkadwa
  • 839
  • 8
  • 16
0
votes
1 answer

How do I pass the Hadoop Streaming -file flag to Amazon Elastic MapReduce?

The -file flag allows you to pack your executable files as part of the job submission and thus allows you to run a MapReduce job without first manually copying the executable to S3. Is there a way to use the -file flag with Amazon's elastic-mapreduce…
tibbe
  • 8,809
  • 7
  • 36
  • 64
0
votes
1 answer

Elastic MapReduce fails with: 1: Syntax error: "(" unexpected

I'm trying to run a native binary, compiled on my x86 Debian Squeeze box (to match the Amazon AMI), and I'm consistently getting this weird…
tibbe
  • 8,809
  • 7
  • 36
  • 64
0
votes
1 answer

Performance impact on Elastic MapReduce for scale-up vs. scale-out scenarios

I just ran the Elastic MapReduce sample application "Apache Log Processing". Default: when I ran it with the default configuration (2 small core instances), it took 19 minutes. Scale out: then I ran it with 8 small core instances -…
paras_doshi
  • 1,027
  • 1
  • 12
  • 19
-1
votes
3 answers

Comparing two large datasets using a MapReduce programming model

Let's say I have two fairly large data sets: the first is called "Base" and contains 200 million tab-delimited rows, and the second is called "MatchSet" and has 10 million tab-delimited rows of similar data. Let's say I then also have an…
j03m
  • 5,195
  • 4
  • 46
  • 50
-1
votes
2 answers

Hive with Tez out of memory error

I have a script that runs fine on Hive 13 (YARN). I am experimenting with Tez. When I run a query on a large dataset, I run into the following error: 0 FATAL [Socket Reader #1 for port 55739] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread…
user2942227
  • 1,023
  • 6
  • 19
  • 26
-1
votes
1 answer

Error while running a MapReduce program in Python

I am running a MapReduce program in Python on my local system and getting the error below: Password:Traceback (most recent call last): File "./wordcount_mapper.py", line 7, in <module> filename = os.environ["mapreduce_map_input_file"] File…
Aquarius24
  • 1,806
  • 6
  • 33
  • 61
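A minimal sketch related to the excerpt above, assuming (the traceback is truncated, so this is a guess) that the failure is a `KeyError` because `mapreduce_map_input_file` is only set when the mapper runs under Hadoop Streaming, not in a plain local run. Hadoop Streaming exposes job configuration values as environment variables with dots replaced by underscores; the helper name `current_input_file` and the `"<local-run>"` placeholder are hypothetical.

```python
import os


def current_input_file(default="<local-run>"):
    """Return the input file name exposed by Hadoop Streaming, or a default.

    Assumption: outside Hadoop Streaming neither variable is set, so a
    direct os.environ[...] lookup would raise KeyError.
    """
    # Newer releases export mapreduce.map.input.file; older ones used
    # map.input.file. Dots become underscores in the environment.
    for name in ("mapreduce_map_input_file", "map_input_file"):
        value = os.environ.get(name)
        if value:
            return value
    return default
```

In a mapper, `filename = current_input_file()` would then work both under Hadoop Streaming and when piping a file into the script locally for testing.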
-1
votes
2 answers

Python csv skipping fields with quoted

I am trying to practice working with large data on AWS using MapReduce and Python. I have the code: import sys import re import csv import glob import string #class MyDialect(csv.Dialect): #strict = True …
Sean Sullivan
  • 329
  • 3
  • 6
  • 17
-1
votes
1 answer

How to write a MapReduce program with Amazon EC2 and S3

I want to analyse data stored in Amazon S3. How can I write a Java program on Amazon EMR that accesses this data? The data URL is http://s3.amazonaws.com/aws-publicdatasets/trec/kba/FAKBA1/index.html