Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1166 questions
0
votes
2 answers

Why doesn't increasing the number of instances increase Hive query speed?

I created a table using Hive on Amazon's Elastic MapReduce, imported data into it, and partitioned it. Now I run a query that counts the most frequent words in one of the table's fields. I ran that query when I had 1 master and 2 core instances and it took…
keepkimi
  • 373
  • 3
  • 12
0
votes
1 answer

Import module in MRJob on EMR

Simple question: I have a module headers.py which defines a couple variables I need in my main MRJob script. I should be able to run the job with python MRMyJob -r emr --file=headers.py s3://input/data/path and then in my MRJob script (MRMyJob),…
Vyassa Baratham
  • 1,457
  • 12
  • 18
0
votes
2 answers

Error running python mrjob word count example

I'm trying to run the example word count map reduce task using mrjob. I get the following error: Traceback (most recent call last): File "mr.py", line 3, in from mrjob.job import MRJob File…
nickponline
  • 25,354
  • 32
  • 99
  • 167
0
votes
1 answer

How to keep an EMR cluster running

Possible Duplicate: Re-use Amazon Elastic MapReduce instance Can I keep a launched EMR cluster running and keep submitting new jobs to it until I am done (say after a couple of days) and then shut down the cluster or do I have to launch my own…
iCode
  • 4,308
  • 10
  • 44
  • 77
0
votes
2 answers

Amazon Hadoop EMR & custom input file format

I am having a bit of trouble getting Amazon EMR to accept a custom InputFileFormat: public class Main extends Configured implements Tool { public static void main(String[] args) throws Exception { int res = ToolRunner.run(new JobConf(),…
jldupont
  • 93,734
  • 56
  • 203
  • 318
0
votes
1 answer

Custom RecordReader in EMR Job

How do I specify a custom RecordReader to use in a job flow on Amazon EMR? Note: Hadoop newbie here.
jldupont
  • 93,734
  • 56
  • 203
  • 318
-1
votes
1 answer

How to copy a file from HDFS to the local file system of the cluster nodes in an EMR cluster using the Java API

In an EMR cluster, using the Java API, how do I copy a file from HDFS to the local file system of the cluster nodes?
Rajesh Goel
  • 3,277
  • 1
  • 17
  • 13
-1
votes
1 answer

spark-sql: How to get the progress bar (with stages and tasks)?

How can I get a progress bar in spark-sql? spark-shell gets a nice progress bar like this: [Stage7:===========> (14174 + 5) / 62500] This progress bar tells what is the total number of executors allocated, how many are…
-1
votes
1 answer

Looking for examples on how to launch AWS EMR cluster with python to run a pyspark step

I'm looking for an end-to-end example of launching an AWS EMR cluster with a pyspark step and have it automatically terminate when the step is done or fails. I've seen pieces of this explained but not one complete example.
Fred R.
  • 557
  • 3
  • 7
  • 16
-1
votes
1 answer

Convert JSON keys into columns in Spark

I have written code that reads the data and picks the second element from the tuple. The second element happens to be JSON. Code to get the JSON: import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import…
Ajay
  • 473
  • 7
  • 25
-1
votes
1 answer

Number of executors and cores

I am new to Spark and would like to know how many cores and executors should be used for a Spark job on AWS if we have 2 slave c4.8xlarge nodes and 1 c4.8xlarge master node. I have tried different combinations but am not able to understand the…
Bharath
  • 467
  • 2
  • 8
  • 20
-1
votes
1 answer

Configuring Spark on EMR

When you pick a more performant node, say an r3.xlarge vs. an m3.xlarge, will Spark automatically utilize the additional resources? Or is this something you need to manually configure and tune? As far as configurations go, which are the most…
flybonzai
  • 3,763
  • 11
  • 38
  • 72
-1
votes
2 answers

Connecting to an SFTP server from within AWS

I am trying to create a job that connects to an SFTP server from AWS services and brings files into S3 storage. It will be an automated job that runs every night and brings data into S3. I have seen documentation about how to connect AWS and import data…
-1
votes
1 answer

What's the right way to manage code deployment and management for AWS

We are very new to AWS EMR and are onboarding onto it, and we are looking for the right code repositories and automated code deployment tools. Is there a right tool for this with which we can manage code deployments end to end? Primarily we are…
-1
votes
1 answer

Pig script not working using Amazon EMR

I cannot get this script to work: raw = LOAD 's3://xxxxxxxxx/*' AS (name:chararray, year:float, occurrences:float, books:float); B = GROUP raw BY name; C = FOREACH B GENERATE B.name, (SUM(B.occurrences) / SUM(B.books)) AS average; D = ORDER C BY…