Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
7 votes, 2 answers

AWS DynamoDB and MapReduce in Java

I have a huge DynamoDB table that I want to analyze to aggregate data that is stored in its attributes. The aggregated data should then be processed by a Java application. While I understand the really basic concepts behind MapReduce, I've never…
Mark
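For tables of moderate size there is a simpler route than a full MapReduce job: a paginated `Scan` feeding an in-memory aggregation. A hedged Python/boto3 sketch (the asker works in Java; the table and attribute names here are hypothetical):

```python
from collections import defaultdict


def aggregate_items(items, group_attr, value_attr):
    """Sum value_attr per distinct group_attr value (pure function, easy to unit-test)."""
    totals = defaultdict(float)
    for item in items:
        totals[item[group_attr]] += float(item[value_attr])
    return dict(totals)


def scan_all_items(table):
    """Yield every item from a boto3 DynamoDB Table, following Scan pagination."""
    resp = table.scan()
    yield from resp["Items"]
    while "LastEvaluatedKey" in resp:
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
        yield from resp["Items"]


# Usage (needs AWS credentials; table/attribute names are hypothetical):
# import boto3
# table = boto3.resource("dynamodb").Table("my-huge-table")
# totals = aggregate_items(scan_all_items(table), "category", "amount")
```

For a table that is genuinely too large to scan from one process, the same `aggregate_items` logic maps directly onto a reducer.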
6 votes, 0 answers

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

I would like to upgrade my AWS data pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP, etc. The change from EMR 3.x to 4.x/5.x requires the use of…
6 votes, 2 answers

Life of distributed cache in Hadoop

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after a job is completed? If they are deleted, which I presume they are, is there a way to make the cache remain…
JD Long
6 votes, 3 answers

How to register S3 Parquet files in a Hive Metastore using Spark on EMR

I am using Amazon Elastic Map Reduce 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1. Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names…
Sam King
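One common approach for the question above is to register the Parquet files as a Hive external table, either from the Hive CLI or via a `sql(...)` call in Spark. A hedged HiveQL sketch (table, columns, and bucket name are hypothetical; `STORED AS PARQUET` requires Hive 0.13+, which the asker's Hive 1.0.0 satisfies):

```sql
-- Hypothetical schema and S3 location:
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  user_id STRING,
  ts BIGINT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/path/to/parquet/';
```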
6 votes, 1 answer

More_like_this query with a filter

I have 1702 documents indexed in Elasticsearch, with category as one of the fields; it also has a field named SequentialId. I initially fetched the documents with category 1.1 which are between document 1 and document 850, like…
Sai
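One way to combine `more_like_this` with a range restriction is a `bool` query with a `filter` clause; a hedged sketch of the request body as a Python dict (parameter spellings follow newer Elasticsearch releases; very old versions used `like_text` and the `filtered` query instead):

```python
def mlt_with_range_filter(fields, like_text, id_field, lo, hi):
    """Build an Elasticsearch bool query: more_like_this scored, range filtered."""
    return {
        "query": {
            "bool": {
                "must": {
                    "more_like_this": {
                        "fields": fields,
                        "like": like_text,
                        "min_term_freq": 1,
                    }
                },
                "filter": {
                    "range": {id_field: {"gte": lo, "lte": hi}}
                },
            }
        }
    }


# Mirrors the question: match on category, restrict SequentialId to 1..850.
body = mlt_with_range_filter(["category"], "1.1", "SequentialId", 1, 850)
```

The `filter` clause does not affect scoring, so the `more_like_this` relevance ranking is preserved within the allowed ID range.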
6 votes, 2 answers

How to mute apache zookeeper debug messages (AWS EMR)?

How to mute DEBUG messages on AWS Elastic MapReduce Master node? hbase(main):003:0> list TABLE mydb …
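For ZooKeeper noise in the HBase shell, the usual lever is the log4j configuration that the shell reads on startup; a hedged sketch (the exact file path varies by EMR AMI and HBase install, so treat it as an assumption):

```properties
# e.g. in the HBase conf directory's log4j.properties on the master node:
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.hadoop.hbase.zookeeper=WARN
```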
6 votes, 3 answers

How to read a file from s3 in EMR?

I would like to read a file from S3 in my EMR Hadoop job. I am using the Custom JAR option. I have tried two solutions: org.apache.hadoop.fs.S3FileSystem: throws a NullPointerException. com.amazonaws.services.s3.AmazonS3Client: throws an exception,…
David Nemeskey
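From Python, the SDK route mentioned in the excerpt boils down to splitting the S3 URI and calling `get_object`; a hedged sketch (bucket and key names are hypothetical). Inside the Hadoop job itself, older EMR stacks typically read such files through the `s3n://` filesystem scheme rather than `org.apache.hadoop.fs.S3FileSystem`:

```python
def parse_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI: %r" % uri)
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


# Usage with boto3 (on EMR, credentials come from the instance role):
# import boto3
# bucket, key = parse_s3_uri("s3://my-bucket/input/data.txt")  # hypothetical path
# body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```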
6 votes, 1 answer

Getting "No space left on device" for approx. 10 GB of data on EMR m1.large instances

I am getting a "No space left on device" error when running my Amazon EMR jobs with m1.large as the instance type for the Hadoop instances created by the job flow. The job generates at most approx. 10 GB of data, and since the capacity of…
6 votes, 2 answers

Python Dependency Management on EMR

I'm sending code to Amazon's EMR via the mrjob/boto modules. I've got some external Python dependencies (i.e. numpy, boto, etc.) and currently have to download the source of the Python packages and send them over as a tarball in the "python_archives"…
follyroof
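mrjob can install dependencies at cluster-bootstrap time instead of shipping source tarballs; a hedged `mrjob.conf` sketch (the `bootstrap` option name is from recent mrjob versions; older releases used `bootstrap_cmds` and `bootstrap_python_packages`, and the package list here is only an example):

```yaml
# ~/.mrjob.conf
runners:
  emr:
    bootstrap:
      - sudo pip install numpy boto
```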
6 votes, 2 answers

Map Reduce output to CSV or do I need Key Values?

My map function produces a Key\tValue Value = List(value1, value2, value3) then my reduce function produces: Key\tCSV-Line Ex. 2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s, 2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s Ex.…
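If the downstream consumer only needs CSV, the reducer can emit one CSV row per key and drop the tab-separated key/value convention entirely; a small Python sketch using the `csv` module so embedded commas get quoted correctly:

```python
import csv
import io


def to_csv_line(key, values):
    """Render one reduced record as a single CSV row, key first."""
    buf = io.StringIO()
    csv.writer(buf).writerow([key] + [str(v) for v in values])
    return buf.getvalue().rstrip("\r\n")


# In a Hadoop Streaming reducer you would print(to_csv_line(key, values))
# once all values for a key have been gathered.
```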
6 votes, 1 answer

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files. I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading…
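One known trick for splitting Hive output is to pin the reducer count and force a shuffle, since each reducer writes its own file; a hedged HiveQL sketch (the parameter name is the Hadoop 1.x spelling, later stacks use `mapreduce.job.reduces`, and the path is hypothetical):

```sql
-- Roughly 32 output files, one per reducer:
set mapred.reduce.tasks=32;

INSERT OVERWRITE DIRECTORY 's3://my-bucket/unload/'
SELECT * FROM my_table
DISTRIBUTE BY rand();  -- forces a reduce stage and spreads rows evenly
```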
6 votes, 1 answer

Understanding a mapreduce algorithm for overlap calculation

I want help understanding the algorithm. I've pasted the algorithm explanation first and then my doubts. Algorithm (for calculating the overlap between record pairs): given a user-defined parameter K, the file DR (*format: record_id, data*) is split…
6 votes, 2 answers

The reduce fails due to "Task attempt failed to report status for 600 seconds. Killing!" Solution?

The reduce phase of the job fails with: # of failed Reduce Tasks exceeded allowed limit. The reason why each task fails is: Task attempt_201301251556_1637_r_000005_0 failed to report status for 600 seconds. Killing! Problem in detail: The Map phase…
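Two common fixes for this error are raising `mapred.task.timeout` and having the task report progress while it works. If the job is a streaming job, a task keeps itself alive by printing reporter lines to stderr; a minimal Python sketch of the Hadoop Streaming reporter protocol:

```python
import sys


def report_status(message):
    """Emit a Hadoop Streaming status line on stderr; each one resets
    the mapred.task.timeout clock (600 s by default)."""
    line = "reporter:status:%s" % message
    sys.stderr.write(line + "\n")
    return line


def report_counter(group, counter, amount=1):
    """Emit a Hadoop Streaming counter increment on stderr (also counts as progress)."""
    line = "reporter:counter:%s,%s,%d" % (group, counter, amount)
    sys.stderr.write(line + "\n")
    return line
```

In a Java job the equivalent is calling `progress()` on the context/reporter periodically during a long-running reduce.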
6 votes, 2 answers

Using s3distcp with Amazon EMR to copy a single file

I want to copy just a single file to HDFS using s3distcp. I have tried using the srcPattern argument, but it didn't help; it keeps throwing a java.lang.RuntimeException. It is possible that the regex I am using is the culprit, please help. My…
Amar
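A frequent cause of srcPattern surprises is that s3distcp applies the regex to the entire source path, not just the file name; a Python sketch that builds and checks such a pattern (treating full-match semantics, as in Java's `String.matches`, as an assumption about s3distcp's behavior; the file name is hypothetical):

```python
import re


def src_pattern_for(filename):
    """Build an s3distcp --srcPattern that selects a single file.

    The leading .* covers the bucket/prefix part of the path, since the
    regex is assumed to be matched against the whole source path.
    """
    return ".*" + re.escape(filename)


# Passed to s3distcp as, e.g.:  --srcPattern '.*file\.txt'
```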
6 votes, 4 answers

In Hadoop, where can I change the default URL ports 50070 and 50030 for the NameNode and JobTracker web pages?

There must be a way to change the ports 50070 and 50030 so that the following URLs display the cluster statuses on the ports I pick: NameNode - http://localhost:50070/ JobTracker - http://localhost:50030/
user836087
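The web UI addresses are ordinary configuration properties; a hedged sketch using the Hadoop 1.x property names (Hadoop 2+ renamed these, e.g. `dfs.namenode.http-address`, and the JobTracker was replaced by YARN; the port values here are examples):

```xml
<!-- hdfs-site.xml: NameNode web UI -->
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50071</value>
</property>

<!-- mapred-site.xml: JobTracker web UI -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50031</value>
</property>
```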