Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
7 votes, 2 answers

AWS DynamoDB and MapReduce in Java

I have a huge DynamoDB table that I want to analyze to aggregate data that is stored in its attributes. The aggregated data should then be processed by a Java application. While I understand the really basic concepts behind MapReduce, I've never…
Mark
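For tables of moderate size there is a simpler route than a full MapReduce job: a paginated `Scan` feeding an in-memory aggregation. A hedged Python/boto3 sketch (the asker works in Java; the table and attribute names here are hypothetical):

```python
from collections import defaultdict


def aggregate_items(items, group_attr, value_attr):
    """Sum value_attr per distinct group_attr value (pure function, easy to unit-test)."""
    totals = defaultdict(float)
    for item in items:
        totals[item[group_attr]] += float(item[value_attr])
    return dict(totals)


def scan_all_items(table):
    """Yield every item from a boto3 DynamoDB Table, following Scan pagination."""
    resp = table.scan()
    yield from resp["Items"]
    while "LastEvaluatedKey" in resp:
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
        yield from resp["Items"]


# Usage (needs AWS credentials; table/attribute names are hypothetical):
# import boto3
# table = boto3.resource("dynamodb").Table("my-huge-table")
# totals = aggregate_items(scan_all_items(table), "category", "amount")
```

For a table that is genuinely too large to scan from one process, the same `aggregate_items` logic maps directly onto a reducer.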
6 votes, 0 answers

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

I would like to upgrade my AWS data pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP, etc. The change from EMR 3.x to 4.x/5.x requires the use of…
6 votes, 2 answers

Life of distributed cache in Hadoop

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after a job is completed? If they are deleted, which I presume they are, is there a way to make the cache remain…
JD Long
6 votes, 3 answers

How to register S3 Parquet files in a Hive Metastore using Spark on EMR

I am using Amazon Elastic Map Reduce 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1. Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names…
Sam King
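One common approach for the question above is to register the Parquet files as a Hive external table, either from the Hive CLI or via a `sql(...)` call in Spark. A hedged HiveQL sketch (table, columns, and bucket name are hypothetical; `STORED AS PARQUET` requires Hive 0.13+, which the asker's Hive 1.0.0 satisfies):

```sql
-- Hypothetical schema and S3 location:
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  user_id STRING,
  ts BIGINT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/path/to/parquet/';
```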
6 votes, 1 answer

More_like_this query with a filter

I have 1702 documents indexed in Elasticsearch, with category as one of the fields; it also has a field named SequentialId. I initially fetched the documents with category 1.1 which are between document 1 and document 850, like…
Sai
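One way to combine `more_like_this` with a range restriction is a `bool` query with a `filter` clause; a hedged sketch of the request body as a Python dict (parameter spellings follow newer Elasticsearch releases; very old versions used `like_text` and the `filtered` query instead):

```python
def mlt_with_range_filter(fields, like_text, id_field, lo, hi):
    """Build an Elasticsearch bool query: more_like_this scored, range filtered."""
    return {
        "query": {
            "bool": {
                "must": {
                    "more_like_this": {
                        "fields": fields,
                        "like": like_text,
                        "min_term_freq": 1,
                    }
                },
                "filter": {
                    "range": {id_field: {"gte": lo, "lte": hi}}
                },
            }
        }
    }


# Mirrors the question: match on category, restrict SequentialId to 1..850.
body = mlt_with_range_filter(["category"], "1.1", "SequentialId", 1, 850)
```

The `filter` clause does not affect scoring, so the `more_like_this` relevance ranking is preserved within the allowed ID range.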
6 votes, 2 answers

How to mute apache zookeeper debug messages (AWS EMR)?

How to mute DEBUG messages on AWS Elastic MapReduce Master node? hbase(main):003:0> list TABLE mydb …
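For ZooKeeper noise in the HBase shell, the usual lever is the log4j configuration that the shell reads on startup; a hedged sketch (the exact file path varies by EMR AMI and HBase install, so treat it as an assumption):

```properties
# e.g. in the HBase conf directory's log4j.properties on the master node:
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.hadoop.hbase.zookeeper=WARN
```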
6 votes, 3 answers

How to read a file from s3 in EMR?

I would like to read a file from S3 in my EMR Hadoop job. I am using the Custom JAR option. I have tried two solutions: org.apache.hadoop.fs.S3FileSystem: throws a NullPointerException. com.amazonaws.services.s3.AmazonS3Client: throws an exception,…
David Nemeskey
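From Python, the SDK route mentioned in the excerpt boils down to splitting the S3 URI and calling `get_object`; a hedged sketch (bucket and key names are hypothetical). Inside the Hadoop job itself, older EMR stacks typically read such files through the `s3n://` filesystem scheme rather than `org.apache.hadoop.fs.S3FileSystem`:

```python
def parse_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI: %r" % uri)
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


# Usage with boto3 (on EMR, credentials come from the instance role):
# import boto3
# bucket, key = parse_s3_uri("s3://my-bucket/input/data.txt")  # hypothetical path
# body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```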
6 votes, 1 answer

Getting "No space left on device" for approx. 10 GB of data on EMR m1.large instances

I am getting a "No space left on device" error when running my Amazon EMR jobs with m1.large as the instance type for the Hadoop instances created by the job flow. The job generates at most approx. 10 GB of data, and since the capacity of…
6 votes, 2 answers

Python Dependency Management on EMR

I'm sending code to Amazon's EMR via the mrjob/boto modules. I've got some external Python dependencies (i.e. numpy, boto, etc.) and currently have to download the source of the Python packages and send them over as a tarball in the "python_archives"…
follyroof
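mrjob can install dependencies at cluster-bootstrap time instead of shipping source tarballs; a hedged `mrjob.conf` sketch (the `bootstrap` option name is from recent mrjob versions; older releases used `bootstrap_cmds` and `bootstrap_python_packages`, and the package list here is only an example):

```yaml
# ~/.mrjob.conf
runners:
  emr:
    bootstrap:
      - sudo pip install numpy boto
```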
6 votes, 2 answers

Map Reduce output to CSV or do I need Key Values?

My map function produces a Key\tValue Value = List(value1, value2, value3) then my reduce function produces: Key\tCSV-Line Ex. 2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s, 2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s Ex.…
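If the downstream consumer only needs CSV, the reducer can emit one CSV row per key and drop the tab-separated key/value convention entirely; a small Python sketch using the `csv` module so embedded commas get quoted correctly:

```python
import csv
import io


def to_csv_line(key, values):
    """Render one reduced record as a single CSV row, key first."""
    buf = io.StringIO()
    csv.writer(buf).writerow([key] + [str(v) for v in values])
    return buf.getvalue().rstrip("\r\n")


# In a Hadoop Streaming reducer you would print(to_csv_line(key, values))
# once all values for a key have been gathered.
```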
6 votes, 1 answer

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files. I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading…
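One known trick for splitting Hive output is to pin the reducer count and force a shuffle, since each reducer writes its own file; a hedged HiveQL sketch (the parameter name is the Hadoop 1.x spelling, later stacks use `mapreduce.job.reduces`, and the path is hypothetical):

```sql
-- Roughly 32 output files, one per reducer:
set mapred.reduce.tasks=32;

INSERT OVERWRITE DIRECTORY 's3://my-bucket/unload/'
SELECT * FROM my_table
DISTRIBUTE BY rand();  -- forces a reduce stage and spreads rows evenly
```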
6 votes, 1 answer

Understanding a mapreduce algorithm for overlap calculation

I want help understanding the algorithm. I've pasted the algorithm explanation first and then my doubts. Algorithm (for calculating the overlap between record pairs): given a user-defined parameter K, the file DR (*format: record_id, data*) is split…
6 votes, 2 answers

The reduce fails due to "Task attempt failed to report status for 600 seconds. Killing!" Solution?

The reduce phase of the job fails with: # of failed Reduce Tasks exceeded allowed limit. The reason why each task fails is: Task attempt_201301251556_1637_r_000005_0 failed to report status for 600 seconds. Killing! Problem in detail: The Map phase…
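Two common fixes for this error are raising `mapred.task.timeout` and having the task report progress while it works. If the job is a streaming job, a task keeps itself alive by printing reporter lines to stderr; a minimal Python sketch of the Hadoop Streaming reporter protocol:

```python
import sys


def report_status(message):
    """Emit a Hadoop Streaming status line on stderr; each one resets
    the mapred.task.timeout clock (600 s by default)."""
    line = "reporter:status:%s" % message
    sys.stderr.write(line + "\n")
    return line


def report_counter(group, counter, amount=1):
    """Emit a Hadoop Streaming counter increment on stderr (also counts as progress)."""
    line = "reporter:counter:%s,%s,%d" % (group, counter, amount)
    sys.stderr.write(line + "\n")
    return line
```

In a Java job the equivalent is calling `progress()` on the context/reporter periodically during a long-running reduce.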
6 votes, 2 answers

Using s3distcp with Amazon EMR to copy a single file

I want to copy just a single file to HDFS using s3distcp. I have tried using the srcPattern argument, but it didn't help; it keeps throwing a java.lang.RuntimeException. It is possible that the regex I am using is the culprit, please help. My…
Amar
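A frequent cause of srcPattern surprises is that s3distcp applies the regex to the entire source path, not just the file name; a Python sketch that builds and checks such a pattern (treating full-match semantics, as in Java's `String.matches`, as an assumption about s3distcp's behavior; the file name is hypothetical):

```python
import re


def src_pattern_for(filename):
    """Build an s3distcp --srcPattern that selects a single file.

    The leading .* covers the bucket/prefix part of the path, since the
    regex is assumed to be matched against the whole source path.
    """
    return ".*" + re.escape(filename)


# Passed to s3distcp as, e.g.:  --srcPattern '.*file\.txt'
```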
6 votes, 4 answers

In Hadoop, where can I change the default URL ports 50070 and 50030 for the NameNode and JobTracker web pages?

There must be a way to change the ports 50070 and 50030 so that the following URLs display the cluster statuses on the ports I pick: NameNode - http://localhost:50070/ JobTracker - http://localhost:50030/
user836087
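The web UI addresses are ordinary configuration properties; a hedged sketch using the Hadoop 1.x property names (Hadoop 2+ renamed these, e.g. `dfs.namenode.http-address`, and the JobTracker was replaced by YARN; the port values here are examples):

```xml
<!-- hdfs-site.xml: NameNode web UI -->
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50071</value>
</property>

<!-- mapred-site.xml: JobTracker web UI -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50031</value>
</property>
```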