Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
2 answers

Problems using distcp and s3distcp with my EMR job that outputs to HDFS

I've run a job on AWS's EMR, and stored the output in the EMR job's HDFS. I am then trying to copy the result to S3 via distcp or s3distcp, but both are failing as described below. (Note: the reason I'm not just sending my EMR job's output directly…
Dolan Antenucci
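For reference, the usual pattern on older EMR AMIs is to run s3distcp as a Hadoop jar step (or from the master node over SSH) against the job's HDFS output. A minimal sketch of the invocation; the jar path, HDFS path, and bucket name are placeholders, not taken from the question:

```python
# Sketch: copying an EMR job's HDFS output to S3 with s3distcp.
# Jar location and paths are placeholders for illustration.
def build_s3distcp_command(hdfs_src, s3_dest):
    """Build the argument list for an s3distcp invocation."""
    return [
        "hadoop", "jar", "/home/hadoop/lib/emr-s3distcp-1.0.jar",
        "--src", hdfs_src,
        "--dest", s3_dest,
    ]

cmd = build_s3distcp_command("hdfs:///output/", "s3://my-bucket/output/")
# On the master node this could be executed with subprocess.run(cmd, check=True).
```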
1
vote
1 answer

Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?

I'm having an issue where my Hadoop job on AWS's EMR is not being saved to S3. When I run the job on a smaller sample, the job stores the output just fine. When I run the same command but on my full dataset, the job completes again, but there is…
1
vote
3 answers

Write some data (lines) from my mappers to separate directories depending on some logic in my mapper code

I am using mrjob for my EMR needs. How do I write some data (lines) from my mappers to "separate directories" depending on some logic in my mapper code that I can: tar gzip and upload to separate S3 buckets (depending on the directory name) after…
newToFlume
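One way to reason about the routing step, independent of mrjob's API: classify each line, then append it to a file inside a per-category directory, so the directories can later be tarred, gzipped, and uploaded separately. A local Python sketch of that logic; the classifier and directory layout are hypothetical:

```python
import os
import tempfile

def route_lines(lines, classify, base_dir):
    """Write each line into a subdirectory chosen by classify(line).

    Local sketch of the routing idea only; in an actual mrjob job the
    same effect is usually achieved by prefixing the output key with
    the target directory name and splitting the output afterwards.
    """
    handles = {}
    try:
        for line in lines:
            category = classify(line)
            if category not in handles:
                d = os.path.join(base_dir, category)
                os.makedirs(d, exist_ok=True)
                handles[category] = open(os.path.join(d, "part-00000"), "w")
            handles[category].write(line + "\n")
    finally:
        for h in handles.values():
            h.close()

# Demo with a made-up classifier that routes on the line's prefix.
out_dir = tempfile.mkdtemp()
route_lines(["a:1", "b:2", "a:3"], lambda l: l.split(":")[0], out_dir)
routed = sorted(os.listdir(out_dir))
```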
1
vote
3 answers

How do I make sure RegexSerDe is available to my Hadoop nodes?

I'm trying to attack the problem of analyzing web logs with Hive, and I've seen plenty of examples out there, but I can't seem to find anyone with this specific issue. Here's where I'm at: I've set up an AWS ElasticMapReduce cluster, I can log in,…
awshepard
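For context, the usual fix in a Hive session is to `ADD JAR` the SerDe so Hive ships it to the task nodes with the job. A sketch, with a hypothetical jar path, table layout, and regex:

```sql
-- Jar path and schema are placeholders; hive-contrib ships RegexSerDe.
ADD JAR /home/hadoop/hive/lib/hive-contrib.jar;

CREATE EXTERNAL TABLE web_logs (
  host STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) (.*)"
)
LOCATION 's3://my-bucket/logs/';
```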
1
vote
2 answers

Creating a large covariance matrix

I need to create ~110 covariance matrices of doubles, each 19347 x 19347, and then add them all together. This in itself isn't very difficult, and for smaller matrices the following code works fine. covmat <- matrix(0, ncol=19347,…
TrueWheel
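As a rough illustration of the memory constraint: a single 19347 x 19347 matrix of doubles is about 3 GB, so holding all ~110 at once is infeasible, and accumulating a running sum one matrix at a time is the workable shape. A sketch in Python/NumPy with toy dimensions (the question itself uses R):

```python
import numpy as np

def summed_covariance(blocks):
    """Accumulate covariance matrices one at a time into a running sum.

    Each block is an (observations x variables) array; np.cov with
    rowvar=False treats columns as variables. Only one covariance
    matrix plus the running total is ever held in memory.
    """
    total = None
    for block in blocks:
        cov = np.cov(block, rowvar=False)
        total = cov if total is None else total + cov
    return total

# Toy data: three 100 x 5 samples stand in for the full-size inputs.
rng = np.random.default_rng(0)
result = summed_covariance(rng.standard_normal((100, 5)) for _ in range(3))
```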
1
vote
3 answers

How to use external data with Elastic MapReduce

From Amazon's EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon S3? Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet,…
1
vote
1 answer

Need suggestion on using Map/Reduce to create solr index

I'm pretty new to Map/Reduce world and trying to evaluate the best option to figure if I can leverage it to create index in Solr. Currently, I'm using a regular crawl to fetch data and index it in Solr directly. This is working without any issues.…
Shamik
1
vote
2 answers

Ganglia and Amazon Elastic Map Reduce - install issues

Following the instructions for "Initializing Ganglia on a Job Flow" I get my cluster up but don't see any Ganglia process running (on port 8157). …
Tom Emmons
0
votes
1 answer

Setting jobconf parameters with Karmasphere Analyst & Amazon Elastic MapReduce

The Karmasphere Analyst profiler has suggested that I set some jobconf parameters, for example, mapred.map.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec. But I don't know where to set these and I can't find it in the Karmasphere…
Vinay
0
votes
2 answers

Can't get --supported-products option to work with Amazon's elastic-mapreduce Ruby client for Karmasphere Analytics

I am trying to use Karmasphere Analytics with AWS. This page says to use --supported-products with the ruby client. However, when I run the command (exactly as entered on that page), I get an error "Error: invalid option: --supported-products" I am…
Vinay
0
votes
1 answer

Force one reducer in AWS EMR

How do I ensure that there's only one reducer for my EMR Streaming job? Is there any way to do this from the web frontend when I'm creating a new Jobflow?
jetru
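For Hadoop Streaming, a single reducer is typically forced with `-D mapred.reduce.tasks=1` among the step arguments. A sketch of such an argument list; the bucket names and script names are placeholders:

```python
# Sketch of a Hadoop Streaming step argument list forcing one reducer,
# so the job produces a single output file. Paths are placeholders.
step_args = [
    "-D", "mapred.reduce.tasks=1",   # exactly one reduce task
    "-input", "s3://my-bucket/input/",
    "-output", "s3://my-bucket/output/",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
]
```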
0
votes
1 answer

Starting jobs with direct calls to Hadoop from within SSH

I've been able to kick off job flows using the elastic-mapreduce ruby library just fine. Now I have an instance which is still 'alive' after its jobs have finished. I've logged in to it using SSH and would like to start another job, but each of my…
Trindaz
0
votes
1 answer

What's the best way to do set-membership tests in hadoop?

I'm using hadoop to process a sequence of analytics records for my application. I want to categorise users based on which events I see in their stream and then use that information in a later stage when iterating over the stream again. For…
Fasaxc
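A common answer to this kind of question is a Bloom filter: a compact set representation with no false negatives and a tunable false-positive rate, small enough to ship to every mapper (e.g. via the distributed cache) instead of the full member set. A minimal pure-Python sketch:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter for approximate set-membership tests.

    Membership queries never return a false negative; false positives
    occur with a probability controlled by num_bits and num_hashes.
    """

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes independent bit positions from salted MD5.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```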
0
votes
1 answer

java.lang.RuntimeException: java.lang.ClassNotFoundException when trying to run Jar job on Elastic MapReduce

What should I change to fix the following error: I'm trying to start a job on Elastic MapReduce, and it crashes every time with the message: java.lang.RuntimeException: java.lang.ClassNotFoundException: iataho.mapreduce.NewMaxTemperatureMapper at…
Arsen Zahray
0
votes
1 answer

Spark: Reporting Total, and Available Memory of the Cluster

I'm running a Spark job on Amazon EMR; I would like to keep reporting the total and free memory of the cluster from within the program itself. Is there any method in the Spark API which provides information about the cluster's memory?
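Spark's Scala API does expose `SparkContext.getExecutorMemoryStatus`, a map from executor address to a pair of (maximum memory, remaining memory) in bytes; PySpark does not surface it directly, so it is only reachable through the JVM gateway. The sketch below therefore just shows the aggregation over an already-extracted dict; the executor names and sizes are made up:

```python
def cluster_memory(executor_status):
    """Aggregate per-executor (max_bytes, free_bytes) pairs into totals.

    executor_status mirrors the shape of the map returned by Scala's
    sc.getExecutorMemoryStatus; the sample data here is fabricated.
    """
    total = sum(max_b for max_b, _ in executor_status.values())
    free = sum(free_b for _, free_b in executor_status.values())
    return total, free

status = {
    "executor-1": (4 * 1024**3, 1 * 1024**3),
    "executor-2": (4 * 1024**3, 2 * 1024**3),
}
total_bytes, free_bytes = cluster_memory(status)
```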