Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
4
votes
2 answers

Understanding the Hadoop File System Counters

I want to understand the filesystem counters in Hadoop. Below are the counters for a job that I ran. In every job that I run, I observe that the Map file bytes read is almost equal to the HDFS bytes read. And I observe that the file bytes…
4
votes
1 answer

How can I use s3 object names as inputs to an MRJob mapper, but not the s3 objects themselves?

I'm missing something obvious about Yelp's mrjob library. Setting up an MRJob class is almost trivially easy. Running it over a file or stdin is just as easy. But how can I change the input to the job from a file, either local or in S3, to, say, keys…
Christopher
  • 42,720
  • 11
  • 81
  • 99
4
votes
1 answer

Concatenate S3 files to read in EMR

I have an S3 bucket with log files that I want to concatenate, then use as an input to an EMR job. The log files are in paths like: bucket-name/[date]/product/out/[hour]/[minute-based-file]. I'd like to take all the minute logs in all the hour…
Evan
  • 2,983
  • 8
  • 31
  • 35
4
votes
3 answers

Hive / ElasticMapreduce: How to get JsonSerDe to ignore malformed JSON?

I'm fairly new to Hive and ElasticMapreduce and I'm currently stuck on a particular problem. When running a Hive statement on a table with billions of lines of JSON objects, the MapReduce job crashes as soon as even one of those lines is invalid /…
saschor
  • 319
  • 4
  • 12
4
votes
2 answers

ColumnFamilyInputFormat - Could not get input splits

I am getting a weird exception when I try to access Cassandra from Hadoop using the ColumnFamilyInputFormat class. In my Hadoop process, this is how I connect to Cassandra, after including cassandra-all.jar version 1.1: private void…
mvallebr
  • 2,388
  • 21
  • 36
4
votes
1 answer

Starting AWS elastic mapreduce jobflow from Java API. Where should my hive script go?

I have been developing a data processing application using Amazon Elastic MapReduce and Hive. Now that my Hive scripts work when I SSH and run them using the Interactive Mode Job Flow, I'm trying to create a Job Flow using the AWS Java API. Using…
4
votes
1 answer

When using LZO on Hadoop output on AWS EMR, does it index the files (stored on S3) for future automatic splitting?

I want to use LZO compression on my Elastic Map Reduce job's output that is being stored on S3, but it is not clear if the files are automatically indexed so that future jobs run on this data will split the files into multiple tasks. For example,…
Dolan Antenucci
  • 15,432
  • 17
  • 74
  • 100
4
votes
4 answers

Too many open files in EMR

I am getting the following exception in my reducers: EMFILE: Too many open files at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161) at…
Amar
  • 11,930
  • 5
  • 50
  • 73
4
votes
1 answer

Why does Nutch only run the fetch step on one Hadoop node, when the cluster has 5 nodes total?

I'm running Nutch on Elastic MapReduce, with 3 worker nodes. I'm using Nutch 1.4, with the default configuration it ships with (after adding a user agent). However, even though I'm crawling a list of 30,000 domains, the fetching step is only run…
cberner
  • 3,000
  • 3
  • 22
  • 34
3
votes
1 answer

Minimum AWS policy requirements to run an EMR job

I'd like to run an Elastic MapReduce job on data from the S3 bucket com.test.mybucket, using the MRJob Python framework. However, I have lots of other data in S3, and other EC2 instances that I don't want to touch. What is the minimum possible set of…
Kevin Burke
  • 61,194
  • 76
  • 188
  • 305
3
votes
1 answer

R segue createCluster() issue

I'm trying to create a cluster on EC2. I have an account set up and validated with AWS. I have successfully downloaded and installed the segue package and related packages and set my AWS credentials. My problem starts when I try to create a…
screechOwl
  • 27,310
  • 61
  • 158
  • 267
3
votes
2 answers

Getting data in and out of Elastic MapReduce HDFS

I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files out of HDFS. It works on my single-node Hadoop setup and I'm eager to get it working on tens of nodes within Elastic…
rongenre
  • 1,334
  • 11
  • 21
3
votes
2 answers

What is the best way to run Lucene/Solr on Hadoop?

We run Solr on an Amazon Web Services EC2 instance with a 1TB EBS volume to store the index so that we can easily launch additional servers with the same (read-only) index. However, our index is soon going to exceed 1TB, and I don't really want to…
Joe Emison
  • 31
  • 2
3
votes
2 answers

Amazon MapReduce with cronjob + APIs

I have a website set up on an EC2 instance which lets users view info from 4 of their social networks. Once a user joins, the site should update their info every night, to show up-to-date and relevant information the next day. Initially we had a…
Andre
  • 4,417
  • 8
  • 32
  • 37
3
votes
1 answer

AWS Data Pipeline S3 to DynamoDB JSON Error

I'm trying to import a TSV file from S3 into DynamoDB using Data Pipeline, but I keep hitting a MalformedJsonException. I've validated both pieces of JSON that I provide: the definition of the data pipeline and the manifest of the S3 folder, so…
tghw
  • 25,208
  • 13
  • 70
  • 96