Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

Synonymous tag: emr

452 questions

2 answers

Understanding the Hadoop File System Counters

I want to understand the filesystem counters in Hadoop. Below are the counters for a job that I ran. In every job that I run, I observe that the Map file bytes read is almost equal to the HDFS bytes read. And I observe that the file bytes…

java hadoop mapreduce hdfs elastic-map-reduce

asked May 19 '13 at 11:43

Mahalakshmi Lakshminarayanan

1 answer

How can I use s3 object names as inputs to an MRJob mapper, but not the s3 objects themselves?

I'm missing something obvious about Yelp's mrjob library. Setting up an MRJob class is almost trivially easy. Running it over a file or stdin is just as easy. But how can I change the input to the job from a file, either local or in S3, to, say, keys…

python mapreduce boto elastic-map-reduce mrjob

asked May 16 '13 at 22:11

Christopher

1 answer

Concatenate S3 files to read in EMR

I have an S3 bucket with log files that I want to concatenate, then use as an input to an EMR job. The log files are in paths like: bucket-name/[date]/product/out/[hour]/[minute-based-file]. I'd like to take all the minute logs in all the hour…

hadoop amazon-web-services amazon-s3 elastic-map-reduce emr

asked May 02 '13 at 23:06

Evan

3 answers

Hive / Elastic MapReduce: How to make JsonSerDe ignore malformed JSON?

I'm fairly new to Hive and Elastic MapReduce and am currently stuck on a particular problem. When running a Hive statement on a table with billions of lines of JSON objects, the MapReduce job crashes as soon as a single one of those lines is invalid /…

java json hadoop hive elastic-map-reduce

asked Jan 03 '13 at 11:07

saschor

2 answers

ColumnFamilyInputFormat - Could not get input splits

I am getting a weird exception when I try to access Cassandra from Hadoop using the ColumnFamilyInputFormat class. In my Hadoop process, this is how I connect to Cassandra, after including cassandra-all.jar version 1.1: private void…

hadoop nosql cassandra elastic-map-reduce

asked Nov 26 '12 at 14:33

mvallebr

1 answer

Starting AWS elastic mapreduce jobflow from Java API. Where should my hive script go?

I have been developing a data processing application using Amazon Elastic MapReduce and Hive. Now that my Hive scripts work when I SSH and run them using the Interactive Mode Job Flow, I'm trying to create a Job Flow using the AWS Java API. Using…

java amazon-s3 amazon-web-services hive elastic-map-reduce

asked Nov 19 '12 at 21:29

defavantbop

1 answer

When using LZO on Hadoop output on AWS EMR, does it index the files (stored on S3) for future automatic splitting?

I want to use LZO compression on my Elastic Map Reduce job's output that is being stored on S3, but it is not clear if the files are automatically indexed so that future jobs run on this data will split the files into multiple tasks. For example,…

amazon-s3 amazon-web-services elastic-map-reduce lzo

asked Oct 22 '12 at 21:13

Dolan Antenucci

4 answers

Too many open files in EMR

I am getting the following exception in my reducers: EMFILE: Too many open files at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161) at…

hadoop mapreduce elastic-map-reduce emr

asked Oct 18 '12 at 11:10

Amar

1 answer

Why does Nutch only run the fetch step on one Hadoop node, when the cluster has 5 nodes total?

I'm running Nutch on Elastic MapReduce, with 3 worker nodes. I'm using Nutch 1.4, with the default configuration it ships with (after adding a user agent). However, even though I'm crawling a list of 30,000 domains, the fetching step is only run…

hadoop nutch elastic-map-reduce

asked Apr 22 '12 at 00:19

cberner

1 answer

Minimum AWS policy requirements to run an EMR job

I'd like to run an Elastic MapReduce job on data from the S3 bucket com.test.mybucket, using the MRJob Python framework. However, I have lots of other data in S3, and other EC2 instances that I don't want to touch. What is the minimum possible set of…

amazon-web-services elastic-map-reduce mrjob

asked Dec 06 '11 at 19:31

Kevin Burke

1 answer

R segue createCluster() issue

I'm trying to create a cluster on EC2. I have an account set up and validated with AWS. I have successfully downloaded and installed the segue package and related packages and set my AWS credentials. My problem starts when I try to create a…

r amazon-ec2 elastic-map-reduce

asked Nov 17 '11 at 04:13

screechOwl

2 answers

Getting data in and out of Elastic MapReduce HDFS

I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files out of HDFS. It works on my single-node Hadoop setup and I'm eager to get it working on tens of nodes within Elastic…

hadoop elastic-map-reduce

asked Oct 09 '11 at 05:42

rongenre

2 answers

What is the best way to run Lucene/Solr on Hadoop?

We run Solr on an Amazon Web Services EC2 instance with a 1TB EBS volume to store the index so that we can easily launch additional servers with the same (read-only) index. However, our index is soon going to exceed 1TB, and I don't really want to…

lucene solr hadoop mapreduce elastic-map-reduce

asked Jun 01 '11 at 13:19

Joe Emison

2 answers

Amazon MapReduce with cronjob + APIs

I have a website set up on an EC2 instance which lets users view info from 4 of their social networks. Once a user joins, the site should update their info every night, to show up-to-date and relevant information the next day. Initially we had a…

amazon-web-services mapreduce elastic-map-reduce

asked May 21 '11 at 09:27

Andre

1 answer

AWS Data Pipeline S3 to DynamoDB JSON Error

I'm trying to import a TSV file from S3 into DynamoDB using Data Pipelines, but I keep hitting a MalformedJsonException. I've validated both pieces of JSON that I provide: the definition of the data pipeline and the manifest of the S3 folder, so…

amazon-web-services elastic-map-reduce amazon-data-pipeline

asked Jan 25 '18 at 16:17

tghw
