Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
0
votes
1 answer

Amazon Elastic Map Reduce Hadoop Jobs

I'm new to Amazon Web Services and MapReduce. I am working on an academic project where I process a large batch of images and need to detect a particular object in them. After that, I need a Map filled by…
0
votes
1 answer

Run Pig with Lipstick on AWS EMR

I'm running an AWS EMR Pig job using script-runner.jar as described here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html Now I want to hook up Netflix's Lipstick to monitor my scripts. I set up the server,…
Tim
  • 2,008
  • 16
  • 22
0
votes
1 answer

Is there a way to launch EMR jobs on AWS Virtual Private Cloud?

Is there a way to launch EMR jobs on AWS Virtual Private Cloud? I am planning to launch an AWS Simple Workflow that will boot a cluster and add jobs to the cluster, using AWS VPC for security reasons.
user3335406
0
votes
1 answer

Mapreduce output showing all records in same line

I have implemented a MapReduce operation for a log file using Amazon and Hadoop with a custom JAR. My output shows the correct keys and values, but all the records are displayed on a single line. For example, given the following pairs: <1387,…
0
votes
1 answer

LeaseExpiredException with custom UDF in Hive

I have a Hive UDF which is supposed to extract the device from a UA string. It uses the ua-parser library (https://github.com/tobie/ua-parser). The UDF is rather simple: public class DeviceTypeExtractTest extends UDF{ private Text result = new…
Ana Todor
  • 781
  • 1
  • 6
  • 15
0
votes
1 answer

Force Hive to throw an error on an empty Table

I am using AWS EMR clusters to run Hive. I want to enforce that certain tables, such as reference tables, are never empty after initial creation, and if they are found to be empty, throw an error (or log a message) and stop…
cbradsh1
  • 493
  • 5
  • 12
0
votes
1 answer

Process entire files using Hadoop streaming on Amazon EMR

I have a directory full of gzipped text files on Amazon S3, and I'm trying to use Hadoop streaming on Amazon Elastic MapReduce to apply a function to each file individually (specifically, parse a multi-line header). The default Hadoop streaming…
0
votes
1 answer

Unable to load Hive-JDBC driver when accessed through MapReduce program on Amazon's Elastic MapReduce

I have written a MapReduce program in which I store part of the output data in a Hive table. I have used the Hive-JDBC driver to access the Hive table from the MapReduce code. This program compiled successfully on my local machine. After this, I created…
0
votes
1 answer

Issue with using files in distributed cache in Elastic MapReduce

I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job. However, my script doesn't seem to be able to find the modules in the cache. I archived the files into a tarball called helper_classes.tar and…
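One thing worth checking for the distributed-cache question above: a mapper can only import from the archive if the unpacked directory is on Python's module search path. The sketch below assumes the tarball was registered with a link name such as helper_classes (for example via a `#helper_classes` URI fragment), so that it appears as a directory of that name in the task's working directory. The excerpt is truncated, so the actual setup is unknown, and the function name here is hypothetical.

```python
import os
import sys


def add_cached_archive_to_path(link_name="helper_classes", base_dir="."):
    """Put an unpacked cache archive directory on sys.path.

    Assumption: the archive has been unpacked into base_dir/link_name
    before the mapper starts. Returns True if the directory exists and
    is now importable from, False otherwise.
    """
    path = os.path.abspath(os.path.join(base_dir, link_name))
    if not os.path.isdir(path):
        return False
    if path not in sys.path:
        # Insert first so these modules win over same-named installed ones.
        sys.path.insert(0, path)
    return True


if __name__ == "__main__":
    if not add_cached_archive_to_path():
        # Writing to stderr keeps the message out of the mapper's output.
        sys.stderr.write("helper_classes directory not found in working dir\n")
```

Calling this at the top of the mapper, before importing any helper module, also gives an explicit stderr message when the archive is missing, which is easier to diagnose than a bare ImportError.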
0
votes
1 answer

R Reducer is not working properly in Amazon EMR

I have written MapReduce code in R to run on Amazon EMR. My input file format: URL1 word1 word2 word3 URL2 word4 word2 word3 URL3 word1 word7 word2 I'm expecting the output (URLs concatenated with spaces) as: word1 URL1 URL3 word2 URL1 URL2…
Nadaraj
  • 509
  • 1
  • 7
  • 14
0
votes
1 answer

Map Error - Attempt_xxxx_ Timed out after 600 seconds

I'm using Hadoop 2.2.0, and when I run my map tasks I get the following error: attempt_xxx Timed out after 1800000 seconds (it's 1800000 because I have changed the config for mapreduce.task.timeout). Below is my map code: public class MapTask { …
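A note on the units in the timeout question above: Hadoop's mapreduce.task.timeout property is specified in milliseconds, so the default of 600000 corresponds to the 600 seconds in the title, and a setting of 1800000 corresponds to 1800 seconds (30 minutes). A minimal sketch of that arithmetic follows; the helper name is hypothetical and not part of Hadoop.

```python
def task_timeout_seconds(timeout_ms: int) -> float:
    """Convert a mapreduce.task.timeout value (milliseconds) to seconds."""
    return timeout_ms / 1000


print(task_timeout_seconds(600_000))    # default setting -> 600.0
print(task_timeout_seconds(1_800_000))  # setting from the question -> 1800.0
```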
0
votes
1 answer

"Invalid option" error when passing arguments to EMR Bootstrap Action

I'm programmatically provisioning an EMR cluster using the Java SDK, and am trying to pass arguments to the setup-impala script. The code I have looks like this: ... List bootstrapActions = new…
mindcrime
  • 657
  • 8
  • 23
0
votes
1 answer

mmh3 not installed on Elastic MapReduce in AWS

I need to use mmh3 for hashing. However, when I run "python MultiwayJoin.py R.csv S.csv T.csv -r emr > output.txt" in a terminal, it returns this error: File "MultiwayJoin.py", line 5, in import mmh3 ImportError: No module named mmh3
0
votes
3 answers

How is data partitioned and distributed among datanodes in MapReduce?

I'm new to MapReduce, and my task is to process a large data set (lines of records). One thing I need in my mapper is the line number of a specific record, so that the reducer can then process the line number information emitted by the mapper. For instance,…
i3wangyi
  • 2,279
  • 3
  • 15
  • 12
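For the line-number question above, it may help to recall that Hadoop's default text input format gives each mapper record a key that is the line's byte offset in the file, not its line number, and that each mapper sees only its own split. The sketch below simulates this in plain Python. The split rule used here (a record belongs to the split containing its first byte) is a simplification, and both function names are hypothetical.

```python
from collections import defaultdict


def records_with_offsets(data: bytes):
    """Yield (byte_offset, line) pairs, the key/value shape a mapper
    receives from a line-oriented text input format."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n")
        offset += len(raw)


def assign_to_splits(data: bytes, split_size: int):
    """Group records by the split that contains their first byte.

    Simplification: real split boundaries depend on block size and on
    the record reader's rules for lines that cross a boundary.
    """
    splits = defaultdict(list)
    for offset, line in records_with_offsets(data):
        splits[offset // split_size].append((offset, line))
    return dict(splits)


data = b"a,1\nbb,2\nccc,3\n"
for split_id, records in assign_to_splits(data, split_size=8).items():
    # A per-split counter restarts at 0 in every split, so it is not a
    # global line number; only the byte offset is globally meaningful.
    for local_line_no, (offset, line) in enumerate(records):
        print(split_id, local_line_no, offset, line)
```

Because the local counter restarts in every split, a global line number generally has to be derived in a separate step, for example from per-split line counts, or stored in the records themselves before the job runs.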
0
votes
1 answer

bulk indexing in elasticsearch issue

I am trying to index a file using the code below, but the documents are not being indexed. Could anybody explain why? public static void main(String[] args) throws IOException { String line; List l=new…
Amaresh
  • 3,231
  • 7
  • 37
  • 60