Questions tagged [elastic-map-reduce]

Amazon Elastic MapReduce is a web service that enables the processing of large amounts of data.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

452 questions
1
vote
1 answer

HBase HFile Corruption on AWS S3

I am running HBase on an EMR cluster (emr-5.7.0) enabled on S3. We are using 'ImportTsv' and 'CompleteBulkLoad' utilities for importing the data into HBase. During our process, we have observed that intermittently there were failures stating that…
Sridher
  • 201
  • 3
  • 11
1
vote
1 answer

Clear data from HDFS on AWS EMR in Hadoop 1.0.3

For various reasons I'm running some jobs on EMR with AMI 2.4.11/Hadoop 1.0.3. I'm trying to run a cleanup of HDFS after my jobs by adding an additional EMR step. Using boto: step = JarStep( 'HDFS cleanup', 'command-runner.jar', …
Chet
  • 21,375
  • 10
  • 40
  • 58
1
vote
2 answers

When can we init resources for a hadoop Mapper?

I have a small SQLite database (post code -> US city name) and I have a big S3 file of users. I would like to map every user to the city name associated with their postcode. I follow the famous WordCount.java example but I'm not sure how mapReduce…
Thomas
  • 8,306
  • 8
  • 53
  • 92
1
vote
0 answers

Running a MapReduce program in Eclipse, but it always spills

I have written a MapReduce program. At first it was running fine, but after I changed something, my computer suddenly reported that it was out of memory. Then I realized the job I had run used lots of memory, and I don't know why. And …
JEUDominic
  • 11
  • 3
1
vote
1 answer

Amazon EMR MapReduce progress rollback?

Hi, I just came across a strange issue: I run a Java MapReduce job on EMR. The data was about 1 TB and I used 1 master + 8 slaves. All of the instances are r2.2xlarge. Initially, everything looks fine, as below: INFO mapreduce.Job: map 0% reduce…
1
vote
1 answer

Elasticsearch master slave configuration

How do I configure Elasticsearch on a master node and a data node? What is the difference between the two types of Elasticsearch cluster? How do we benefit from using Elasticsearch with Hadoop?
1
vote
2 answers

On EMR Spark, JDBC load fails the first time, then works

I'm using spark-shell with Spark 2.1.0 in AWS Elastic Map Reduce 5.3.1 to load data from a Postgres database. loader.load always fails and then succeeds. Why would this happen? [hadoop@[SNIP] ~]$ SPARK_PRINT_LAUNCH_COMMAND=1 spark-shell…
rcrogers
  • 2,281
  • 1
  • 17
  • 14
1
vote
1 answer

Can I force YARN to use the master node for the Application Master container?

My big ol' master node hardware is doing practically nothing during my Hadoop/Spark runs because YARN uses a random slave node for its AM on each task. I like the old Hadoop 1 way better; lots of log chasing and ssh pain was avoided that way when…
Judge Mental
  • 5,209
  • 17
  • 22
1
vote
1 answer

In AWS EMR how do I log the classpath to debug classloader issues

I am in Classloader hell - Hadoop (up to 2.7.2) uses an out-dated version of HttpClient (4.2.5) https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/dependency-analysis.html This is clashing with the version of…
kellyfj
  • 6,586
  • 12
  • 45
  • 66
1
vote
1 answer

How to create an Elasticsearch template with respect to a dynamic index type

I am trying to create an Elasticsearch dynamic template with respect to index type (by date; the index will be created by date partition). My sample index URL will be…
1
vote
1 answer

.persist() line sometimes leads to Java Out of Heap Space error

As far as I know, when you use .persist(), the line itself only sets the persistence level, and the next action in the script then triggers the actual persistence work. However, sometimes, seemingly depending on the…
Kristian
  • 21,204
  • 19
  • 101
  • 176
1
vote
2 answers

How to manually make an AWS EMR step fail

I came across a problem and thought of a question I did not find a good answer to. And that is, how can I purposely make an AWS EMR step fail? I have a Spark Scala script which is added as a Spark step with some command line arguments and the output…
V. Samma
  • 2,558
  • 8
  • 30
  • 34
1
vote
1 answer

Amazon S3 Error Code: 400 while running mr-job on EMR

Got this error running a custom jar on EMR. Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID:…
1
vote
2 answers

Mapping a range of warc.gz files, EMR

I have been running a streaming step in AWS/EMR with a mapper and reducer written in Python to map some of the archives in Common Crawl for sentiment analysis. I am moving from the older common crawl textData format to the newer warc.gz format and…
DataGuy
  • 1,695
  • 4
  • 22
  • 38
1
vote
0 answers

--jars from different locations causes different jdbc behavior

When I load a MySQL JDBC driver by first copying it to the driver, and then including it via --jars /path/to/jdbc/driver.jar, then referencing that jdbc driver and loading data into a dataframe succeeds. $ pyspark --jars /path/to/jdbc/driver.jar >>>…
Kristian
  • 21,204
  • 19
  • 101
  • 176