Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1166 questions
-1
votes
1 answer

How do I create "side-effect" files using Python streaming on AWS Elastic MapReduce?

I'm running a Python streaming job on Amazon's Elastic MapReduce which needs to output multiple files from the reducer. The descriptions I've found on the web of how to do this have all been old, so they reference the deprecated property…
Tom Morris
  • 10,490
  • 32
  • 53
-1
votes
1 answer

Some tasks in map() fail when I run it on AWS

I was running page rank on s3://aws-publicdatasets/common-crawl/parse-output/segment/1346876860819/metadata-XXXX dataset. The program worked when I used 10 files (about 1GB) with 2 m1.medium, but when I use 300 files (20GB) with 5 m3.xlarge instances,…
Tong Wei
  • 1
  • 1
-1
votes
2 answers

Where to run Elastic Map Reduce CLI from or use an alternative to EMR CLI?

Where is it recommended to run the EMR CLI: from my local Linux workstation or from an AWS virtual server? Are there (better) alternatives to the EMR CLI, in case I want to programmatically access my clusters and run MapReduce jobs?
Stephan Kristyn
  • 15,015
  • 14
  • 88
  • 147
-1
votes
1 answer

Override hadoop

I'm running an EMR Activity inside a Data Pipeline analyzing log files and I get the following error when my Pipeline fails: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory…
cevallos.valtira
  • 191
  • 1
  • 1
  • 8
-1
votes
1 answer

How to process EMR_FORCEUFIMAPPING and EMR_GDICOMMENT?

I am converting EMF to PDF, but I have run into a problem. I have some EMF spool files that contain undocumented EMR structures such as EMR_FORCEUFIMAPPING and EMR_GDICOMMENT. MSDN is not very descriptive about these records, so…
Robert
  • 53
  • 1
  • 6
-2
votes
1 answer

What is the best approach for this batch Spark use case?

I am trying to build a solution over S3. I have a lot of files that are dumped to S3 every hour. In Spark, I need to process these files and write the results back to S3. What is the best approach? One approach I have thought of is 1) Whenever a…
Nipun
  • 4,119
  • 5
  • 47
  • 83
-2
votes
1 answer

Manually resizing a running cluster from the AWS console vs. decommissioning and commissioning

Will manually resizing a running cluster from the AWS console use the commissioning and decommissioning process internally? We are working on an EMR cluster that we resize manually from the AWS console, which leads to missing /user/oozie/share/lib/ jars some…
Pooja Soni
  • 137
  • 1
  • 2
  • 9
-2
votes
2 answers

Using MapReduce to read the files within a directory

My S3 directory is /sssssss/xxxxxx/rrrrrr/xx/file1 /sssssss/xxxxxx/rrrrrr/xx/file2 /sssssss/xxxxxx/rrrrrr/xx/file3 /sssssss/xxxxxx/rrrrrr/yy/file4 /sssssss/xxxxxx/rrrrrr/yy/file5 /sssssss/xxxxxx/rrrrrr/yy/file6 How can my MapReduce program read…
-3
votes
1 answer

Looking for a way to de-reference a bash var wrapped in a python command call

I'm trying to find a way to de-reference goldenClusterID to use it in an AWS CLI command to terminate my cluster. This program is to compensate for dynamic Job-Flow Numbers generated each day so normal Data Pipeline shutdown via schedule is…
-3
votes
1 answer

How to run parallel clustering using Amazon EMR / Spark from files in S3

I have 200,000 points in a 1000-dimensional space. If I load all these points using sc.textFile and exhaustively calculate the distance between each pair of points, how can I do it in a parallel manner? Will Spark automatically parallelize the work for me?
Rodrigo Stv
  • 405
  • 3
  • 11
-4
votes
1 answer

How to install mosh on an EMR master

I've been having issues with SSH connections terminating, and thought it might be better to install mosh on the EMR master, so that there would be some protection from loss of connectivity. I've created the following script to act as part of my…
theheadofabroom
  • 20,639
  • 5
  • 33
  • 65