Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions

vote

1 answer

Spark as execution engine with Hive

Can Spark 2.4.2 be used as an execution engine with Hive 2.3.4 on Amazon EMR? I have linked the jar files with Hive (scala-library, spark-core, spark-common-network) via the following commands: cd $HIVE_HOME/lib ln -s…

asked Jul 01 '19 at 13:45

Shubham Gupta

vote

1 answer

Load only part of HBase/Phoenix table as Spark DataFrame

I am using the following code in Spark to load specified columns of my HBase/Phoenix table into a Spark DataFrame. I can specify the columns I want to load, but can I specify which rows? Or do I have to load all rows? import…

apache-spark hbase amazon-emr apache-phoenix

asked Jun 27 '19 at 22:39

Christopher Ferris

vote

1 answer

Error 403 while creating EMR cluster using my reducer and mapper?

I am trying to use my bucket to supply the arguments for EMR to create a cluster, but it is giving me "All access to this object has been disabled (Service: Amazon S3; Status Code: 403; Error Code: AllAccessDisabled;" I have used my Reducer and…

amazon-web-services amazon-emr

asked Jun 21 '19 at 18:37

Ryan Terry

vote

0 answers

Performance issue while converting PySpark DataFrame to JSON

I would like to insert PySpark DataFrame content into Redis efficiently. I have tried a couple of methods, but none of them gives the expected results. Converting the df to JSON takes 30 seconds. The goal is to SET the JSON payload into a Redis cluster for…

pyspark apache-spark-sql amazon-emr spark-redis

asked Jun 18 '19 at 04:21

user2407164

vote

1 answer

Why is AWS CloudFormation throwing "Encountered unsupported property InstanceGroups"?

When I deploy the AWS CloudFormation script below, I get the following error: "Encountered unsupported property InstanceGroups". I have used InstanceGroups in the past without any issues. Here is an example of how others are using it:…

amazon-web-services apache-spark aws-cloudformation amazon-emr

asked Jun 17 '19 at 16:48

user422930

vote

2 answers

Copy files from S3 to EMR local using Lambda

I need to move files from S3 to EMR's local dir /home/hadoop programmatically using Lambda. S3DistCp copies them over to HDFS. I then log in to EMR and run an HDFS copyToLocal command on the command line to get the files to /home/hadoop. Is there a…

amazon-s3 aws-lambda copy amazon-emr

asked Jun 17 '19 at 01:27

Rhiya

vote

2 answers

How to assign a JSON file as STEP in an EMR cluster in Terraform?

I'm building an EMR cluster in Terraform, and in the STEP argument I want to load a JSON file that describes the list of steps. I tried this in my main.tf: ressource "aws_emr" "emr" { ... ... step =…

amazon-web-services terraform amazon-emr

asked Jun 12 '19 at 07:37

user1297406

vote

1 answer

AWS EMR Presto job

Is it possible to submit Presto jobs/steps to an EMR cluster, just as you can submit Hive jobs/steps via a script in S3? I would rather not SSH into the instance to execute the commands, but do it automatically.

amazon-web-services amazon-emr presto

asked Jun 11 '19 at 11:40

KOT

vote

0 answers

Amazon EMR w/ Hadoop 3.1

I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs rather slowly on EMR compared with, say, Databricks. I realize that if I were able to use Hadoop 3.1, it would be much more performant because it has a high performance output…

apache-spark amazon-emr

asked Jun 10 '19 at 14:30

femibyte

vote

0 answers

DynamoDB EMR Hive Connector writes 1 item at a time

While writing to DynamoDB with on-demand capacity using Hive > INSERT OVERWRITE TABLE t SELECT * FROM s3data; I notice that it writes 1 item at a time, which is evident from the write capacity graph below. Here are the settings: SET…

hive amazon-dynamodb amazon-emr

asked Jun 06 '19 at 13:37

Somasundaram Sekar

vote

0 answers

DynamoDB EMR Integration + WARN Task Calculator Warning- Map tasks is less than 1

I'm trying to integrate DynamoDB with Spark on EMR using the solution provided in the AWS blog: https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark I'm able to retrieve the results as expected. But always the Task…

scala amazon-web-services apache-spark amazon-dynamodb amazon-emr

asked Jun 03 '19 at 08:39

Sudhev Das

vote

0 answers

Correct Spark configuration to fully utilise EMR cluster resources

I'm quite new to configuring Spark, so I wanted to know whether I am fully utilising my EMR cluster. The EMR cluster is using Spark 2.4 and Hadoop 2.8.5. The app reads loads of small gzipped JSON files from S3, transforms the data and writes them back…

apache-spark amazon-emr

asked May 29 '19 at 13:48

User

vote

0 answers

How to choose custom AMI for EMR via Airflow

I'm spinning up an EMR cluster via Airflow and running a PySpark job on it. I want to use a custom AMI to boot up the cluster via Airflow. I'm following the boto3 syntax found in the docs online, but the AMI is not being picked up. Is there something…

python-3.6 boto3 airflow amazon-emr amazon-ami

asked May 29 '19 at 10:59

Lia Tasoudi

vote

0 answers

Exception when trying to create bucketed table using Spark with AWS Glue as Metastore

On EMR 5.21.0 with Spark 2.4.0 and AWS Glue as the metastore, I'm unable to create a bucketed table using the syntax below: CREATE TABLE TABLE_NAME USING PARQUET PARTITIONED BY (abc) CLUSTERED BY (abc) SORTED BY (abc) INTO 50 buckets OPTIONS…

apache-spark amazon-emr aws-glue

asked May 26 '19 at 19:49

Atif

vote

2 answers

EMR always gives me Class Not Found for Scala app

Hi, I wanted to test out the EMR custom step feature. I created a simple two-class Scala application which writes a text file to S3. Here is the tree: ├───src ├───main │ └───scala │ └───com │ └───myorg …

scala jar arguments amazon-emr

asked May 13 '19 at 13:25

3nomis
