Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
1
vote
1 answer

Spark as execution engine with Hive

Can Spark 2.4.2 be used as an execution engine with Hive 2.3.4 on Amazon EMR? I have linked the jar files with Hive (scala-library, spark-core, spark-network-common) via the following commands: cd $HIVE_HOME/lib ln -s…
Shubham Gupta
1
vote
1 answer

Load only part of HBase/Phoenix table as Spark DataFrame

I am using the following code in Spark to load specified columns of my HBase/Phoenix table into a Spark DataFrame. I can specify the columns I want to load, but can I specify which rows? Or do I have to load all rows? import…
1
vote
1 answer

Error 403 while creating EMR cluster using my reducer and mapper?

I am trying to use my bucket to supply the arguments for EMR to create a cluster, but it is giving me "All access to this object has been disabled (Service: Amazon S3; Status Code: 403; Error Code: AllAccessDisabled;" I have used my Reducer and…
Ryan Terry
1
vote
0 answers

Performance issue while converting PySpark DataFrame to JSON

I would like to insert PySpark DataFrame content into Redis efficiently. I have tried a couple of methods, but none of them gives the expected results. Converting the DataFrame to JSON takes 30 seconds. The goal is to SET the JSON payload into a Redis cluster for…
1
vote
1 answer

Why is AWS CloudFormation throwing "Encountered unsupported property InstanceGroups"?

When I deploy the AWS CloudFormation script below, I get the following error: "Encountered unsupported property InstanceGroups". I have used InstanceGroups in the past without any issues. Here is an example of how others are using it:…
1
vote
2 answers

Copy files from S3 to EMR local using Lambda

I need to move files from S3 to EMR's local directory /home/hadoop programmatically using Lambda. S3DistCp copies them to HDFS. I then log in to EMR and run an HDFS copyToLocal command on the command line to get the files to /home/hadoop. Is there a…
Rhiya
1
vote
2 answers

How to assign a JSON file as STEP in an EMR cluster in Terraform?

I'm building an EMR cluster in Terraform, and in the STEP argument I want to load a JSON file that describes the list of steps. I tried this in my main.tf: ressource "aws_emr" "emr" { ... ... step =…
user1297406
1
vote
1 answer

AWS EMR Presto job

Is it possible to submit Presto jobs/steps to an EMR cluster, just as you can submit Hive jobs/steps via a script in S3? I would rather not SSH to the instance to execute the commands, but do it automatically.
KOT
1
vote
0 answers

Amazon EMR w/ Hadoop 3.1

I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs rather slowly on EMR compared with, say, Databricks. I realize that if I were able to use Hadoop 3.1, it would be much more performant because it has a high performance output…
femibyte
1
vote
0 answers

DynamoDB EMR Hive Connector writes 1 item at a time

While writing to DynamoDB with on-demand capacity using hive > INSERT OVERWRITE TABLE t SELECT * FROM s3data; I notice that it writes 1 item at a time, which is evident from the write capacity graph below. Here are the settings: SET…
Somasundaram Sekar
1
vote
0 answers

DynamoDB EMR Integration + WARN Task Calculator Warning- Map tasks is less than 1

I'm trying to integrate DynamoDB with Spark on EMR using the solution provided in the AWS blog: https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark I'm able to retrieve the results as expected. But always the Task…
1
vote
0 answers

Correct Spark configuration to fully utilise EMR cluster resources

I'm quite new to configuring Spark, so I wanted to know whether I am fully utilising my EMR cluster. The EMR cluster is using Spark 2.4 and Hadoop 2.8.5. The app reads loads of small gzipped JSON files from S3, transforms the data and writes them back…
User
1
vote
0 answers

How to choose custom AMI for EMR via Airflow

I'm spinning up an EMR cluster via Airflow and running a PySpark job on it. I want to use a custom AMI to boot up the cluster via Airflow. I'm following the boto3 syntax found in the docs online, but the AMI is not being picked up. Is there something…
1
vote
0 answers

Exception when trying to create bucketed table using Spark with AWS Glue as Metastore

On EMR 5.21.0 with Spark 2.4.0 and AWS Glue as the metastore, I'm unable to create a bucketed table using the syntax below: CREATE TABLE TABLE_NAME USING PARQUET PARTITIONED BY (abc) CLUSTERED BY (abc) SORTED BY (abc) INTO 50 buckets OPTIONS…
Atif
1
vote
2 answers

EMR always gives me Class Not Found for Scala app

Hi, I wanted to test out the EMR custom step feature. I created a simple two-class Scala application that writes a text file to S3. Here is the tree: ├───src ├───main │ └───scala │ └───com │ └───myorg …
3nomis