Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
1
vote
0 answers

MapReduce job with AWS Elastic MapReduce (EMR) - why was 648 MB of input split into 27 map tasks?

I used AWS EMR (Hadoop streaming) to process 648 MB of input data in 9 text files (approx. 72 MB each, stored in S3). I thought it would split the data into either 64 MB or 128 MB blocks, but the log says it was split into 27 map tasks (I think one map…
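A plausible explanation, sketched below: splits are computed per file from the filesystem block size, and the old S3-native (s3n) default block size on EMR was 32 MB (an assumption here; check `fs.s3n.block.size` on the cluster), which yields exactly 27 map tasks for this input:

```python
import math

# Numbers from the question: 9 text files of ~72 MB each.
file_size_mb = 72
num_files = 9

# Assumed fs.s3n.block.size of 32 MB (the historical S3-native
# filesystem default). Splits are computed per file, so a split
# never spans file boundaries.
block_size_mb = 32

splits_per_file = math.ceil(file_size_mb / block_size_mb)  # ceil(72/32) = 3
total_map_tasks = num_files * splits_per_file              # 9 * 3 = 27
print(splits_per_file, total_map_tasks)  # → 3 27
```

If the observed task count matches this arithmetic, raising the block-size setting (or using larger input files) would reduce the number of map tasks.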
1
vote
1 answer

AWS EMR: Spark - SparkException java IOException: Failed to create local dir in /tmp/blockmgr*

I have an AWS EMR cluster with Spark. I can connect to it (Spark): from the master node, after SSHing into it; from another AWS EMR cluster. But I am NOT able to connect to it: from my local machine (macOS Mojave); from non-EMR machines like Metabase and…
user954311
  • 41
  • 3
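One frequent cause of `Failed to create local dir in /tmp/blockmgr*` is that the machine the driver runs on (here, the non-EMR hosts) has no writable scratch directory where Spark expects one. A hedged config sketch, assuming a writable path such as `/mnt/spark-local` exists on every machine involved:

```
# spark-defaults.conf — point Spark's scratch space (block manager,
# shuffle spill) at a volume that exists and is writable on every
# node; the path here is an example, not a required location
spark.local.dir /mnt/spark-local
```

Checking free space and permissions on whatever `spark.local.dir` resolves to is usually the first diagnostic step.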
1
vote
0 answers

Trigger python script on EMR from Lambda on S3 Object arrival with Object details

I am trying to trigger a lambda function object arrival in s3 alonng with object details like name and path. then trigger python script on EMR which will access the file which is on s3. Please let me know how i can trigger python script (may within…
RajaR
  • 11
  • 4
1
vote
2 answers

How to handle big reference data in Spark

I have a big data set (let's say 4 GB) that is used as a reference source to process another big data set (100-200 GB). I have a cluster of 30 executors to do that on 10 nodes. So every executor has its own JVM, right? Every time it loads the whole reference…
jk1
  • 593
  • 6
  • 16
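On the "load it once per executor" point: Spark's own mechanism for this is a broadcast variable (`sc.broadcast(ref)`), which ships the reference data to each executor JVM once instead of once per task. The same idea in plain Python, as a minimal process-local cache sketch (`load_reference` is a hypothetical stand-in for reading the real 4 GB data set):

```python
import functools

load_count = 0  # counts how many times the reference data is actually read

@functools.lru_cache(maxsize=1)
def load_reference():
    """Stand-in for loading the 4 GB reference set; cached per process."""
    global load_count
    load_count += 1
    return {"key-%d" % i: i for i in range(1000)}  # placeholder data

def process_record(record):
    ref = load_reference()  # cached after the first call in this process
    return ref.get(record, -1)

results = [process_record("key-%d" % i) for i in range(5)]
print(results, load_count)  # → [0, 1, 2, 3, 4] 1
```

Inside Spark, the equivalent is reading `broadcast_var.value` in the task, or loading once per partition inside `mapPartitions`, so the cost is paid per executor (or per partition), not per record.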
1
vote
2 answers

Usage of StreamingFileSink is throwing NoClassDefFoundError

I know this could be my problem, but I've been trying to grapple with it for a while. I am trying to run Flink in an AWS EMR cluster. My setup is: time-series events from Kinesis -> Flink job -> save to S3. DataStream kinesis = …
Nischit
  • 161
  • 1
  • 8
1
vote
0 answers

How to kill all Spark processes from within mapPartitions running on EMR slave nodes?

We are running pyspark in an EMR cluster and have ~50 million records in a dataframe. Each needs a field added to it from an API, which accepts 100 records at a time (so ~500k total requests). We are able to split them up and make the API calls…
kylerm42
  • 53
  • 4
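For the "100 records per request" part, a batching helper like the sketch below (plain Python, no Spark required) is the usual shape inside `mapPartitions`. For the "kill everything" part, `sc.cancelAllJobs()` on the driver cancels running Spark jobs, but it cannot be called from inside executors, so a failure typically has to propagate back to the driver first.

```python
from itertools import islice

def batches(iterable, size=100):
    """Yield successive lists of at most `size` items — e.g. so each
    partition issues one API call per 100 rows instead of per row."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# 250 fake records -> batches of 100, 100, 50
sizes = [len(b) for b in batches(range(250), 100)]
print(sizes)  # → [100, 100, 50]
```

With ~50 million records this gives the ~500k requests the question estimates; wrapping the API call in a retry with a hard failure after N attempts is the usual way to surface a permanent error to the driver.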
1
vote
1 answer

Accessing Hive Tables with Spark SQL

I've set up an AWS EMR cluster that includes Spark 2.3.2, Hive 2.3.3, and HBase 1.4.7. How can I configure Spark to access Hive tables? I've taken the following steps, but the result is the error message: java.lang.ClassNotFoundException:…
Ari
  • 4,121
  • 8
  • 40
  • 56
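A hedged configuration sketch: Spark SQL only consults the Hive metastore when its catalog implementation is `hive` (in code, the equivalent is building the session with `SparkSession.builder.enableHiveSupport()`). Assuming EMR's default metastore setup:

```
# spark-defaults.conf — use the Hive metastore as Spark SQL's catalog
spark.sql.catalogImplementation hive
```

If the session is built without Hive support, Spark falls back to its in-memory catalog and Hive tables are simply invisible; a `ClassNotFoundException` at that point usually means the Hive (or HBase storage-handler) jars are missing from the Spark classpath rather than a metastore problem.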
1
vote
1 answer

Exceptions while running Spark job on EMR cluster "java.io.IOException: All datanodes are bad"

We have an AWS EMR setup to process jobs written in Scala. We are able to run the jobs on a small dataset, but running the same job on a large dataset throws the exception "java.io.IOException: All datanodes are bad."
Devendra Parhate
  • 135
  • 1
  • 2
  • 12
1
vote
0 answers

S3 Bucket Policy to allow access to specific AWS services and users and restrict all others

I have a bucket policy that restricts access for other users, but I want it to remain accessible to AWS services like EMR. I found the same question asked here: S3 Bucket Policy to Allow access to specific users and restrict all. But I…
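One common shape for this, sketched with placeholder account and principal names (the ARNs below are examples, not a tested policy): deny everything to any principal whose ARN is not on an allow list, which lets the IAM role the EMR instances run under (e.g. the default `EMR_EC2_DefaultRole` instance profile) keep access:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyAllExceptListedPrincipals",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {
      "StringNotLike": {
        "aws:PrincipalArn": [
          "arn:aws:iam::111122223333:role/EMR_EC2_DefaultRole",
          "arn:aws:iam::111122223333:user/allowed-user"
        ]
      }
    }
  }]
}
```

The `aws:PrincipalArn` condition key matches the role ARN even for assumed-role sessions, which is why listing the EMR instance role is enough; an explicit Deny like this overrides any Allow, so the allow list must be complete before applying it.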
1
vote
2 answers

Loading parquet file from S3 to DynamoDB

I have been looking at options to load (basically empty and restore) a Parquet file from S3 to DynamoDB. The Parquet file itself is created via a Spark job that runs on an EMR cluster. Here are a few things to keep in mind: I cannot use AWS Data pipeline File…
1
vote
0 answers

Samza 1.1.0 - run-app.sh does not work during deployment of hello samza

I am facing errors when I deploy the hello samza tutorial on YARN following the documentation. In particular, I was getting errors when I ran the run-app.sh script as mentioned. I am currently using Samza 1.1.0 on AWS EMR (emr - 5.13.0, amazon 2.8.3,…
Harsha
  • 11
  • 3
1
vote
1 answer

Get aws EMR DNS address using CLI

I am trying to set up some easy code to run when spinning up an EMR cluster for some ad hoc work I have to do from time to time. Right now I pass the 'aws emr create-cluster' command and then find the DNS in the console once the cluster is created, to…
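A CLI-only sketch (instance settings are placeholders, not a recommendation): capture the cluster id from `create-cluster`, wait for the cluster to come up, then query the master's public DNS name, so nothing has to be looked up in the console:

```shell
# Create the cluster and capture its id (fill in your own instance
# types, key pair, subnet, etc.)
CLUSTER_ID=$(aws emr create-cluster \
    --name "adhoc" \
    --release-label emr-5.13.0 \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles \
    --query ClusterId --output text)

# Block until the cluster reaches a running state
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"

# Print the master node's public DNS name
aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
    --query Cluster.MasterPublicDnsName --output text
```

The `--query`/`--output text` flags are standard AWS CLI JMESPath filtering, so the DNS name comes back as a bare string suitable for scripting (e.g. feeding straight into an `ssh` command).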
1
vote
1 answer

How to efficiently read/parse loads of .gz files in a s3 folder with spark on EMR

I'm trying to read all files in a directory on S3 via a Spark app that's executing on EMR. The data is stored in a typical format like "s3a://Some/path/yyyy/mm/dd/hh/blah.gz". If I use deeply nested wildcards (e.g.…
User
  • 168
  • 1
  • 3
  • 19
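One way around slow deep wildcards, sketched below against the question's layout: expand the date range into explicit hourly prefixes in plain Python and hand Spark the concrete list (e.g. joined with commas, since `textFile` accepts comma-separated paths), so S3 listing only touches the prefixes that actually matter.

```python
from datetime import datetime, timedelta

def hourly_paths(prefix, start, end):
    """Expand an hourly-partitioned layout into explicit path prefixes,
    avoiding a deep wildcard like s3a://Some/path/*/*/*/*/*.gz."""
    t = start
    while t <= end:
        yield t.strftime(prefix + "/%Y/%m/%d/%H/")
        t += timedelta(hours=1)

paths = list(hourly_paths("s3a://Some/path",
                          datetime(2019, 1, 1, 22),
                          datetime(2019, 1, 2, 1)))
print(paths[0])   # → s3a://Some/path/2019/01/01/22/
print(len(paths)) # → 4 (22:00 through 01:00 inclusive)
```

Passing `",".join(paths)` to the read call then restricts listing to those four prefixes instead of recursively globbing the whole tree.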
1
vote
1 answer

Problem importing modules from a .zip file (created in python using zipfile package) with --py-files on an EMR in Spark

I am trying to archive my application modules to spark-submit on an EMR cluster like this: Folder structure of modules: app --- module1 ------ test.py ------ test2.py --- module2 ------ file1.py ------ file2.py Zip function I'm calling from…
Collin Rea
  • 23
  • 5
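A self-contained sketch of the usual pitfall (file names mirror the question's layout): entries in the zip must be stored with package-relative arcnames, and each package needs an `__init__.py`, or imports from the archive fail on the executors:

```python
import os
import sys
import tempfile
import zipfile

# Recreate a tiny version of the question's app/ layout on disk
root = tempfile.mkdtemp()
pkg = os.path.join(root, "module1")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()  # mark as a package
with open(os.path.join(pkg, "test.py"), "w") as f:
    f.write("VALUE = 42\n")

# Build the archive with arcnames relative to the app root
# (module1/test.py), NOT absolute paths — this is the common mistake.
archive = os.path.join(root, "app.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for dirpath, _, files in os.walk(pkg):
        for name in files:
            full = os.path.join(dirpath, name)
            zf.write(full, arcname=os.path.relpath(full, root))

# Python can import straight from the zip, which is exactly what Spark
# executors do with a --py-files archive added to their path
sys.path.insert(0, archive)
from module1 import test
print(test.VALUE)  # → 42
```

If `zipfile.ZipFile(archive).namelist()` shows absolute or machine-specific paths instead of `module1/test.py`, the `--py-files` imports will break the same way they do locally with this test.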
1
vote
1 answer

How to change the Apache Zeppelin UI appearance and make edits to elements

I'm currently running Apache Zeppelin 0.7.2 on an AWS EMR machine. Is there any way to replace the Zeppelin logo and words at the top with other text and images? I tried to use the Inspect Element feature in Chrome on the Zeppelin webpage and…