Questions tagged [spark-redshift]

28 questions
0
votes
1 answer

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data, this puts a lot of pressure on the data warehouse and it is less and…
val
0
votes
1 answer

Unload all tables from Redshift to S3 - CPU usage

The goal is to unload a few tables (for each customer) every few hours to S3 in Parquet format. Each table is around 1 GB in CSV format and around 120 MB in Parquet. The issue is that when running 2-3 parallel UNLOAD commands, the CPU of the Redshift nodes…
0
votes
1 answer

Is it possible to load partitioned parquet files using Redshift COPY command?

For the sake of example, let's say I have a Parquet file in S3 partitioned by the column date, laid out as s3://my_bucket/path/my_table/date=*; when I load the table using Spark, for example, it shows the…
Henrique Florencio
0
votes
2 answers

Redshift external catalog error when copying parquet from s3

I am trying to copy Google Analytics data into Redshift in Parquet format. When I limit the columns to a few select fields, I am able to copy the data, but when I include a few specific columns I get an error: ERROR: External Catalog Error. Detail:…
0
votes
1 answer

EMR PySpark write to Redshift: java.sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only

I got an error when trying to write data to Redshift using PySpark on an EMR cluster. df.write.format("jdbc") \ .option("url", "jdbc:redshift://clustername.yyyyy.us-east-1.redshift.amazonaws.com:5439/db") \ .option("driver",…
0
votes
1 answer

How to optimize ETL data pipeline for fault tolerance when using Spark and Redshift?

I'm writing a big batch job using PySpark that ETLs 200 tables and loads them into Amazon Redshift. These 200 tables are created from one input data source, so the batch job is successful only when data is loaded into ALL 200 tables successfully. The…
snackbar
0
votes
0 answers

AWS, dotnet spark, and Redshift are not working

I am having problems getting Redshift and dotnet spark working. This is the configuration I use to run it in debug mode: C:\bin\spark-2.4.1-bin-hadoop2.7\bin\spark-submit.cmd ` --jars…
0
votes
1 answer

Are the spark-redshift libraries open-source/free to use, or do they have to be licensed via Databricks?

I want to use the spark-redshift libraries to write data from AWS S3 to AWS Redshift using the following code. Before using them, I would like to know whether the spark-redshift libraries are open-source/free to use or have to be licensed via…
Sow
0
votes
0 answers

Can we create a day-wise snapshot in the target database (Redshift) as rows using Debezium?

Can we create a day-wise snapshot of a table in the target database as rows using Debezium?
user2322440
0
votes
1 answer

Apache Spark 2.4.0, AWS EMR, Spark Redshift and User class threw exception: java.lang.AbstractMethodError

I use Apache Spark 2.4.0, AWS EMR and Spark Redshift, and I have run into the following error while reading a Redshift table into a Spark DataFrame: User class threw exception: java.lang.AbstractMethodError at…
alexanoid
-1
votes
1 answer

Redshift SQL query for reducing years

I have data with the fields shown below:

id   grade  grade_id  year  Diff
101  5      7         2022  9
105  k      2         2021  2
106  4      6         2020  5
110  pk     1         2022  1

I want to insert records for the same id until we reach grade = pk, as shown below for every record in…
-1
votes
1 answer

How to connect from locally installed Spark to AWS Redshift?

I downloaded the necessary libraries to connect to Redshift from a locally installed Spark cluster and launched pyspark with the command below, but I am getting the error message below. pyspark --conf…
john
-2
votes
1 answer

Load data from Redshift using Spark and Scala in EMR

I am trying to connect to Redshift using Spark with Scala in Zeppelin from an EMR cluster. I used the spark-redshift library, but it doesn't work. I tried many solutions and I don't know why it gives an error: val df = spark.read…