
I have a table in Redshift and I am trying to create an ETL job in AWS Glue Studio, but the data in the table is too large. Even when I apply a filter transform in Glue (both the SQL filter and the filter transform option), it still brings all the data into memory and the job times out after a while. Any idea how I can directly query a small sample from the table?

I need to fetch data from a Redshift table and apply transformations to it, but the Glue job takes too long due to the size of the data, and the filter options are not working as expected.

1 Answer


You are correct: a filter still scans the full dataset to evaluate its predicate, so it does not reduce how much data is pulled in. If you only need a small sample, use limit instead. Check this solution:

How can I select a stable subset of rows from a Spark DataFrame?

This example is in Scala on Databricks; substitute your own format and load calls.

// Read the source data; replace the format and path with your own source
val df = spark.read.format("com.databricks.spark.csv").load("some_file").toDF("c1", "c2", "c3")

// Take only the first 1000 rows instead of filtering the whole dataset
val df_small = df.limit(1000)
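
For the Redshift case specifically, you can also push the LIMIT into the query itself so that only the sample ever leaves the cluster. Below is a minimal sketch using a plain Spark JDBC read; the endpoint, credentials, driver class, table name, and row count are placeholders you would replace with your own values.

// A minimal sketch, assuming a plain Spark JDBC read against Redshift.
// All connection details and names below are placeholders.
val sampleDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://your-cluster:5439/your_db")     // placeholder endpoint
  .option("user", "your_user")                                     // placeholder credentials
  .option("password", "your_password")
  .option("driver", "com.amazon.redshift.jdbc42.Driver")           // Redshift JDBC 4.2 driver, if available on the job
  // Wrapping the LIMIT in a subquery makes Redshift return only 1000 rows,
  // so the sample is cut down before it ever reaches the Glue/Spark executors.
  .option("dbtable", "(SELECT * FROM your_table LIMIT 1000) AS sample")
  .load()

The key point is that the LIMIT runs inside Redshift, unlike a filter or limit applied after the load, which only takes effect once the data has already been read by the job.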