Questions tagged [catalyst-optimizer]

The Catalyst optimizer makes use of standard features of the Scala programming language, such as pattern matching. At its core, Catalyst contains a general library for representing trees and sets of rules to manipulate them. On top of that sit libraries specific to relational query processing, and several rule sets that handle different phases of query execution: analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode.
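As a concrete illustration of this tree-plus-rules design, a minimal optimizer rule might look like the following (a sketch, assuming Spark's internal Catalyst APIs, which are not a stable public interface):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// A toy rule: pattern-match on the plan tree and drop any Filter node
// whose condition is the constant `true`.
object RemoveTrueFilter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, _), child) => child
  }
}

Such a rule can be injected into an existing session with spark.experimental.extraOptimizations = Seq(RemoveTrueFilter).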

27 questions
1
vote
1 answer

Spark internals: benefits of Project

I've read this question in which the OP tried to convert this logical plan:
Aggregate [sum(inc(vals#4L)) AS sum(inc(vals))#7L]
+- LocalRelation [vals#4L]
to this:
Aggregate [sum(inc_val#6L) AS sum(inc(vals))#7L]
+- Project [inc(vals#4L) AS…
Alon
  • 10,381
  • 23
  • 88
  • 152
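For context, the rewrite discussed in that question can be produced directly from the DataFrame API by projecting the UDF result before aggregating; a sketch, assuming a SparkSession in scope as spark and a stand-in inc UDF:

import org.apache.spark.sql.functions.{sum, udf}
import spark.implicits._

val inc = udf((x: Long) => x + 1)  // stand-in for the question's inc()
val df = Seq(1L, 2L, 3L).toDF("vals")

// Materialize inc(vals) in a Project first, then aggregate the projected column:
val result = df.select(inc($"vals").as("inc_val")).agg(sum($"inc_val"))
println(result.queryExecution.optimizedPlan)  // inspect the resulting logical plan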
1
vote
2 answers

spark register expression for SQL DSL

How can I access a Catalyst expression (not a regular UDF) in the Spark SQL Scala DSL API? http://geospark.datasyslab.org only allows for text-based execution:
GeoSparkSQLRegistrator.registerAll(sparkSession)
var stringDf = sparkSession.sql( """ …
Georg Heiler
  • 16,916
  • 36
  • 162
  • 292
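One commonly used bridge between raw Catalyst expressions and the Scala DSL is wrapping the expression in a Column; a sketch using a built-in Catalyst expression (the same pattern would apply to a GeoSpark expression instance), assuming a DataFrame df with a string column name and spark.implicits._ in scope:

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Upper

// Wrap a raw Catalyst Expression so it can be used like any DSL column:
val upperName = new Column(Upper($"name".expr))
df.select(upperName)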
0
votes
1 answer

Does Spark SQL optimize lower() on both sides?

Say I have this pseudo-code in Spark SQL, where t1 is a temp view built off of partitioned Parquet files in HDFS and t2 is a small lookup file used to filter said temp view:
select t1.* from t1 where exists (select * from t2 …
Radagast
  • 5,102
  • 3
  • 12
  • 27
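One way to settle this kind of question empirically is to inspect the plans Catalyst actually produces; a sketch, assuming views t1 and t2 with a common key k:

// explain(true) prints the parsed, analyzed, optimized, and physical plans;
// check whether lower() survives on both sides of the predicate.
spark.sql("""
  SELECT t1.* FROM t1
  WHERE EXISTS (SELECT 1 FROM t2 WHERE lower(t1.k) = lower(t2.k))
""").explain(true)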
0
votes
1 answer

Export a spark logical/physical plan?

Can one export a Spark logical or physical plan of a DataFrame/Dataset, serialize it and save it somewhere (as text, XML, JSON, ...), then re-import it and create a DataFrame based on it? The idea here is that I'm interested in having a metastore for Spark…
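There is no public API for round-tripping a plan back into a DataFrame, but the plan trees themselves can at least be dumped for inspection; a minimal sketch:

val qe = df.queryExecution
val optimizedJson = qe.optimizedPlan.toJSON   // Catalyst TreeNode rendered as JSON
val physicalText  = qe.executedPlan.toString  // physical plan as text
// Re-importing such a dump and rebuilding a live DataFrame from it is not
// supported out of the box.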
0
votes
1 answer

Long linear queries in Spark against a graph stored in Hive tables

Suppose I have a graph G and the following query:
(?a)-[x]-(?b)-[y]-(?c)-[z]-(?d)-[w]-(?e)-[q]-(?f)-[r]-(?g)-[s]-(?h)
where {?a, ?b, ?c, ..., ?h} are variables and {x, y, z, w, q, r, s} are arc labels. At the storage level I…
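A linear path query like this typically compiles to a chain of self-joins over the edge table; a sketch of the first hops, assuming a Hive table edges(src, label, dst):

import org.apache.spark.sql.functions.col

val edges = spark.table("edges")
def step(l: String) = edges.filter(col("label") === l)

// First three arcs of the pattern: (?a)-[x]-(?b)-[y]-(?c)-[z]-(?d)
val path = step("x").as("ab")
  .join(step("y").as("bc"), col("ab.dst") === col("bc.src"))
  .join(step("z").as("cd"), col("bc.dst") === col("cd.src"))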
0
votes
1 answer

What happened to the ability to visualize query plans in a Databricks notebook?

There is an old talk (from 2014) on YouTube where the speaker visualized a query plan right inside a Databricks notebook. Here is the screenshot: I am using Databricks Runtime 5.5 LTS ML, and whenever I try to call viz on a query plan, I get this…
mauna
  • 1,098
  • 13
  • 25
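As a fallback when notebook visualization is unavailable, the plans can still be printed as text (the Spark UI's SQL tab also renders a graphical DAG); a minimal sketch:

df.explain(true)  // parsed, analyzed, optimized, and physical plans as text
val planDump = df.queryExecution.toString  // the same dump as a string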
0
votes
1 answer

Spark optimize "DataFrame.explain" / Catalyst

I've got complex software which performs really complex SQL queries (well, not queries exactly, but Spark plans). The plans are dynamic; they change based on user input, so I can't "cache" them. I've got a phase in which Spark takes 1.5-2 min…
BiS
  • 501
  • 4
  • 17
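A common workaround for long Catalyst planning times is to truncate the plan lineage partway through the job; a sketch using checkpointing, with an assumed checkpoint directory and a hypothetical intermediateDf:

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // assumed path

// checkpoint() materializes the data and replaces the accumulated logical
// plan with a short scan, so later phases re-optimize a much smaller tree.
val truncated = intermediateDf.checkpoint()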
0
votes
1 answer

Is a select after casting a data frame to dataset optimized?

I have the following scenario:
case class A(name: String, age: Int)
val df = List(A("s", 2)).toDF
df.write.parquet("filePath")
val result = spark.read.parquet("filePath").as[A].select("age")
Is the above optimized to select only age? Upon seeing…
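Whether the Parquet scan is pruned down to age shows up in the physical plan's ReadSchema; a sketch (assuming spark.implicits._ is in scope):

// Look for ReadSchema: struct<age:int> in the output:
spark.read.parquet("filePath").as[A].select("age").explain()

// By contrast, a typed transformation deserializes whole A objects,
// which defeats column pruning:
spark.read.parquet("filePath").as[A].map(_.age).explain()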
0
votes
1 answer

Spark DataFrame how to preserve sorting and partitioning information after mapPartitions

I use DataFrame mapPartitions in a library which is loosely an implementation of the Uber Case Study. The output DataFrame has some new (large) columns, and the input DataFrame is partitioned and internally sorted before doing mapPartitions. Most…
shay__
  • 3,815
  • 17
  • 34
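To make the issue concrete: after a typed mapPartitions, Catalyst no longer knows the data is partitioned and sorted, so a downstream join or window may shuffle and sort again; a sketch with a hypothetical Dataset[Event] named events:

case class Event(key: String, ts: Long)

val prepared = events.repartition($"key").sortWithinPartitions($"key", $"ts")

// The result is physically still partitioned and sorted, but the optimizer's
// outputPartitioning/outputOrdering metadata is lost at this step:
val mapped = prepared.mapPartitions(_.map(e => e.copy(ts = e.ts + 1)))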
0
votes
0 answers

Prevent spark catalyst from optimizing and moving dynamic parallelism

I need to set spark.sql.shuffle.partitions dynamically during the execution of my Spark job. Initially it is set when starting the job, but then, after various aggregations, I need to decrease it over and over again. However, Catalyst tends to push…
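Because plans are evaluated lazily, the value of spark.sql.shuffle.partitions in effect at action time governs the whole plan; forcing materialization between stages is one way to pin different values (a sketch with a hypothetical bigDf):

spark.conf.set("spark.sql.shuffle.partitions", "2000")
// checkpoint() executes eagerly, so this aggregation really runs with
// 2000 shuffle partitions regardless of later conf changes:
val wide = bigDf.groupBy($"key").count().checkpoint()

spark.conf.set("spark.sql.shuffle.partitions", "200")
val narrow = wide.groupBy($"key").count()  // planned with 200 partitions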
0
votes
1 answer

Query Cassandra from Spark using CassandraSQLContext

I'm trying to query Cassandra from Spark using CassandraSQLContext, but I get a weird missing-dependency error. I have a Spark application like the following:
val spark: SparkSession = SparkSession.builder().appName(appName).getOrCreate()
val…
belgacea
  • 1,084
  • 1
  • 15
  • 33
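For reference, CassandraSQLContext was dropped in later versions of the Spark Cassandra Connector; the DataSource API is the usual replacement (a sketch with assumed keyspace and table names):

// Read a Cassandra table through the DataSource API instead:
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))  // assumed names
  .load()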
-1
votes
2 answers

How can one use spark Catalyst?

According to this, Spark Catalyst is an implementation-agnostic framework for manipulating trees of relational operators and expressions. I want to use Spark Catalyst to parse SQL DML and DDL statements and generate custom Scala code from them. However,…
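A starting point for the parsing half of this is Catalyst's SQL parser, which turns SQL text into an unresolved logical plan tree that can then be walked with pattern matching; a sketch using internal APIs:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

val plan = CatalystSqlParser.parsePlan("SELECT a, b FROM t WHERE a > 1")
println(plan.treeString)  // unresolved logical plan, one node per line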