When I run a query like "select count(x),y group by y", Calcite does all the calculations in memory, so given enough data it can run out of memory. Is there a way to do aggregations using some other storage? There is a Spark option, but when I enable it I get a NullPointerException. Is that option meant to use Spark to calculate the results, and how does it work?
1 Answer
I would like to share my understanding of this.

Firstly, Calcite is a data management framework that specialises in SQL optimisation, so it primarily focuses on figuring out the best execution plan. There are quite a few adapters for Calcite, and you can of course choose to push the aggregation down to the backend to execute, e.g. push the aggregation down to a backend MySQL database.
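As a sketch of what such push-down looks like: Calcite's JDBC adapter is configured through a model JSON file, and queries against that schema can be pushed down to the backend database. The driver class, URL, and credentials below are placeholders, not values from the original answer:

```json
{
  "version": "1.0",
  "defaultSchema": "SALES",
  "schemas": [
    {
      "name": "SALES",
      "type": "jdbc",
      "jdbcDriver": "com.mysql.cj.jdbc.Driver",
      "jdbcUrl": "jdbc:mysql://localhost/sales",
      "jdbcUser": "user",
      "jdbcPassword": "password"
    }
  ]
}
```

With a model like this, an aggregation over a table in the SALES schema can be executed by MySQL itself rather than materialised in Calcite's memory.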
In the case of the CSV adapter, I do think Calcite generates the execution details to run the aggregation itself. As you suggested, it probably happens all in memory, so if the CSV file is large enough, there will be an OOM.
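To make the memory concern concrete, here is a minimal sketch (not Calcite's actual code) of what an in-memory hash aggregation for "select count(x), y ... group by y" boils down to: one map entry per distinct group key, so memory use grows with the number of distinct keys rather than with the input size alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HashAggSketch {
    // Count rows per group key, the way a hash-based GROUP BY does:
    // every distinct value of y allocates a bucket that stays live
    // until the whole input has been consumed.
    static Map<String, Long> countByGroup(String[] ys) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String y : ys) {
            counts.merge(y, 1L, Long::sum); // one bucket per distinct y
        }
        return counts;
    }

    public static void main(String[] args) {
        // "select count(*), y from (values 'a','b','a','a') group by y"
        System.out.println(countByGroup(new String[] {"a", "b", "a", "a"}));
        // prints {a=3, b=1}
    }
}
```

If the grouped column has high cardinality, the map above is exactly what exhausts the heap; pushing the aggregation to a backend (or spilling to disk) avoids holding all buckets in one JVM.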
And yes, the Spark option, if turned on, will enable Calcite to generate Spark code instead of plain Java code to execute the physical plan, and I assume it will, to some extent, solve the OOM you mentioned. Unfortunately, I haven't found an official introduction to running Calcite on Spark, other than some test specs:
CalciteAssert.that()
    .with(CalciteAssert.Config.SPARK)
    .query("select *\n"
        + "from (values (1, 'a'), (2, 'b'))")
    .returns("EXPR$0=1; EXPR$1=a\n"
        + "EXPR$0=2; EXPR$1=b\n")
    .explainContains("SparkToEnumerableConverter\n"
        + "  SparkValues(tuples=[[{ 1, 'a' }, { 2, 'b' }]])");
