0

there is a service API that allows retrieving some Books, and we would like to build a Apache Calcite table on top of this service; this API only accepts a set of IDs, and if no IDs are provided, no result is returned.

 interface BookService {
   /** Returns the books matching the given IDs.
    *
    * <p>If not IDs is provided, no result is returned.
    */
   List<Book> findBooksByIds(Set<String> ids);
 }

 record Book(String id, String title) {};

A requirement of this table is that it has to support join, and we would like to support joining by ID efficiently; but currently this is not possible, because in case of joins, no filters is pushed down to this API; for example, given the query:

WITH config (id, val) AS (
   VALUES ('a', 3), ('b', 5)
)
SELECT * FROM books b
INNER JOIN config ON b.id = config.id

Calcite produces the following plan:

LogicalProject(id=[$6], val=[$7], id=[$0], author=[$1])
  LogicalJoin(subset=[rel#58:RelSubset#6.NONE.[]], condition=[=($8, $5)], joinType=[inner])
    LogicalProject(subset=[rel#34:RelSubset#3.NONE.[]], id=[$0], author=[$1], id0=[CAST($2):VARCHAR])
      LogicalTableScan(subset=[rel#32:RelSubset#2.NONE.[]], table=[[s, books]])
    LogicalProject(subset=[rel#31:RelSubset#1.NONE.[]], EXPR$0=[$0], EXPR$1=[$1], EXPR$00=[CAST($0):VARCHAR NOT NULL])
      LogicalValues(subset=[rel#29:RelSubset#0.NONE.[]], tuples=[[{ 'a', 3 }, { 'b', 5 }]])

As you can see, this always results in an empty result, since the join condition is not transformed to filters that can be pushed down to the TableScan.

The goal is to define a rule that forces the query planner to always push down join predicates into a table scan for this books table.

Postgres, under some circumstances, generates nested-loop join plan: it first selects the rows of the table A matching a given condition, then iterates over the retrieved rows and performs a scan of table B looking for rows that match the join condition; and this seems something we would like to enforce here. Checking the set of available rules in Apache Calcite, I see the EnumerableBatchNestedLoopJoin rule, and it seems this does exactly what we want.

Some questions:

  • Is this rule the right place for looking for a solution, or do you have other suggestions?
  • How can we re-use this rule for the books table, where its access is not allowed by a full table scan (checking the test for this rule in the calcite GitHub repository, it seems that all the used tables allow a full table scan)?

Thanks!

lmarx
  • 476
  • 4
  • 9

0 Answers0