I've created Substrait
query plan using Ibis
, and stored it as .proto
file. Then I successfully executed it using duckDB
after reading the matching tables (following the ibis-substrait tutorial)
Then I read the same tables using pyarrow
and wanted to execute the same plan - and it was failing. I'm wondering what I've done wrong. Here's what I've done.
First, I defined the tables with ibis:
import ibis
ratings = ibis.table(
[
("tconst", "str"),
("averageRating", "str"),
("numVotes", "str"),
],
name="ratings",
)
ratings = ratings.select(
tconst=ratings.tconst,
avg_rating = ratings.averageRating.cast('float'),
num_votes=ratings.numVotes.cast("int")
)
basics = ibis.table(
[
("tconst", "str"),
("titleType", "str"),
("primaryTitle", "str"),
("originalTitle", "str"),
("isAdult", "str"),
("startYear", "str"),
("endYear", "str"),
("runtimeMinutes", "str"),
("genres", "str"),
],
name="basics",
)
basics = basics.filter([basics.titleType == "movie", basics.isAdult == "0"]).select(
"tconst",
"primaryTitle",
"startYear",
)
Then created abstract query plan:
from ibis import _
from ibis_substrait.compiler.core import SubstraitCompiler
topfilms = (
ratings.join(basics, "tconst")
.order_by([_.avg_rating.desc(), _.num_votes.desc()])
.filter(_.num_votes > 200_000)
.limit(10)
)
compiler = SubstraitCompiler()
plan = compiler.compile(topfilms)
with open("topfilms.proto", "wb") as f:
f.write(plan.SerializeToString())
Afterwards, when I read imdb's
ratings and basics table I was able to execute the plan using duckDB:
# duckdb_conn connects to pre-defined database
# with both tables pre-configured
duckdb_conn.install_extension("substrait")
duckdb_conn.load_extension("substrait")
with open("topfilms.proto", "rb") as f:
plan_blob = f.read()
plan_result = duckdb_conn.from_substrait(plan_blob)
# successful results
Then, I loaded the same tables using pyarrow
, read the .proto
file and tried to execute it (following the pyarrow
documentation here:
import pyarrow as pa
import pyarrow.substrait as substrait
pa_ratings = pa.csv.read_csv("../datasets/ratings.csv")
pa_basics = pa.csv.read_csv("../datasets/basics.csv")
def table_provider(names, schema):
if not names:
raise Exception("No names provided")
elif names[0] == "ratings":
return pa_ratings
elif names[1] == "basics":
return pa_basics
else:
raise Exception("Unrecognized table name")
with open("topfilms.proto", "rb") as f:
plan_bytes = f.read()
reader = pa.substrait.run_query(plan_bytes, table_provider=table_provider)
reader.read_all()
and it was failing with:
ArrowNotImplementedError: conversion to arrow::acero::Declaration from Substrait relation fetch
I'm wondering what am I doing wrong.
When I copy-paste the pyarrow simple example it works as expected (so no installations issues I assume). I've also tries using the same json format and casting to bytes as done in the simple example but finally I got the same exception.