
I've created a Substrait query plan using Ibis and stored it as a .proto file. Then I executed it successfully with DuckDB after reading the matching tables (following the ibis-substrait tutorial).

Then I read the same tables using PyArrow and tried to execute the same plan, and it failed. I'm wondering what I did wrong. Here's what I've done.

First, I defined the tables with ibis:

import ibis

ratings = ibis.table(
    [
        ("tconst", "str"),
        ("averageRating", "str"),
        ("numVotes", "str"),
    ],
    name="ratings",
)

ratings = ratings.select(
    tconst=ratings.tconst,
    avg_rating=ratings.averageRating.cast("float"),
    num_votes=ratings.numVotes.cast("int"),
)

basics = ibis.table(
    [
        ("tconst", "str"),
        ("titleType", "str"),
        ("primaryTitle", "str"),
        ("originalTitle", "str"),
        ("isAdult", "str"),
        ("startYear", "str"),
        ("endYear", "str"),
        ("runtimeMinutes", "str"),
        ("genres", "str"),
    ],
    name="basics",
)

basics = basics.filter([basics.titleType == "movie", basics.isAdult == "0"]).select(
    "tconst",
    "primaryTitle",
    "startYear",
)


Then I created the abstract query plan:

from ibis import _
from ibis_substrait.compiler.core import SubstraitCompiler

topfilms = (
    ratings.join(basics, "tconst")
    .order_by([_.avg_rating.desc(), _.num_votes.desc()])
    .filter(_.num_votes > 200_000)
    .limit(10)
)

compiler = SubstraitCompiler()
plan = compiler.compile(topfilms)

with open("topfilms.proto", "wb") as f:
    f.write(plan.SerializeToString())

Afterwards, when I read IMDb's ratings and basics tables, I was able to execute the plan using DuckDB:

# duckdb_conn connects to pre-defined database
# with both tables pre-configured

duckdb_conn.install_extension("substrait")
duckdb_conn.load_extension("substrait")

with open("topfilms.proto", "rb") as f:
    plan_blob = f.read()

plan_result = duckdb_conn.from_substrait(plan_blob)
# successful results

Then I loaded the same tables using PyArrow, read the .proto file, and tried to execute it (following the PyArrow documentation here):

import pyarrow as pa
import pyarrow.csv  # needed so that pa.csv is available
import pyarrow.substrait as substrait

pa_ratings = pa.csv.read_csv("../datasets/ratings.csv")
pa_basics = pa.csv.read_csv("../datasets/basics.csv")

def table_provider(names, schema):
    if not names:
        raise Exception("No names provided")
    elif names[0] == "ratings":
        return pa_ratings
    elif names[0] == "basics":
        return pa_basics
    else:
        raise Exception("Unrecognized table name")
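(Aside: a dict-based lookup is less error-prone than chained `elif`s here, since it forces every table name to be matched against `names[0]`. A minimal sketch, with placeholder objects standing in for the real PyArrow tables:)

```python
# Sketch of a dict-based table provider. pa_ratings / pa_basics are
# placeholders standing in for the tables read with pa.csv.read_csv.
pa_ratings = object()
pa_basics = object()

TABLES = {"ratings": pa_ratings, "basics": pa_basics}

def table_provider(names, schema):
    # names is the qualified table name from the plan; this plan uses
    # single-component names, so names[0] is the table name.
    if not names:
        raise ValueError("No names provided")
    try:
        return TABLES[names[0]]
    except KeyError:
        raise ValueError(f"Unrecognized table name: {names[0]!r}")
```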

with open("topfilms.proto", "rb") as f:
    plan_bytes = f.read()

reader = pa.substrait.run_query(plan_bytes, table_provider=table_provider)
reader.read_all()

and it failed with:

ArrowNotImplementedError: conversion to arrow::acero::Declaration from Substrait relation fetch

What am I doing wrong?

When I copy-paste the simple example from the PyArrow docs it works as expected (so I assume there are no installation issues). I also tried using the same JSON format and casting it to bytes as done in the simple example, but I still got the same exception.

Omri
  • It might be the `.limit(10)` that gets translated into a "fetch" relation, and this step of the substrait plan is not yet implemented by the pyarrow (acero) substrait consumer. – joris Jun 01 '23 at 11:02
  • Thanks for the comment. I removed the limit, and it was successful again with duckDB, but I encountered an ArrowNotImplementedError when trying to use arrow. It's a bit strange since I have seen a couple of talks claiming that this integration had already been implemented... – Omri Jun 01 '23 at 12:13
  • As of Arrow 11, the Arrow substrait consumer did not have support for sorting (`order_by`) -- I don't know if that has been implemented yet. – Gil Forsyth Jun 01 '23 at 13:24
  • Thanks! I'm using Arrow 12 and tried to scroll through the code, but wasn't able to find the current implementation status. I guess this is still the issue, if so. – Omri Jun 01 '23 at 13:31

0 Answers