DuckDB python API: query composition

Question

Suppose I use DuckDB with python, for querying an Apache parquet file test.pq with a table containing two columns f1 and f2.

r1 = duckdb.query("""
SELECT f1 FROM parquet_scan('test.pq') WHERE f2 > 1
""")

Now I would like to use r1 result in another query, like:

duckdb.query("""SELECT * FROM r1 WHERE f1 > 10""")

However, last instruction gives: RuntimeError: Catalog Error: Table with name r1 does not exist!

Am I missing a DuckDB method equivalent to Apache Spark registerTempTable() ?

score 3 · Accepted Answer · answered Jul 30 '21 at 12:54

r1 = duckdb.query("""
SELECT f1 FROM parquet_scan('test.pq') WHERE f2 > 1
""")

Does not create a table called r1, but actually creates a relation that is nothing else than an execution plan. Hence if you call an execute on that, would actually execute a query that scans the parquet file

result = r1.execute()

If you want to query it as a table, you basically have two options.

You create a view from your relation

r1.create_view('table_name')

You change your SQL query to create a duckdb table

conn = duckdb.connect()
conn.execute("create table t as SELECT f1 FROM parquet_scan('test.pq') where f2 > 1 ")

Note that in 1 you will actually load the parquet data to a Duck table, while with 2 you will be constantly reading the parquet data itself.

Finally, if you just want to stack up filters, then you could do:

r2 = r1.filter("f1>10")

There is more info on the Python Relational Api on Duckdb's website, more specifically at: https://duckdb.org/docs/api/python

https://github.com/duckdb/duckdb/blob/master/examples/python/duckdb-python.py

Hopefully, that was helpful! ;-)

Isn't it exactly the other way around, creating a view as in 1 will read the parquet again and again for each query, and 2 will create and in-memory duckdb table and query that? Otherwise, this would completely counter to the intuition I have for a "view". — Wisperwind, Jul 06 '22 at 09:35

DuckDB python API: query composition

1 Answers1