1

Suppose I use DuckDB with python, for querying an Apache parquet file test.pq with a table containing two columns f1 and f2.

r1 = duckdb.query("""
SELECT f1 FROM parquet_scan('test.pq') WHERE f2 > 1
""")

Now I would like to use r1 result in another query, like:

duckdb.query("""SELECT * FROM r1 WHERE f1 > 10""")

However, last instruction gives: RuntimeError: Catalog Error: Table with name r1 does not exist!

Am I missing a DuckDB method equivalent to Apache Spark registerTempTable() ?

M. Page
  • 2,694
  • 2
  • 20
  • 35

1 Answers1

3
r1 = duckdb.query("""
SELECT f1 FROM parquet_scan('test.pq') WHERE f2 > 1
""") 

Does not create a table called r1, but actually creates a relation that is nothing else than an execution plan. Hence if you call an execute on that, would actually execute a query that scans the parquet file

result = r1.execute()

If you want to query it as a table, you basically have two options.

  1. You create a view from your relation
r1.create_view('table_name')
  1. You change your SQL query to create a duckdb table
conn = duckdb.connect()
conn.execute("create table t as SELECT f1 FROM parquet_scan('test.pq') where f2 > 1 ")

Note that in 1 you will actually load the parquet data to a Duck table, while with 2 you will be constantly reading the parquet data itself.

Finally, if you just want to stack up filters, then you could do:

r2 = r1.filter("f1>10")

There is more info on the Python Relational Api on Duckdb's website, more specifically at: https://duckdb.org/docs/api/python

https://github.com/duckdb/duckdb/blob/master/examples/python/duckdb-python.py

Hopefully, that was helpful! ;-)

Pedro Holanda
  • 261
  • 1
  • 3
  • Isn't it exactly the other way around, creating a view as in 1 will read the parquet again and again for each query, and 2 will create and in-memory duckdb table and query that? Otherwise, this would completely counter to the intuition I have for a "view". – Wisperwind Jul 06 '22 at 09:35