0

I have a pandas dataframe with a multiindex. How can I reference the indexes in a duck db query?

import duckdb
import pandas as pd
import numpy as np

df = pd.DataFrame({
  'i1': np.arange(0, 100),
  'i2': np.arange(0, 100),
  'c': np.random.randint(0 , 10, 100)
}).set_index(['i1', 'i2'])

>>> duckdb.query('select sum(c) from df group by i1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Binder Error: Referenced column "i1" not found in FROM clause!
Candidate bindings: "df.c"
fny
  • 31,255
  • 16
  • 96
  • 127

1 Answers1

1

not an answer (since I'm looking for one as well), but may still help. I think DuckDB may not recognize any index. If you do this:

rel = conn.from_df(df)
rel.create("a_table")

result = conn.execute("select * from a_table").fetch_df()

You will see that the result data frame only has the c column without i1 or i2. The same is true if you have just a simple index, not a multi-index. The workaround is to simply reset the index before you use the data frame in DuckDB.

See the official issue discussion as well: https://github.com/duckdb/duckdb/issues/1011

Steinway Wu
  • 1,288
  • 1
  • 12
  • 18