I'm using duckdb version 0.8.0
I have a CSV file located in google storage gs://some_bucket/some_file.csv
and want to load this using duckdb.
In pandas I can do pd.read_csv("gs://some_bucket/some_file.csv")
, but this doesn't seem to work in duckdb. I see that there's some documentation here: https://duckdb.org/docs/guides/import/s3_import.html, but I find that confusing as it's mainly aimed at s3
usage.
I guess that I have to run:
duckdb.sql("INSTALL httpfs;")
duckdb.sql("LOAD httpfs;")
From the documentation, I'm not sure what the parameters for :
SET s3_access_key_id='key_id';
SET s3_secret_access_key='access_key';
Would be.
How do I load a csv from google storage in duckdb?
Edit - approaches which haven't worked
I've added hmac
keys and downloaded them following guide here: https://cloud.google.com/storage/docs/authentication/managing-hmackeys#gsutil_1
import duckdb
import os
duckdb.sql("LOAD httpfs;")
hmac_access = os.getenv('GOOGLE_HMAC_ACCESS_ID')
hmac_secret = os.getenv('GOOGLE_HMAC_SECRET')
duckdb.sql(f"SET s3_access_key_id='{hmac_access}';")
duckdb.sql(f"SET s3_secret_access_key='{hmac_secret}';")
################################################################################
# approach 1
# Doesn't work - fails with:
#
# Traceback (most recent call last):
# File "duck_test.py", line 18, in <module>
# duckdb.sql("SELECT * FROM '{gcp_path_1}'").show()
# duckdb.CatalogException: Catalog Error: Table with name {gcp_path_1} does not exist!
# Did you mean "pg_am"?
# duckdb.sql(f"SELECT * FROM '{gcp_path_1}'").show()
################################################################################
# approach 2
# Fails with:
# Traceback (most recent call last):
# File "duck_test.py", line 32, in <module>
# duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")
# duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://some_bucket.s3.amazonaws.com/some_file.csv' (HTTP 400)
duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")
Edit (working)
In the code above I forgot to set
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")
After setting this both approaches read from storage.