
I'm using DuckDB version 0.8.0.

I have a CSV file in Google Cloud Storage at gs://some_bucket/some_file.csv and want to load it using DuckDB.

In pandas I can do pd.read_csv("gs://some_bucket/some_file.csv"), but this doesn't seem to work in DuckDB. There is some documentation here: https://duckdb.org/docs/guides/import/s3_import.html, but I find it confusing as it's mainly aimed at S3.

I guess that I have to run:

duckdb.sql("INSTALL httpfs;")
duckdb.sql("LOAD httpfs;")

From the documentation, I'm not sure what the values of the following parameters would be:

SET s3_access_key_id='key_id';
SET s3_secret_access_key='access_key';

How do I load a CSV from Google Cloud Storage in DuckDB?

Edit: approaches that haven't worked

I've created HMAC keys and downloaded them, following the guide here: https://cloud.google.com/storage/docs/authentication/managing-hmackeys#gsutil_1

import duckdb
import os

duckdb.sql("INSTALL httpfs;")
duckdb.sql("LOAD httpfs;")

hmac_access = os.getenv('GOOGLE_HMAC_ACCESS_ID')
hmac_secret = os.getenv('GOOGLE_HMAC_SECRET')

duckdb.sql(f"SET s3_access_key_id='{hmac_access}';")
duckdb.sql(f"SET s3_secret_access_key='{hmac_secret}';")

# Path to the file in Google Cloud Storage, using the s3:// scheme
gcp_path_1 = "s3://some_bucket/some_file.csv"


################################################################################
# approach 1

# Doesn't work - fails with:
# 
# Traceback (most recent call last):
#   File "duck_test.py", line 18, in <module>
#     duckdb.sql("SELECT * FROM '{gcp_path_1}'").show()
# duckdb.CatalogException: Catalog Error: Table with name {gcp_path_1} does not exist!
# Did you mean "pg_am"?
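# Note: the traceback shows the literal '{gcp_path_1}' because the f prefix
# was missing from the string when this was run (see Mause's comment below);
# the call below includes it.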

# duckdb.sql(f"SELECT * FROM '{gcp_path_1}'").show()


################################################################################
# approach 2

# Fails with:

# Traceback (most recent call last):
#   File "duck_test.py", line 32, in <module>
#     duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")
# duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://some_bucket.s3.amazonaws.com/some_file.csv' (HTTP 400)

duckdb.sql(f"SELECT * from read_csv('{gcp_path_1}', AUTO_DETECT=TRUE);")

Edit (working)

In the code above, I forgot to set:

duckdb.sql("SET s3_endpoint='storage.googleapis.com'")

After setting this, both approaches read from storage.
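
For reference, here's the complete working script, assembled from the pieces above (bucket, file, and environment-variable names are the same placeholders as before):

import duckdb
import os

duckdb.sql("INSTALL httpfs;")
duckdb.sql("LOAD httpfs;")

# Point DuckDB's S3 layer at Google Cloud Storage instead of AWS
duckdb.sql("SET s3_endpoint='storage.googleapis.com';")

hmac_access = os.getenv('GOOGLE_HMAC_ACCESS_ID')
hmac_secret = os.getenv('GOOGLE_HMAC_SECRET')

duckdb.sql(f"SET s3_access_key_id='{hmac_access}';")
duckdb.sql(f"SET s3_secret_access_key='{hmac_secret}';")

gcp_path_1 = "s3://some_bucket/some_file.csv"

# Both approaches now work
duckdb.sql(f"SELECT * FROM '{gcp_path_1}'").show()
duckdb.sql(f"SELECT * FROM read_csv('{gcp_path_1}', AUTO_DETECT=TRUE)").show()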

  • check https://cloud.google.com/storage/docs/authentication/managing-hmackeys – Puteri May 20 '23 at 22:56
    Your approach one is failing with a catalog error because you forgot to add the `f` prefix to the f-string for the SQL string – Mause May 21 '23 at 15:40

1 Answer


To access your files, you need to use s3:// as the URL prefix.

  1. If your GCS bucket is open to the public:
duckdb.sql('INSTALL httpfs')
duckdb.sql('LOAD httpfs')

# Even for a public bucket, DuckDB must be pointed at the GCS endpoint,
# otherwise it builds an amazonaws.com URL
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")

duckdb.sql("SELECT * FROM 's3://some_bucket/some_file.csv'").show()

  2. If your GCS bucket is not public, you first need to create an HMAC key: go to the Google Cloud console and follow the instructions. Then execute the following commands:
duckdb.sql('INSTALL httpfs')
duckdb.sql('LOAD httpfs')
duckdb.sql("SET s3_endpoint='storage.googleapis.com'")

# You will obtain the key_id from the previous step of 
# configuring settings in the Google Console.
duckdb.sql("SET s3_access_key_id='<your-key-id>'") 

# You will obtain the secret_access_key from the previous step of 
# configuring settings in the Google Console.
duckdb.sql("SET s3_secret_access_key='<your-key>'") 

duckdb.sql("SELECT * FROM 's3://some_bucket/some_file.csv'").show()
  • Running that gives the error: `duckdb.CatalogException: Catalog Error: Table with name {gcp_path_1} does not exist!`, here `gcp_path_1` is set to: `s3://storage_bucket/some_file.csv` (I've changed `storage_bucket/some_file.csv` to a path that exists in my storage though) – baxx May 21 '23 at 15:05
  • I've updated the OP with context on the failure – baxx May 21 '23 at 15:11