
Using an example I found online, the code below throws an error because it cannot read from the S3 bucket. The problem is that I have to pass in the AWS credentials, which are held in the variable S3_dir along with the bucket path. I am unable to get this to work.

%sql
DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
 id BIGINT,
  string1 STRING,
  numbers BIGINT,
) USING delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * from delta.f"{S3_dir}";
  • What is the specific problem? Is it the desire to pass in the directory name, or are you having problems accessing data from S3? If you hard-coded that directory name, does it work? Also, where are you running this command -- it would appear to be from a Databricks Notebook? – John Rotenstein Feb 24 '23 at 01:35

1 Answer


In order to use data from external storage (such as AWS S3 or Azure ADLS), you need to mount the S3 bucket to Databricks. Here is the step-by-step procedure:

Step 1: Create AWS Access Key and Secret Key for Databricks

  • Log in to your AWS account
  • Under your profile name (top right corner), select Security credentials
  • Scroll down to Access keys -> create an access key and download the generated CSV file

Note: This file contains your access key and secret key

Step 2: Log in to Databricks and run the following code

Create a notebook called “mount_s3_dbfs”

import urllib.parse

# 1. Add the necessary information (replace the placeholders with your values)
ACCESS_KEY = "<access_key>"
SECRET_KEY = "<secret_key>"
S3_BUCKET = "<s3bucket_name>"

# 2. URL-encode your secret key so it can be embedded in the URL
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")

# 3. Define the source URL
URL = "s3a://{}:{}@{}".format(ACCESS_KEY, ENCODED_SECRET_KEY, S3_BUCKET)

# 4. Mount your S3 bucket on DBFS
dbutils.fs.mount(URL, f"/mnt/{S3_BUCKET}")

# 5. List the mounted folder's contents
display(dbutils.fs.ls(f"/mnt/{S3_BUCKET}/"))
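If the bucket is already mounted and you need to refresh the credentials, unmount it first before running the mount command again (a small optional step using dbutils.fs.unmount):

# Optional: remove an existing mount point before remounting
dbutils.fs.unmount(f"/mnt/{S3_BUCKET}")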

Note: Providing your credentials directly in the notebook is not recommended; instead, upload the credentials CSV to DBFS and read it with Spark's CSV reader.
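For example, if you upload the downloaded credentials CSV to DBFS, you can read the keys with Spark instead of hard-coding them. A minimal sketch, assuming a hypothetical DBFS path and the default column names of the IAM-generated CSV ("Access key ID", "Secret access key"):

# Read the credentials CSV previously uploaded to DBFS (path is hypothetical)
aws_keys_df = spark.read.csv(
    "dbfs:/FileStore/tables/aws_credentials.csv",
    header=True
)

# Extract the key pair from the first (and only) row
row = aws_keys_df.select("Access key ID", "Secret access key").first()
ACCESS_KEY = row["Access key ID"]
SECRET_KEY = row["Secret access key"]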

Step 3: Use Spark to read and write files from AWS

%sql

DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
  id BIGINT,
  string1 STRING,
  numbers BIGINT
) USING delta
LOCATION '/user/hive/warehouse/mydb'
AS SELECT * FROM delta.`path/to/s3file`;

Note: Use backticks (`) to quote the S3 file location in the SELECT statement.
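To come back to the original question of passing the location from a variable such as S3_dir, one option is to build the statement in Python and run it with spark.sql instead of a %sql cell. A minimal sketch, assuming S3_dir holds the mounted (or s3a://) path to the Delta data:

# S3_dir is assumed to point at the mounted data, e.g. the bucket mounted in Step 2
S3_dir = f"/mnt/{S3_BUCKET}/path/to/s3file"  # hypothetical location

spark.sql("DROP TABLE IF EXISTS mydb.bigtable")
spark.sql(f"""
  CREATE TABLE mydb.bigtable
  USING delta
  LOCATION '/user/hive/warehouse/mydb'
  AS SELECT * FROM delta.`{S3_dir}`
""")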

arudsekaberne