
I am ingesting a csv file from a mounted blob storage to delta live table and here's my initial query:

CREATE INCREMENTAL LIVE TABLE table_raw
COMMENT "Ingesting data from /mnt/foo"
TBLPROPERTIES ("quality" = "bronze")
AS
SELECT * FROM cloud_files("/mnt/foo/", "csv")

I need/want to transform its column names (e.g. remove punctuations, trim whitespaces, etc.). How should I approach it?

1 Answer


First of all, data cleansing should generally be done in the silver layer; in the bronze layer you explicitly want to keep the data unchanged.

If you then want to apply some preprocessing or column renaming, you can use plain SQL syntax in combination with Spark SQL built-in functions.

E.g.:

CREATE INCREMENTAL LIVE TABLE table_silver
COMMENT "Ingesting data from table_raw"
TBLPROPERTIES ("quality" = "silver")
AS
SELECT rtrim(a) AS ColumnName1,
       rtrim(b) AS ColumnName2,
       ...
FROM LIVE.table_raw
  • The question is about cleaning up the column names, while your code is for cleaning up the data itself – Alex Ott Mar 25 '23 at 12:58
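As the comment points out, the question asks about sanitizing the column names themselves, not the column values. One way to do that is to switch this table to a Python DLT notebook, where you can rewrite every column name programmatically. A minimal sketch of the name-cleaning logic (the helper name `clean_column_name` and the exact cleaning rules are my own choices, not anything from the question):

```python
import re

def clean_column_name(name: str) -> str:
    """Trim surrounding whitespace, drop punctuation, and replace
    inner whitespace runs with underscores."""
    cleaned = name.strip()
    cleaned = re.sub(r"[^\w\s]", "", cleaned)  # remove punctuation
    cleaned = re.sub(r"\s+", "_", cleaned)     # collapse spaces to "_"
    return cleaned

# Example: a messy CSV header becomes a safe column name
print(clean_column_name("  First Name! "))  # -> First_Name
```

In a Python DLT pipeline you could then apply this to every column of the streaming DataFrame, e.g. with `df.toDF(*[clean_column_name(c) for c in df.columns])` inside the `@dlt.table`-decorated function; in pure SQL you would instead have to alias each column by hand, as in the `rtrim(...) AS ColumnName1` example above.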