
I'm playing around with the Databricks Delta Live Tables feature using the SQL API. This is my statement so far:

--Create Bronze Landing zone table
CREATE STREAMING LIVE TABLE raw_data
COMMENT "mycomment"
TBLPROPERTIES ("quality" = "bronze")
AS 
SELECT * FROM cloud_files('/mnt/path/here', 'csv',
  map(
    'cloudFiles.inferColumnTypes', 'true',
    'option.recursiveFileLookup', 'true',
    'header', 'false'
  ))

The data it's reading doesn't have headers, but I want it to infer the data types using the inferSchema option. However, I need some way to provide at least the column names. There's an option to give it an explicit schema, but since this is a landing table I want to keep that kind of overhead to a minimum. The Spark documentation is pretty sparse as it is, and the Databricks documentation is even worse. Does anyone know if I can do this?
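One direction that might work (just a sketch, assuming the default behaviour of Spark's CSV reader; the column names are made up for illustration): with header set to false, Spark assigns the default names _c0, _c1, ..., so you could leave type inference on and simply alias those columns in the SELECT list.

-- Possible variant of the bronze table; id/name/amount are illustrative names
CREATE STREAMING LIVE TABLE raw_data
COMMENT "mycomment"
TBLPROPERTIES ("quality" = "bronze")
AS
SELECT
  _c0 AS id,
  _c1 AS name,
  _c2 AS amount
FROM cloud_files('/mnt/path/here', 'csv',
  map(
    'cloudFiles.inferColumnTypes', 'true',
    'header', 'false'
  ))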

Alex Ott
Jamalan
  • Hi @Jamalan, please refer to this [link](https://www.youtube.com/watch?v=BIxwoO65ylY&t=194s) and [GitHub - databricks/delta-live-tables-notebooks](https://github.com/databricks/delta-live-tables-notebooks). – B. B. Naga Sai Vamsi May 17 '22 at 11:25
  • @BhanunagasaiVamsi-MT, sorry, I'm not sure how that helps... Is there a specific notebook in there that shows I can infer the data types but supply the column names? – Jamalan May 17 '22 at 11:35

1 Answer


Note: when you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue, you must restart it.


Delta Lake can automatically update the schema of a table as part of a DML transaction (either appending or overwriting) and make the schema compatible with the data being written.
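One way this surfaces in SQL is schema evolution during a MERGE; a minimal sketch, assuming hypothetical tables target and updates (the config below is the Delta Lake setting that enables automatic schema merging):

-- Enable automatic schema merging for this session
SET spark.databricks.delta.schema.autoMerge.enabled = true;

-- Columns present in `updates` but missing from `target` are added to target's schema
MERGE INTO target t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;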

Delta Lake automatically validates that the schema of the DataFrame being written is compatible with the schema of the table. Delta Lake uses the following rules to determine whether a write from a DataFrame to a table is compatible:

• All DataFrame columns must exist in the target table. If there are columns in the DataFrame not present in the table, an exception is raised. Columns present in the table but not in the DataFrame are set to null.

• DataFrame column data types must match the column data types in the target table. If they don’t match, an exception is raised.

• DataFrame column names cannot differ only by case. This means that you cannot have columns such as “Foo” and “foo” defined in the same table. While Spark can be run in case-sensitive or case-insensitive (default) mode, Parquet is case sensitive when storing and returning column information. Delta Lake is case-preserving but case-insensitive when storing the schema, and has this restriction to avoid potential mistakes, data corruption, or data loss issues. Delta Lake supports DDL to add new columns explicitly, as well as the ability to update the schema automatically.
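For example, a minimal DDL sketch for adding a column explicitly (the table and column names are hypothetical):

-- Explicitly add a new column to an existing Delta table
ALTER TABLE my_delta_table ADD COLUMNS (ingest_source STRING COMMENT 'added explicitly');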

If you specify other options, such as partitionBy, in combination with append mode, Delta Lake validates that they match and throws an error for any mismatch. When partitionBy is not present, appends automatically follow the partitioning of the existing data.

Refer to this link for more information.

Abhishek K