U-SQL Schema Discovery

Question

The Data Lake approach (according to slide 5 here) is:

Ingest all data - regardless of requirements
Store all data - in native format without schema definition
Do analysis - using engines like Hadoop

But let's say we have loaded many datasets into our data lake. How do we go about schema discovery in an automated and scalable manner? Does U-SQL support dynamic schema discovery, or what would be a good way to go about it using ADLA or another toolset?

score 1 · Accepted Answer · answered Aug 17 '17 at 23:57

This is a good question but the answer somewhat depends on the schema you want to discover.

Let me explain:

If you have CSV-type data, there are tools, including the latest version of the ADL Tools for Visual Studio, that will try to detect your schema from the provided data (the tools will actually generate the EXTRACT statement for you).
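To illustrate the general idea of detecting a schema from sample data and emitting an EXTRACT statement, here is a minimal Python sketch. It is not how the ADL Tools do it; the type-narrowing order (long, then double, then DateTime, then string), the function names, and the example path are all assumptions made for illustration, and the generated statement should be reviewed by hand before it goes into a batch job.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Pick the narrowest U-SQL type that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"

    def all_parse(parser):
        for v in non_empty:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    # Value types need a trailing "?" in U-SQL to accept missing values.
    nullable = "?" if len(non_empty) < len(values) else ""
    if all_parse(int):
        return "long" + nullable
    if all_parse(float):
        return "double" + nullable
    if all_parse(datetime.fromisoformat):
        return "DateTime" + nullable
    return "string"  # string is already nullable in U-SQL


def generate_extract(sample_text, path, rowset="@rows"):
    """Build a U-SQL EXTRACT statement from a CSV sample that has a header row."""
    reader = csv.reader(io.StringIO(sample_text))
    header = next(reader)
    rows = [r for r in reader if r]
    columns = []
    for i, name in enumerate(header):
        values = [r[i] if i < len(r) else "" for r in rows]
        columns.append(f"    [{name.strip()}] {infer_type(values)}")
    return (
        f"{rowset} =\n  EXTRACT\n" + ",\n".join(columns)
        + f'\n  FROM "{path}"\n  USING Extractors.Csv(skipFirstNRows: 1);'
    )


if __name__ == "__main__":
    sample = "id,price,when,note\n1,2.5,2017-08-17,hi\n2,,2017-08-18,\n"
    # "/raw/sales/sample.csv" is a hypothetical path inside the lake.
    print(generate_extract(sample, "/raw/sales/sample.csv"))
```

Because the inference only sees a sample, a later file with a different shape can still break the job, which is the cost concern raised in the next paragraph.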

Some interactive languages may also give you extractors that try to infer the schema as part of the query. We do not support this in U-SQL at the moment, because you do not want to have a batch job infer the schema wrongly and fail after spending possibly a lot of money to run the job. In an interactive setting, it is less costly and can be easily corrected/overwritten by the query author.

However, if you have data such as images or text documents, or even nested, semi-structured documents like JSON or XML, the schema that you want often has to be provided. E.g., if you have a JPEG file, do you want the EXIF properties? If so, which ones? Or some feature extraction? Or some color analysis? etc.

So I think one important thing when designing a data lake is to organize the native-format data into semantically meaningful folder structures, and then either use Views/TVFs to provide the schematized view(s) in the metadata service so they are more easily discoverable, or use a service like Azure Data Catalog to describe the data.

If you already have data inside the lake's storage and you want to discover it, right now you would have to build some form of discovery yourself, either with U-SQL and the SDKs or with tooling that goes against the WebHDFS APIs of the store.
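As a rough sketch of the tooling route, the Python below walks a folder tree through the WebHDFS LISTSTATUS operation and yields every file it finds, which could then be fed to whatever schema detection you build. The account name, the starting path, and the way you obtain the Azure AD bearer token are assumptions; the `fetch` parameter is a hypothetical hook so the walk can be tried without network access.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def liststatus_url(account, path):
    """WebHDFS LISTSTATUS URL for a folder in a Data Lake Store account."""
    return (
        f"https://{account}.azuredatalakestore.net"
        f"/webhdfs/v1{quote(path)}?op=LISTSTATUS"
    )


def make_http_fetch(token):
    """Return a fetch function that calls the store with a bearer token.

    Acquiring the token (for example via Azure AD) is outside this sketch.
    """
    def fetch(url):
        req = Request(url, headers={"Authorization": f"Bearer {token}"})
        with urlopen(req) as resp:
            return json.load(resp)
    return fetch


def walk(account, path, fetch):
    """Yield (full_path, length_in_bytes) for every file under path."""
    listing = fetch(liststatus_url(account, path))
    for entry in listing["FileStatuses"]["FileStatus"]:
        child = path.rstrip("/") + "/" + entry["pathSuffix"]
        if entry["type"] == "DIRECTORY":
            yield from walk(account, child, fetch)
        else:
            yield child, entry["length"]
```

A call such as `walk("myaccount", "/raw", make_http_fetch(token))` (with a hypothetical account name) would enumerate the files; grouping the results by folder and extension is one way to decide which extractor or sampler to try on each dataset.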
