I have a table like this:

Create table landingzone.classes(
teacher nvarchar(255)
,moment nvarchar(255)
,noOfStudents nvarchar(255)
,scheduledYesNo nvarchar(255)
)

INSERT INTO [landingzone].[classes]
           ([teacher]
           ,[moment]
           ,[noOfStudents]
           ,[scheduledYesNo]) 
Select ' Henrov', '  01/07/2021' ,6 ,'True' 
union all 
Select ' Bill', '  01/05/2021' ,6 ,'False' 
union all
Select ' Henrov', '  31/07/2021' ,NULL ,NULL

What I need is a script that finds out that

  • [teacher] should be a nvarchar(255)
  • [moment] should be a timestamp
  • [noOfStudents] should be an int
  • [scheduledYesNo] should be a bit (or an int with 1/0 as values)

and creates a table like this:

Create table staging.classes(
teacher nvarchar(255)
,moment datetime
,noOfStudents int
,scheduledYesNo bit
)

followed by a datatransfer from landingzone.classes to staging.classes.

However, it should do this by analyzing the data in the landingzone table itself, not by referring to config tables containing the column names with associated datatypes. Because the landingzone table can hold a large number of records in which many fields are empty, it should look past the first 200 records (preferably this number should be configurable).
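To make the configurable-sample idea concrete, this is the kind of profiling I have in mind (a sketch only; @SampleSize and the column tests are assumptions based on the example table above):

```sql
-- Sketch: profile a configurable sample of the landingzone table and count
-- how many values survive conversion to each candidate datatype.
DECLARE @SampleSize int = 200;

SELECT
    COUNT_BIG(*) AS SampledRows
  , COUNT_BIG(CASE WHEN TRY_CONVERT(int, noOfStudents) IS NOT NULL THEN 1 END)          AS IntOk
  , COUNT_BIG(CASE WHEN TRY_CONVERT(datetime, moment, 103) IS NOT NULL THEN 1 END)      AS DateOk  -- style 103 = dd/mm/yyyy
  , COUNT_BIG(CASE WHEN TRY_CONVERT(bit, scheduledYesNo) IS NOT NULL THEN 1 END)        AS BitOk
FROM (SELECT TOP (@SampleSize) * FROM landingzone.classes) AS sample;
```

A column would then be promoted to the narrowest type whose count equals the non-empty count.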

The tables in the landingzone are delivered by other processes and should explicitly keep existing in the current form (business rule).

The challenge I think is mostly in autodiscovering the proper datatypes.

Can this be done in BIML?

Henrov
  • What do you mean by "I need is a script that finds out that..."? Where are you getting your data from? – Amira Bedhiafi Apr 14 '21 at 07:54
  • Have you tried SQL's TRY_PARSE or TRY_CONVERT? – Niels Berglund Apr 14 '21 at 08:19
  • 1
    You should be doing the fact finding on your data *before* you create the table, not after. – Thom A Apr 14 '21 at 08:19
  • @AmiraBedhiafi The tables are delivered by a different process. Fact is: these tables exist and should keep existing. For follow-up processes, however, I need them to have proper datatypes. – Henrov Apr 14 '21 at 09:31
  • @NielsBerglund I do know how to do this in SQL, SSIS and even BIML. The question is mostly about automagically discovering the proper datatypes based on existing data. – Henrov Apr 14 '21 at 09:31
  • @Larnu These tables are delivered by a different process and consumed by other processes that I cannot change. – Henrov Apr 14 '21 at 09:32
  • 2
    I would, personally, therefore suggest *you* take the time to learn the data and then create the appropriate tables for your latter consumption. I don't doubt that there *are* tools out there that attempt to interpret the data types, but I also don't doubt that they are flawed (and tool recommendation is way off topic for Stack Overflow). Take JET/ACE as an example; they are literally **awful** at it. – Thom A Apr 14 '21 at 09:43
  • 1
    Given the existing constraints you can only hope to have some consistency in your input strings. For example, if your "string dates" are always in the form "dd/MM/yyyy" you can apply some logic. Numbers have another problem: string integers should be different from string decimals, because the latter ALWAYS have a decimal separator in them. Booleans should be true/false or 1/0. Indeed, what @Larnu says is that you need to have confidence in how consistently these input strings are formatted before even starting the translation, but you will never be certain of the correct interpretation. – Steve Apr 14 '21 at 10:28

1 Answer


There's no Biml method that can help you detect data type.

I was on a long-term project ingesting typeless data from a mainframe and we took a similar approach to what you're doing. We landed data as-is into a table with the widest allowable string type from that system.*

We wrote a T-SQL script that would unpivot the table and then generate N columns that do analysis on the data. A series of tests would look like this (freehand coding, so go with the spirit, not the letter):

SELECT
  MIN(LEN([teacher])) AS MinLen
, MAX(LEN([teacher])) AS MaxLen
, COUNT_BIG(DISTINCT [teacher]) AS DistinctValues
, COUNT_BIG(CASE WHEN NULLIF(LTRIM(RTRIM([teacher])), '') IS NOT NULL THEN 1 END) AS NotNullOrEmpty
, COUNT_BIG(CASE WHEN NULLIF(LTRIM(RTRIM([teacher])), '') IS NULL THEN 1 END) AS NullOrEmpty
, COUNT_BIG(CASE WHEN TRY_CONVERT(bigint, [teacher]) IS NULL THEN 1 END) AS NotBigInt
, COUNT_BIG(CASE WHEN TRY_CONVERT(bigint, [teacher]) IS NOT NULL THEN 1 END) AS BigInt
FROM landingzone.classes;

The purpose of that was to generate descriptive statistics on what the column looked like: how often it was populated, what percentage of it could fit into the various data types, and min/max lengths.
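This isn't our exact script, but a sketch of how such a profiling query can be generated per column from the metadata (schema/table names are placeholders; STRING_AGG needs SQL Server 2017+):

```sql
-- Sketch: build one profiling expression per column of a landingzone table,
-- then run the generated SELECT with sp_executesql.
DECLARE @sql nvarchar(max);

SELECT @sql = N'SELECT' + STRING_AGG(CAST(N'
  MIN(LEN(' + QUOTENAME(COLUMN_NAME) + N')) AS ' + QUOTENAME(COLUMN_NAME + '_MinLen') + N',
  COUNT_BIG(CASE WHEN TRY_CONVERT(bigint, ' + QUOTENAME(COLUMN_NAME)
      + N') IS NOT NULL THEN 1 END) AS ' + QUOTENAME(COLUMN_NAME + '_BigIntOk')
  AS nvarchar(max)), N',')
  + N' FROM landingzone.classes;'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'landingzone' AND TABLE_NAME = 'classes';

EXEC sys.sp_executesql @sql;
```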

Gotchas we ran into.

No data or sparsely populated data. That was easily the biggest bite in our backsides. If I recall correctly, NULLs would happily cast into whole number types, so we had lots of correcting to do after the fact once we had sufficient data. Save yourself some grief: if you don't have much data, leave it as string ;)
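One concrete illustration of why sparse data misleads (stock SQL Server behavior, nothing project-specific): blank strings quietly convert to numeric zero, so a column of blanks profiles as 100% numeric unless you filter them out first.

```sql
-- Empty and whitespace-only strings "succeed" as integer conversions.
SELECT
    TRY_CONVERT(int, NULL) AS FromNull   -- NULL: excluded from the counts
  , TRY_CONVERT(int, '')   AS FromEmpty  -- 0: an empty string converts!
  , TRY_CONVERT(int, ' ')  AS FromBlank; -- 0 as well
```

Hence the NULLIF(LTRIM(RTRIM(...)), '') wrapper in the profiling query: blanks must be normalized to NULL before counting.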

Local data rules. We ran into things that canned scripts wouldn't handle. The mainframe used a caret ^ to signal end of time.

Dates and times. Model204 could handle March 32. Clearly that's the same as April 1, right? Same with times: March 31 24:30 is clearly April 1 at 00:30.
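For the date ambiguity, a hedged illustration using TRY_CONVERT's style parameter (103 = dd/mm/yyyy, 101 = mm/dd/yyyy): the same string can parse under one convention and fail under the other, and SQL Server rejects the mainframe's rollover dates outright.

```sql
SELECT
    TRY_CONVERT(datetime, '31/07/2021', 103) AS BritishStyle   -- parses: 2021-07-31
  , TRY_CONVERT(datetime, '31/07/2021', 101) AS UsStyle        -- NULL: no month 31
  , TRY_CONVERT(datetime, '2021-03-32')      AS MarchThirtyTwo; -- NULL: rejected
```

So a date column should be profiled against an explicit style that you believe matches the source, not against the server default.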

*"Oh, well yeah we do have these fields where we can store binary/very long strings. Didn't we tell you about those?"

billinkc
  • I will be using BIML to create the tables and packages to move stuff from landingzone to staging. After that I will manually correct the datatypes in the staging schema and adapt the packages (created by Biml) to the proper datatypes. Not fully automagic, but it should save some time. – Henrov Apr 15 '21 at 07:25