
I have many denormalized tables with 200+ columns. These tables are in SQL Server, and the columns are often, if not always, defined as varchar(100) or nvarchar(100) (string) data types. However, most columns actually hold ints, floats, or other data types. It is not possible for me to go through every table and column and pick data types by hand. For many reasons, including app compatibility, performance, and joins, I must convert these to proper (or at least close to proper) data types. Is there any tool that I can use? Has anyone created code to accomplish this? It doesn't have to be perfect, but a close match would do.

What I have tried:

  1. I tried exporting these tables to Excel and moving them back to SQL. It worked, but it takes many more hours than doing it manually because Excel mangles your data and converts it to whatever it feels like (think scientific notation, dates created from numbers, etc.). It was very time-consuming and ultimately failed. If you choose the "Text" option in Excel, everything just comes back as varchar(x).
  2. I tried exporting to a flat file and using the newer VS/SSMS import wizard that suggests data types. This worked better than Excel, but unfortunately even a single conflicting row stops the whole process. The tool is clunky, gives unhelpful errors, and doesn't tell you which row caused the issue. This method is also impractical because these tables are massive and the process is very time-consuming, especially once you factor in troubleshooting the tool.

Thank you for helping. I would also appreciate it if you did not ask me to simply drop the task by saying my setup is bad, etc.

  • Could you clarify, when you say "proper" data types, what do you mean by "proper"? – Nabav Sep 05 '20 at 03:48
  • @Nabav I think it is clear, but I will be more explicit. If column A has values between 1-20, it should ideally be a smallint or int. A colB with 2.4 should be a float, not varchar(100). I understand that you cannot have perfect results. I also understand the potential for danger, for example a zip code with leading 0s would be lost, but at least having reasonably proper data types is a good start. – Shannon_Colt1999 Sep 05 '20 at 03:54
  • I did [something like this for integer-based data](https://www.mssqltips.com/sqlservertip/6107/find-sql-server-integer-columns-to-make-skinnier/). But if your starting column is a string, I guess you would start with: for any column where even one single value is `ISNUMERIC() = 0`, you eliminate it from the possibility of converting to a numeric type, and for any column where even one value is `ISDATE() = 0`, you eliminate it from the date/time candidates. If you have columns where *all* the data is ISNUMERIC, then you can start testing their lengths (max length, max length after the decimal if a decimal is there, etc.). – Aaron Bertrand Sep 05 '20 at 04:16
  • @AaronBertrand Thank you! Your article is extremely valuable for integers, too bad it's not for strings! Actually, I think that using ISNUMERIC/ISDATE is really what I should be using. I knew about these functions but somehow they never came to mind when thinking of this dreadful task! Thanks again – Shannon_Colt1999 Sep 05 '20 at 04:55
  • I didn't think of this when I wrote up my answer, but it occurred to me that `TRY_CONVERT()` might be a more precise way to determine if a value should be a certain type, rather than relying on the cowboy antics of ISNUMERIC/ISDATE (they are not super reliable). `CASE WHEN TRY_CONVERT(tinyint, col) IS NULL THEN 'will not fit in tinyint' WHEN TRY_CONVERT(int, col) IS NULL THEN 'will not fit in int' etc. etc.` – Aaron Bertrand Sep 05 '20 at 06:04

3 Answers


I'll assume for now you only care about string columns that:

  1. should stay as strings but are perhaps defined wider than they need to be
  2. shouldn't be strings because:
    • they only contain dates
    • they only contain numbers, in which case:
      • you will care about lengths (to determine potential tinyint/int/bigint)
      • you will care about whether they contain decimals

You've already seen an approach for determining whether columns defined as integers could be made smaller, but a similar approach can be used to find candidates in a table where the data type is currently a string and the data meets one of the criteria above.

Let's say you have a table like this:

CREATE TABLE dbo.foo
(
  a int PRIMARY KEY, 
  h varchar(100),
  i varchar(100),
  j varchar(100)
);

INSERT dbo.foo VALUES 
(1,'123','123','20200101 04:00:00'),
(2,'456','456','20200101'),
(3,'789','789','20200101'),
(4,'867','foo','20200101'),
(5,'876','876','20200101'),
(6,'6.54','654','20200101');

One approach would be to gather the metadata for the columns as they are currently defined (which you can get easily from sys.dm_exec_describe_first_result_set), then build dynamic SQL from that to check each column for: the longest value (which determines the smallest string size); whether there is a single non-numeric value (which means you can't convert to a number); whether there is a single non-date value (which means you can't convert to a date); and whether there is a decimal point (which means you can't convert to the int family, though you'll also need to check precision/scale).

This is absolutely just a rough, dirty kick-start, but it should get you going.

DECLARE @table nvarchar(513) = N'dbo.foo';
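-- @un_sub (declared below) is a per-column aggregate template; each $c$ placeholder
-- is later replaced with the actual column name.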

DECLARE @sql nvarchar(max) = N'SELECT ', @un nvarchar(max) = N'',
  @un_sub nvarchar(max) = N'
  SELECT ColumnName =  MIN([col $c$]), 
  CurrentType = MIN([type $c$]), 
  LongestValue = MAX([len $c$]), 
  [AllNumerics?] = MIN([is_num $c$]), 
  [AllDates?] = MIN([is_date $c$]),
  [AnyContainDecimal] = MAX([has_dec $c$]) FROM x '
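-- Build the cross-tab select list (@sql) and one UNION ALL block per column (@un)
-- from the table's metadata; only string columns are considered.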

SELECT @sql += N'[col ' + name + '] = ''' + name + ''',
   [type ' + name + '] = '''
  + system_type_name + ''',' + QUOTENAME(name)
  + ', [len ' + name + '] = LEN(' + QUOTENAME(name) + '),
  [is_num ' + name + '] = CONVERT(tinyint, ISNUMERIC(' + QUOTENAME(name) + ')),
  [is_date ' + name + '] = CONVERT(tinyint, ISDATE(' + QUOTENAME(name) + ')),
  [has_dec ' + name + '] = CASE WHEN ISNUMERIC(' + QUOTENAME(name) + ') = 1
    AND ' + QUOTENAME(name) + ' LIKE N''%.%'' THEN 1 ELSE 0 END,',
  @un += N'
UNION ALL ' + REPLACE(@un_sub, N'$c$', name)
  
FROM sys.dm_exec_describe_first_result_set('SELECT * FROM ' + @table, NULL, 1)
WHERE system_type_name like '%char%'
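-- The dummy [$garbage$] column below just absorbs the trailing comma left by the select-list builder.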

SELECT @sql += N'[$garbage$]='''' FROM ' + @table;
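-- Wrap the wide SELECT in the CTE x and append the per-column queries; STUFF strips the leading UNION ALL.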

SELECT @sql = N';WITH x AS (
' + @sql + N'
) ' + STUFF(@un, 1, 10, '');

EXEC sys.sp_executesql @sql;

It's a lot to digest... dynamic SQL is powerful but it is really ugly and not exactly object-oriented.

Results (try it out in this fiddle):

[Results grid from the fiddle: ColumnName, CurrentType, LongestValue, AllNumerics?, AllDates?, AnyContainDecimal for columns h, i, and j]

You can see there that:

  1. h is all numerics, and the longest value is 4, but at least one value contains a decimal point, so the optimal type here is decimal(something, something).
  2. i contains at least one non-numeric and at least one non-date value, so it can only be a string, but since the longest value is only 3 characters, varchar(100) is too much. Whether you can go to varchar(3) or char(3), or you need to future-proof with a little padding, is really only a question you can answer qualitatively given your data model, business requirements now and later, etc.
  3. j contains all date types, but you can't interpret much from the max length here (because you have no idea how the dates are actually stored, since they are stored as strings and strings of many, many forms can be interpreted as a valid date). So you probably know enough to say j should be a datetime of some flavor, but you'll need to look closer at values to get a feel for what's actually there.

You could change the result from this query (especially for tables with lots of columns) to only return values that are worth investigating; in this case I returned all rows to demonstrate (and all rows have potential fixes in my example anyway). Just add another CTE around the union and filter based on those columns (or others you add).
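For example, here is a rough sketch (my variation, not part of the original script) that swaps the final assembly step above for one wrapping the union in a second CTE and filtering it; the WHERE conditions are just examples, and you would still run EXEC sys.sp_executesql @sql; afterwards:

SELECT @sql = N';WITH x AS (
' + @sql + N'
), y AS (
' + STUFF(@un, 1, 10, '') + N'
)
SELECT * FROM y
 WHERE [AllNumerics?] = 1   -- numeric candidates
    OR [AllDates?] = 1      -- date/time candidates
    OR LongestValue < 100   -- strings defined wider than the data needs
';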

Of course, on big tables this will potentially scan for every column, so don't expect it to be fast, and expect it to consume a lot of memory if you're short on that. Also, this is probably obvious, but this can't protect you from picking a type that will hurt you later. Say the column has been collecting integers and just got to 99, so you change the type to tinyint because there are no decimals and the longest length is 2. Then someone inserts 256 and boom.
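(That failure mode is easy to see with TRY_CONVERT, since tinyint only covers 0 through 255:)

SELECT TRY_CONVERT(tinyint, '99');   -- 99: fits
SELECT TRY_CONVERT(tinyint, '256');  -- NULL: out of range for tinyint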

You could also add other enhancements, like also getting the min length (if they're all strings, maybe you have varchar but it could be char), checking whether any characters fall outside of ASCII (maybe you have nvarchar but it could be varchar), counting how many digits sit on either side of the decimal (to be more precise about decimal types), or finding the max value (to be more accurate about which integer type fits). I'll leave those as an exercise.
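As a rough kick-start for a couple of those exercises, here is a sketch of mine against column i of the dbo.foo sample (the binary-collation LIKE trick is one common way to spot non-ASCII characters, not the only one):

SELECT MinLen      = MIN(LEN(i)),                  -- if MinLen = MaxLen, char(n) might fit
       MaxLen      = MAX(LEN(i)),
       MaxIntValue = MAX(TRY_CONVERT(bigint, i)),  -- largest convertible integer; helps pick tinyint/int/bigint
       AnyNonAscii = MAX(CASE WHEN i LIKE '%[^ -~]%' COLLATE Latin1_General_BIN
                              THEN 1 ELSE 0 END)   -- 1 if any character falls outside printable ASCII
FROM dbo.foo;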

Aaron Bertrand
  • Gosh, I never expected two such serious answers to my question! Thank you so much, I really wish I could vote them up, but I don't have enough rep, and I wish I could choose two best answers! I really used them both. Thank you so much for such a useful answer! – Shannon_Colt1999 Sep 07 '20 at 03:35

You might have a simpler solution in SQL Server: just try converting the values and choose the most appropriate type. For a single column, handling integers, dates, and times is pretty easy:

select (case when count(try_convert(tinyint, col)) = count(col) then 'tinyint'
             when count(try_convert(int, col)) = count(col) then 'int'
             when count(try_convert(bigint, col)) = count(col) then 'bigint'
             when count(try_convert(date, col)) = count(col) then 'date'
             when count(try_convert(time, col)) = count(col) then 'time'
             when count(try_convert(datetime, col)) = count(col) then 'datetime'
             else 'varchar(255)'  -- or whatever default
        end)
from t
where col is not null;

This needs to be expanded in two ways. One is for more columns and the second is for other types of numbers. The first is easy:

select colname,
       (case when count(try_convert(tinyint, col)) = count(col) then 'tinyint'
             when count(try_convert(int, col)) = count(col) then 'int'
             when count(try_convert(bigint, col)) = count(col) then 'bigint'
             when count(try_convert(date, col)) = count(col) then 'date'
             when count(try_convert(time, col)) = count(col) then 'time'
             when count(try_convert(datetime, col)) = count(col) then 'datetime'
             else 'varchar(255)'  -- or whatever default
        end)
from t cross apply
     (values ('col1', col1), ('col2', col2), . . . ) v(colname, col)
where col is not null
group by colname;

Note: If the values are all NULL, there is no way to check.
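If you would rather see those columns in the output than have them silently drop out, one possible tweak (a sketch, not part of the answer above) is to remove the WHERE clause and add a guard branch first:

select colname,
       (case when count(col) = 0 then 'all NULL - cannot infer'
             when count(try_convert(tinyint, col)) = count(col) then 'tinyint'
             -- ... remaining branches as above ...
             else 'varchar(255)'
        end)
from t cross apply
     (values ('col1', col1), ('col2', col2), . . . ) v(colname, col)
group by colname;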

The problem with numbers that contain decimal points is that the values are ambiguous -- do you want a numeric or a float? One possibility is that you already have data types in mind. For instance, you might know that all numerics are likely to be monetary amounts, so numeric(20, 4) -- and you can just include them above.

Or you might test where the decimal place is and use that information to derive the type. I think the simplest solution might be something like this:

select colname,
       (case when count(try_convert(tinyint, col)) = count(col) then 'tinyint'
             when count(try_convert(int, col)) = count(col) then 'int'
             when count(try_convert(bigint, col)) = count(col) then 'bigint'
             when count(try_convert(date, col)) = count(col) then 'date'
             when count(try_convert(time, col)) = count(col) then 'time'
             when count(try_convert(datetime, col)) = count(col) then 'datetime'
             when count(try_convert(numeric(20, 4), col)) = count(col) and
                  sum(case when col like '%._____' then 1 else 0 end) = 0
             then 'numeric(20, 4)'
             when count(try_convert(float, col)) = count(col)
             then 'float'
             else 'varchar(255)'  -- or whatever default
        end)
from t cross apply
     (values ('col1', col1), ('col2', col2), . . . ) v(colname, col)
where col is not null
group by colname;
Gordon Linoff
  • max(try_) will always return a result when there is even a single row that can be converted/cast, regardless of whether all other rows can be converted. The query could work with count() instead of max(): `count(try_convert(xyzdatatype, col)) = count(col)` – lptr Sep 05 '20 at 13:12
  • @lptr . . . That is an interesting point. I'll change the logic to handle 1 failure instead. – Gordon Linoff Sep 06 '20 at 02:17
  • @GordonLinoff I never expected two such serious answers to my question! Thank you so much, I really wish I could vote them up, but I don't have enough rep, and I wish I could choose two best answers! I really used them both. Thank you so much for such a useful answer! – Shannon_Colt1999 Sep 07 '20 at 03:35

You can consider using a view on the table, where the view uses a select with a conversion function like

select someFunction(colA), someOtherFunction(colB) ... from tableName

e.g. for SQL Server

CREATE VIEW myView
as
select CAST(colA AS int) as colA, CAST(colB AS text) as colB ... 
from tableName

Then you can say select ... from myView

Ari Singh
  • Thanks for the answer. But I am not sure how this is going to help. You are converting each and every col manually. Writing this code is what I am trying to avoid. If I have to manually create the statements and guess the data types, it'll be a week of manual work which is what the question is about. – Shannon_Colt1999 Sep 05 '20 at 03:57
  • Unless you can have the computer read your thoughts - you have to specify what type you want to convert to. If you want all columns to be converted to the same type, or using some algorithm - you can write a program to write the SQL query for the view, and have this SQL writer program create the VIEW statements for you and create the views. You might be able to even use Excel to write the view queries. Put the metadata (column names, data types etc.) in Excel and have Excel create the SQL for the views. You can export the table metadata from the database to Excel to get started. – Ari Singh Sep 05 '20 at 04:00
  • Read thoughts? This is really just data sampling, it's not in my mind. – Shannon_Colt1999 Sep 05 '20 at 04:57
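To illustrate the metadata-driven idea in Ari Singh's comment above, here is a rough, hypothetical sketch: it assumes you have already recorded a target type per column in a mapping table of your own (dbo.ColumnTypeMap is invented for this example, and could be populated from the detection queries in the other answers), and it uses STRING_AGG, which requires SQL Server 2017+:

-- Hypothetical sketch: dbo.ColumnTypeMap(TableName, ColumnName, TargetType) is a
-- mapping you maintain yourself; the table and view names are placeholders.
DECLARE @table sysname = N'dbo.tableName',
        @sql   nvarchar(max);

SELECT @sql = N'CREATE VIEW dbo.myView AS
SELECT ' + STRING_AGG(CONVERT(nvarchar(max),
               N'CAST(' + QUOTENAME(ColumnName) + N' AS ' + TargetType + N') AS ' + QUOTENAME(ColumnName)),
               N', ')
          + N'
FROM ' + @table
FROM dbo.ColumnTypeMap
WHERE TableName = @table;

EXEC sys.sp_executesql @sql;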