3

I am writing a quick-and-dirty application to load sales plan data into SQL Server (2008 FWIW, though I don't think the specific version matters).

The data set is the corporate sales plan: a few thousand rows of Units, Dollars and Price for each combination of customer, part number and month. This data is updated every few weeks, and it's important to track who changed it and what the changes were.

--  Metadata columns are suffixed with ' ##', to enable an automated 
--  tool I wrote to handle repetitive tasks such as de-duplication of 
--  records whose values didn't change in successive versions of the 
--  forecast.  

CREATE TABLE [SlsPlan].[PlanDetail]
(
    [CustID] [char](15) NOT NULL,
    [InvtID] [char](30) NOT NULL,
    [FiscalYear] [int] NOT NULL,
    [FiscalMonth] [int] NOT NULL,
    [Version Number ##] [int] IDENTITY(1,1) NOT NULL,

    [Units] [decimal](18, 6) NULL,
    [Unit Price] [decimal](18, 6) NULL,
    [Dollars] [decimal](18, 6) NULL,

    [Batch GUID ##] [uniqueidentifier] NOT NULL,
    [Record GUID ##] [uniqueidentifier] NOT NULL DEFAULT (NEWSEQUENTIALID()),
    [Time Created ##] [datetime] NOT NULL,
    [User ID ##] [varchar](64) NULL DEFAULT (ORIGINAL_LOGIN()),

    CONSTRAINT [PlanByProduct_PK] PRIMARY KEY CLUSTERED 
        ([CustID], [InvtID], [FiscalYear], [FiscalMonth], [Version Number ##])
)

To track changes, I'm using an IDENTITY column as part of the primary key to enable multiple version with the same primary key. To track who did the change, and also to enable backing out an entire bad update if someone does something completely stupid, I am inserting the Active Directory logon of the creator of that version of the record, a time stamp, and two GUIDs.

The "Batch GUID" column should be the same for all records in a batch; the "Record GUID" column is obviously unique to that particular record and is used for de-duplication only, not for any sort of query.

I would strongly prefer to generate the batch GUID inside a query rather than by writing a stored procedure that does the obvious:

DECLARE @BatchGUID UNIQUEIDENTIFIER = NEWID()

INSERT INTO MyTable 
SELECT I.*, @BatchGUID
FROM InputTable I

I figured the easy way to do this is to construct a single-row result with the timestamp, the user ID and a call to NEWID() to create the batch GUID. Then, do a CROSS JOIN to append that single row to each of the rows being inserted. I tried doing this a couple different ways, and it appears that the query execution engine is essentially executing the GETDATE() once, because a single time stamp appears in all rows (even for a 5-million row test case). However, I get a different GUID for each row in the result set.

The below examples just focus on the query, and omit the insert logic around them.

WITH MySingleRow AS 
(
    Select              NewID() as [Batch GUID ##], 
                        ORIGINAL_LOGIN() as [User ID ##],
                        getdate() as [Time Created ##]
)
SELECT                  N.*, R1.*
FROM                    util.zzIntegers N
    CROSS JOIN          MySingleRow R1
WHERE                   N.Sequence < 10000000

In the above query, "util.zzIntegers" is just a table of integers from 0 to 10 million. The query takes about 10 seconds to run on my server with a cold cache, so if SQL Server were executing the GETDATE() function with each row of the main table, it would certainly have a different value at least in the milliseconds column, but all 10 million rows have the same timestamp. But I get a different GUID for each row. As I said before, the goal is to have the same GUID in each row.

I also decided to try a version with an explicit table value constructor in hopes that I would be able to fool the optimizer into doing the right thing. I also ran it against a real table rather than a relatively "synthetic" test like a single-column list of integers. The following produced the same result.

WITH AnotherSingleRow AS
(
    SELECT              SingleRow.* 
    FROM                (
                            VALUES (NewID(), Original_Login(), getdate())
                        ) 
                        AS SingleRow(GUID, UserID, TimeStamp)
)

SELECT                  R1.*, S.*
FROM                    SalesOrderLineItems S
    CROSS JOIN          AnotherSingleRow R1

The SalesOrderLineItems is a table with 6 million rows and 135 columns, to make doubly sure that runtime was sufficiently long that the GETDATE() would increment if SQL Server were completely optimizing away the table value constructor and just calling the function each time the query runs.

I've been lurking here for a while, and this is my first question, so I definitely wanted to do good research and avoid criticism for just throwing a question out there. The following questions on this site deal with GUIDs but aren't directly relevant. I also spent a half hour searching Google with various combinations of phrases didn't seem to turn up anything.

  • Azure actually does what I want, as evidenced in the following question I turned up in my research: Guid.NewGuid() always return same Guid for all rows. However, I'm not on Azure and not going to go there anytime soon.

  • Someone tried to do the same thing in SSIS (How to insert the same guid in SSIS import) but the answer to that query came back that you generate the GUID in SSIS as a variable and insert it into each row. I could certainly do the equivalent in a stored procedure but for the sake of elegance and maintainability (my colleagues have less experience with SQL Server queries than I do), I would prefer to keep the creation of the batch GUID in a query, and to simplify any stored procedures as much as possible.

BTW, my experience level is 1-2 years with SQL Server as a data analyst/SQL developer as part of 10+ years spent writing code, but for the last 20 years I've been mostly a numbers guy rather than an IT guy. Early in my career, I worked for a pioneering database vendor as one of the developers of the query optimizer, so I have a pretty good idea what a query optimizer does, but haven't had time to really dig into how SQL Server does it. So I could be completely missing something that's obvious to others.

Thank you in advance for your help.

Brent
  • 41
  • 7
  • Personally, I would say that using a variable is the cleanest way; like you intially showed. If you don't want to do that, I *assume* that the data is coming from a parametrised statement from an application? If so, why not have the application generate the GUID and pass that as a parameter? – Thom A Jun 21 '19 at 13:23
  • @Larnu, thanks for the reply. At the risk of labeling myself as a serious loser, I must confess that front end for the app is Excel until I can get around to writing a real app, which could be months from now. I am trying to keep the Excel VBA code as simple as possible and to encode as much logic as possible in the database, because Excel->SQL is very fragile, and for maximizing reuse when I write the "real" app. I can generate a GUID in excel, but if somebody presses "save" twice and the GUID isn't regenerated properly, I get a single batch with duplicate records, which is bad. – Brent Jun 21 '19 at 13:29
  • But if they click the button twice, the code would be run twice, and thus 2 separate GUIDs. – Thom A Jun 21 '19 at 13:32
  • Actually, not always. The GUID is generated by a user-written Excel function so it's visible in the spreadsheet for debugging purposes. I've made it a non-volatile formula, so it's not always recalculated (there's a notable UI performance hit if the function is declared volatile: every time the user enters a cell, it recalcs times 10,000 rows). If I embedded the GUID generation in the VBA data loader code, it would most definitely work as you say. Thanks for the follow-up reply. – Brent Jun 21 '19 at 13:42
  • That doesn't actually change my comment, a new GUID would be run every time the code is run; just that you aren't rerunning that specific code to generate the GUID again. – Thom A Jun 21 '19 at 13:50
  • `NEWID()` is special among functions in that it will produce a new value per result set row (`CRYPT_GEN_RANDOM()` is another). Almost all other T-SQL functions (including `GETDATE()` and notoriously `RAND()`) do not behave that way and will only be evaluated once per result set, regardless of how you squeeze them into the statement. As long as your interface allows supplying a batch, rather than a query, the variable approach will work -- you don't need a sproc just to declare and use a variable. – Jeroen Mostert Jun 21 '19 at 14:46
  • In a single query, `SELECT * FROM (VALUES (NEWID())) G(x) FULL OUTER JOIN RealTable ON 1 = 1` works for me. The kicker is that I have no idea if, technically, the optimizer is guaranteed to then evaluate the `NEWID()` only once, or if this is just an artifact that might disappear on the next version/phase of the moon. It also interacts quite badly with the rest of the query; you'll want to use a subquery in place of `RealTable` rather than have any `WHERE` or `ORDER BY` on the outer level. – Jeroen Mostert Jun 21 '19 at 14:50

0 Answers0