0

In SQL Server (or it could be other SQL language I guess), how do you make sure that a table always verifies some complex constraint. For instance, imagine a table has only column A and B

The constraint I want to enforce, is that all lines that contain the same value for column A (e.g. 'a'), also have the same value for column B (e.g. 'b'). For instance, this isn't allowed:

A B C
a1 b c1
a1 c c2
a2 d c3

But this is allowed:

A B C
a1 b c1
a1 b c2
a2 d c3

This is an example of constraint, but there could be more complex (i.e. depends on more columns etc). Here I think that this constraint could be expressed in terms of primary key etc, but let's assume it can't.

What is the way to enforce this?

What I can think about is, each time I want to do an insert, I instead go through a stored procedure. The stored procedure would do an atomic transaction and check if the table would verify the constraint if updated, and "reject" the update if the constraint would be violated, but this seems quite complex.

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
lezebulon
  • 7,607
  • 11
  • 42
  • 73
  • 1
    Your logic isn't clear, please show an example which matches the actual level of complexity you really need. And please show data that does meet the constraint and doesn't. And please show what you have tried in terms of [creating a constraint](https://learn.microsoft.com/en-us/sql/relational-databases/tables/unique-constraints-and-check-constraints?view=sql-server-ver15). If that isn't possible then a trigger is the next option. – Dale K Nov 17 '21 at 21:16
  • 5
    It also strikes me that having such a constraint between rows implies you are storing your data incorrectly... – Dale K Nov 17 '21 at 21:18
  • @DaleK I updated the example – lezebulon Nov 17 '21 at 21:27
  • 3
    This screams of a having a lack of proper normalization to me. – Sean Lange Nov 17 '21 at 21:47
  • @SeanLange this is just an example you can assume that the "business rule" I want to enforce is something more like "having at most 3 distinct values for C column" for instance. Due to potential race conditions, I don't see how it's possible to enforce this other than at SQL level. Ie my application could just try to enforce this condition when calling INSERT, but then nothing prevents an incorrect direct INSERT from breaking the table constraint – lezebulon Nov 17 '21 at 21:55
  • 2
    @lezebulon What do race-conditions have to do with anything? And even if this is a business/domain rule, it still indicates that your DB is very likely not normalized (though we can't say for certain until and unless you show us your `CREATE TABLE` statement - and I recognize that some data-modelling scenarios do not translate well to RDBMS+SQL, but again, we can't say if that applies to you or not without seeing your **actual** DB design). – Dai Nov 17 '21 at 22:03
  • @Dai let's take the example of "having at most 3 distinct values for C column" I could try to enforce this at application level: - checking if I currently have < 3 distinct values for C - if so, do the insert in the table But between step 1 and 2 there is the possibility that some other insert will run and also insert other values in columns C. Hence what I mean by "race condition" if I try to handle things at application level. I'm trying to find a way so that the business rule is always enforced at table level. Which your answer seems to give :) – lezebulon Nov 17 '21 at 22:07
  • 1
    @lezebulon That is exactly what a `TRANSACTION` is for. If you don't know what DB transactions are then you have **a lot** you still need to learn about databases and database-theory. – Dai Nov 17 '21 at 22:19
  • @Dai yes I realize while typing it that indeed all of this could be put in a single transaction. In that case it would still mean that doing an incorrect raw INSERT in the table could break the constraint as nothing would be checked – lezebulon Nov 17 '21 at 22:19
  • @lezebulon What do you mean by "incorrect raw `INSERT`"? There's no such thing as a "raw" INSERT. All DML operations go through triggers and constraints, and the constraint I've described will block an INSERT that would cause a duplicate row group as you've described. (There are vendor-specific things like SQL Server's "bulk insert" and `bcp` and the rest, but when you do then SQL Server marks the constraint as "untrusted", so that's not a problem because application code should reject databases with untrusted constraints). – Dai Nov 17 '21 at 22:21
  • @Dai sorry for the misunderstanding, the solution you propose with trigger will work. I meant that, only trying to handle the problem at application level with transactions is still weak to incorrect raw INSERT, if there are not trigger constraints added – lezebulon Nov 17 '21 at 22:25
  • Indexed view is definitely a good option to enforce this. Performance will be far better than a UDF in a check constraint. But my feeling is this is a case of poor normalization – Charlieface Nov 18 '21 at 09:37

2 Answers2

2

Rough definitions, for this answer:

  1. Simple constraint. A constraint that can be, without schema changes beyond the constraint definition, enforced by the straight forward use of one of SQL Server constraints: primary key, unique, foreign key, not null and check against a simple scalar expression. (I would include a unique filtered index as a simple constraint.)
  2. Complex constraint. A constraint that can not be so enforced.

Just as SQL server provides different constraints to handle different situations there are different techniques for enforcing more complex constraints. Which technique is appropriate depends on what one is trying to accomplish and the context one is in.

Broadly, there are two approaches to enforcing complex constraints, either add code to that prevents bad data or change the schema in such a way that SQL Server can enforce the constraint with a simple constraint.

Add code to enforce the constraint.

A common refrain is to use a trigger to enforce a constraint. So one writes an after trigger that checks whether the constraint has been violated by the DML that fired the trigger. If it has been violated the trigger throws an error, rolling back the transaction. Another approach is to take insert, update, delete rights away from all applications and force them through stored procedures that will error on attempts to violate the constraint. Using a UDF in a check constraint is new to me. The mechanism is similar to the after trigger, a DML statement made a change to the data. After that change, but before the DML statement is done, code is ran in the UDF and the result checked to see if the constraint is violated or not.

The big challenge with adding code that checks the constraint is correctly enforcing the constraint in the face of concurrency. Because transactions are isolated, the code verifying that the constraint is met may not see a conflicting change made by another session. See an example "Concurrency can cause trouble" at the end.

Now, some business rules are too complex to enforce with changes to the schema and code is required. I generally don't consider such business rules to be constraints that the database must enforce and place the code to enforce the business rules in either stored procedures or application code, with the other business rules. (Saying that these business rules are not database constraints does not getting correctness in the face of concurrency any easier.)

Change the schema so SQL Server can enforce the constraint.

A strength reduction from a complex constraint to a simple constraint. There is no one technique in this approach. Sometimes the application of a schema change that will allow SQL Server to enforce a constraint is easy to spot. And sometimes one may need to play with SQL Server features with outside the box thinking.Some schema changes that may be helpful:

  1. Increase the normalization. Many reasonable constraints cannot be expressed against denomalized data.
  2. Adding a persisted computed column to a table and using that column in a constraint.
  3. Indexed views can be amazing for constraint enforcement. (The restrictions placed on what views can be indexed is super frustrating.)

Example using another table

(This may count as a normalization.) Lets consider a table, like in your question, but adding one additional column, a primary key that allows update and delete of individual rows.

CREATE TABLE dbo.T (PK INT PRIMARY KEY IDENTITY(1, 1)
    , A CHAR(1) NOT NULL
    , B INT NOT NULL
    /* other columns as needed */);

And the rule is that the following query should never return a row:

SELECT 'Uh oh!' AS "Uh oh!"
FROM dbo.T
GROUP BY A
HAVING COUNT(DISTINCT B) > 1

To enforce this one can add the following table and constraint:

CREATE TABLE dbo.TConstraint (A CHAR(1) NOT NULL PRIMARY KEY /* Only one B is allowed for any A */
    , B INT NOT NULL
    , UNIQUE (A, B) /* provide a target for a foreign key from dbo.T */);

ALTER TABLE dbo.T ADD FOREIGN KEY (A, B)
REFERENCES dbo.TConstraint (A, B);

Now the application logic or stored procedures that write to dbo.T will have to change to write to dbo.EnforceComplexConstraintOnT first. Updates to dbo.T (B) are going to be messy because one will need to change the values in dbo.TCnstraint before dbo.T but that can't be done with the old values in dbo.T. So an update becomes:

  1. Delete row from dbo.T
  2. Update dbo.TConstraint
  3. Insert "changed" row into dbo.T.

That update logic is incompatible with an IDENTITY column and incompatible with any foreign key referencing dbo.T. So the contexts where this would work are limited.

Note also that this solution includes adding additional code for insert, update, and possibly delete. That extra code is not enforcing the constraint, but manipulating the data so that into a form such that SQL Server can enforce the constraint.

Example using indexed views

Indexed views come with caveats. See the documentation: https://learn.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views?view=sql-server-ver15

Instead of the new table and constraint:

/* SET OPTIONS required by indexed views*/
SET ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS
    , CONCAT_NULL_YIELDS_NULL, QUOTED_IDENTIFIER ON;
SET NUMERIC_ROUNDABORT OFF;
GO

/* view shows all the distiinct A, B in dbo.T */
CREATE VIEW dbo.TConstraint WITH SCHEMABINDING AS
SELECT A
    , B
    , COUNT_BIG (*) AS MustHaveForGroupingInAnIndexedView
FROM dbo.T
GROUP BY A, B;
GO

CREATE UNIQUE CLUSTERED INDEX A_B ON dbo.TConstraint (A, B);
CREATE UNIQUE NONCLUSTERED INDEX A ON dbo.TConstraint (A);

INSERT INTO dbo.T (A, B)
VALUES ('a', 100), ('a', 200)

Msg 2601, Level 14, State 1, Line 27 Cannot insert duplicate key row in object 'dbo.TConstraint' with unique index 'A'. The duplicate key value is (a).

Pros: 1) No extra code to get the data into a form SQL Server can constrain. 2) Updates to dbo.T (B) can be done without doing the delete dbo.T update dbo.TConstraint insert dbo.T dance, and all the restrictions that creates. Cons 1) The view has to have schemabinding. 2) Deploying schema changes to dbo.T will involve dropping the indexes on the view, dropping the view, altering dbo.T, create the view, create the indexes on the view. Some deployment tools may struggle with the required steps. Large tables with large indexed views will have a longer deployment for changes to dbo.T.

Concurrency can cause trouble.

@Dai's answer will work in at least some cases, but fails in the following case.

Setup:

  1. Create dbo.TableX, dbo.GetCountInvalidGroupsInTableX(), and CK_Validity as detailed in Dai's answer.
  2. Enable snapshot isolation for the database. ALTER DATABASE CURRENT SET ALLOW_SNAPSHOT_ISOLATION ON
  3. Connect to the database with two separate sessions.

In session 1:

/* Read committed was the default in my connection */
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRANSACTION;
INSERT INTO TableX VALUES(10, 100);

In session 2:

SET TRANSACTION ISOLATION LEVEL SNAPSHOT
INSERT INTO TableX VALUES(10, 200);
/* (1 row affected) This insert did not block
   on the open transaction */

In session 1:

COMMIT;
SELECT * FROM TableX;

/*
A  B
10 100
10 200
*/

To add code to enforce a constraint one must know what isolation level may be used and ensure that the DML will block when needed so that when the code checking that the constraint is still satisfied can see everything it needs to.

Summary

Just as SQL Server has different constraint types for different situations there are different techniques for enforcing constraints when using what SQL provides against the current data model is not enough. Prefer schema changes that allows SQL Server to enforce the constraints. SQL Server is far more likely to get concurrency correct that most people writing Transact-SQL.

Charlieface
  • 52,284
  • 6
  • 19
  • 43
Shannon Severance
  • 18,025
  • 3
  • 46
  • 67
  • Thanks, this is exactly the kind of answer I was looking for :) It still seems strange to me though that the CHECK constraint can be bypassed that "easily" – lezebulon Nov 18 '21 at 09:47
  • Note that cascading foreign keys can solve the "messiness" of updates – Charlieface Nov 18 '21 at 09:49
0
  • SQL Server supports UDFs in CHECK constraints.
  • While CHECK constraints are ostensibly a row-level-only constraint, a CHECK constraint can still use a UDF which checks the entire table, and I believe that then suits your purposes.
    • Note that CHECK constraints are evaluated for every row in a table (though only on-demand) and SQL Server isn't smart enough to detect redundant constraint evaluations. So be careful - and be sure to monitor and profile your database's performance.

(The inner-query in the function below returns the A and COUNT(DISTINCT x.B) values so you can easily copy+paste the inner-query into a new SSMS tab to see the invalid data. Whereas the function's actual execution-plan will optimize-away those columns because they aren't used by the outer-query, so there's no harm in having them in the inner-query).

CREATE FUNCTION dbo.GetCountInvalidGroupsInTableX()
RETURNS bit
AS
BEGIN

    DECLARE @countInvalidGroups int = (
        SELECT
            COUNT(*)
        FROM
            (
                SELECT
                    x.A,
                    COUNT(DISTINCT x.B) AS CountDistinctXB
                FROM
                    dbo.TableX AS x
                GROUP BY
                    x.A
                HAVING
                    COUNT(DISTINCT x.B) >= 2
            );
   
    RETURN @countInvalidGroups;

END

Used like so:

CREATE TABLE TableX (
    A int NOT NULL,
    B int NOT NULL,

    CONSTRAINT PK_TableX PRIMARY KEY ( etc )
);

CREATE FUNCTION dbo.GetCountInvalidGroupsInTableX() RETURNS bit
AS
    /* ... */
END;

ALTER TABLE TableX
    ADD CHECK CONSTRAINT CK_Validity CHECK ( dbo.GetCountInvalidGroupsInTableX() = 0 );

Dai
  • 141,631
  • 28
  • 261
  • 374
  • 1
    ..a 1 million rows table.. while inserting a combination that does not exist..the function has to aggregate 1 million rows .. – lptr Nov 17 '21 at 22:13
  • @lptr If you have an [indexed view](https://learn.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views?view=sql-server-ver15) of the source table then it should be very fast - or even having a normal index on the table will still help. A million rows is nothing when you consider indexes reduce lookups from `n` to `log n` on average. NOTE: When using indexed-views the `HAVING` clause will need to be removed and the logic in the `CHECK` constraint slightly altered. – Dai Nov 17 '21 at 22:14
  • @lptr seems like it's due to the nature of the contraint itself... In my question I don't really care if it's fast or not, just how in theory you're supposed to handle such cases – lezebulon Nov 17 '21 at 22:18
  • @lezebulon .. if you use the function then at least pass the two columns as parameters of the function..then a simple exists() in the function would suffice, instead of aggregating all the rows of the table. – lptr Nov 17 '21 at 22:21
  • @lptr You can't "pass a column" into a function. Please clarify exactly what you're describing. – Dai Nov 17 '21 at 22:22
  • @lptr Yes, but that's a **single row**'s values - not the entire column's values. – Dai Nov 17 '21 at 22:32
  • @Dai … what do you mean by “entire column’s values”? – lptr Nov 17 '21 at 22:45
  • @lptr A _reified_ table column: something akin to a table-valued parameter, except representing a single column and all its data, passed by-reference. SQL Server doesn't support that. – Dai Nov 17 '21 at 22:58