
Here is a simple question to which I would like an answer:

We have a member table. Each member practices one, many or no sports. Initially we (the developers) created a [member] table, a [sports] table and a [member_sports] table, just as we have always done.

However, our client doesn't like this and wants to store all the sports a member practices in a single varchar column, separated by a special character.

So if:

1 is football
2 is tennis
3 is ping-pong
4 is swimming

and I like swimming and ping-pong, my favourite sports will be stored in the varchar column as:

x3,x4

Now we don't want to just walk up to the client and claim that his design isn't right. We would like to back it up with proof that fetching the sports from [member_sports] is more efficient than parsing them out of a varchar column.
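For context, the three-table design we built looks roughly like this (table and column names are illustrative, not our real schema):

```sql
-- Sketch of the junction-table design described above.
CREATE TABLE members (
    member_id INT           NOT NULL PRIMARY KEY,
    name      NVARCHAR(100) NOT NULL
);

CREATE TABLE sports (
    sport_id INT           NOT NULL PRIMARY KEY,
    name     NVARCHAR(100) NOT NULL
);

-- One row per (member, sport) pair; the composite PK forbids duplicates.
CREATE TABLE member_sports (
    member_id INT NOT NULL REFERENCES members (member_id),
    sport_id  INT NOT NULL REFERENCES sports (sport_id),
    CONSTRAINT PK_Member_Sports PRIMARY KEY (member_id, sport_id)
);
```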

Is there any documentation that can back our claims? Help!

  • You don't need to go as far as documentation. Just pick a few of the more complex queries they are using, and then rewrite them using a schema based on a junction table with appropriate indices. You should immediately see some stark differences in performance. In general, storing CSV into a SQL table is not a good idea. – Tim Biegeleisen Nov 06 '17 at 13:46
  • If you want to find all people who like swimming, you will have to use WHERE sport LIKE '%x4%', which will not be able to make use of indexes. Also, what happens if you have 50 sports? How do you stop '%x4%' matching x40 or x41 as well as x4? – Steve Ford Nov 06 '17 at 13:53
  • Why do they want to do this? What does it facilitate? – SMM Nov 06 '17 at 13:55
  • [First normal form](https://en.wikipedia.org/wiki/First_normal_form): "... when none of its domains have any sets as elements". People were looking at this sort of thing back in the *70s*. Since your client is seemingly unaware of this, how much should you value their knowledge of database design? – Damien_The_Unbeliever Nov 06 '17 at 14:05
  • Also might be worth pointing out to them that SQL Server has data types *designed* for holding multiple distinct values and querying for them - those being XML, JSON and *tables*. Using something not *designed* for holding multiple values has to have some really compelling justifications. – Damien_The_Unbeliever Nov 06 '17 at 14:12
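Steve Ford's matching concern can be shown concretely; favourite_sports here is a hypothetical name for the CSV column:

```sql
-- CSV design: the leading-wildcard LIKE forces a scan (no index seek),
-- and 'x4' is a substring of 'x40' and 'x41', so this over-matches.
SELECT member_id
FROM   members
WHERE  favourite_sports LIKE '%x4%';

-- Junction-table design: an exact, index-friendly predicate.
SELECT ms.member_id
FROM   member_sports AS ms
WHERE  ms.sport_id = 4;
```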

1 Answer


Ask your client if they care about storing accurate information¹ rather than random strings.

Then set them a series of challenges. First, ensure that the sport information is in the correct "domain". For the member_sports table, that is:

sport_id int not null
         ^
         |--correct type

For their "store everything in a varchar column" solution, I guess you're writing a CHECK constraint. A regex would help here, but there's no native regex support in SQL Server - so you're either bodging it with LIKE patterns or calling out to a CLR function to make sure that only actual int values are stored.
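As a sketch of how awkward even the bodged version gets, assuming SQL Server 2016+ (for STRING_SPLIT) and a hypothetical column favourite_sports - all names here are illustrative:

```sql
-- Reject the string if any comma-separated piece is not
-- 'x' followed by digits only.
CREATE FUNCTION dbo.fn_IsValidSportsCsv (@csv VARCHAR(200))
RETURNS BIT
AS
BEGIN
    IF EXISTS (
        SELECT 1
        FROM   STRING_SPLIT(@csv, ',')
        WHERE  value NOT LIKE 'x[0-9]%'
           OR  SUBSTRING(value, 2, 200) LIKE '%[^0-9]%'
    )
        RETURN 0;
    RETURN 1;
END;
GO

ALTER TABLE members
ADD CONSTRAINT CK_Members_Sports_Format
CHECK (dbo.fn_IsValidSportsCsv(favourite_sports) = 1);
```

And that only checks the *shape* of the string, which the declarative `int not null` gives you for free.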

Next, we not only want to make sure that the domain is correct but that the sports are actually defined in your system. For member_sports, that's:

CONSTRAINT FK_Member_Sports_Sports FOREIGN KEY (Sport_ID) references Sports (Sport_ID)

For their "store everything in a varchar column" solution, I guess this is going to be a far more complex CHECK constraint using UDFs to query other tables. It's going to be messy and procedural. Plus, if you want to prevent a row from being removed from sports while it's still referenced by any member, you're talking about a trigger on the sports table that has to query every row in members².
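A hedged sketch of what that substitute for the foreign key might look like (function name and schema are assumptions; SQL Server 2012+ for TRY_CAST, 2016+ for STRING_SPLIT):

```sql
-- Re-split the string and probe the sports table for each piece.
CREATE FUNCTION dbo.fn_AllSportsExist (@csv VARCHAR(200))
RETURNS BIT
AS
BEGIN
    IF EXISTS (
        SELECT 1
        FROM   STRING_SPLIT(@csv, ',') AS s
        WHERE  NOT EXISTS (
                   SELECT 1
                   FROM   dbo.sports
                   WHERE  sport_id = TRY_CAST(SUBSTRING(s.value, 2, 10) AS INT)
               )
    )
        RETURN 0;
    RETURN 1;
END;
```

Unlike a real foreign key, nothing re-validates this when a row is deleted from sports - hence the trigger mentioned above.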

Finally, let's say that it's meaningless for the same sport to be recorded for a single member multiple times. For member_sports, that is (if it's not the PK):

CONSTRAINT UQ_Member_Sports UNIQUE (Member_ID,Sport_ID)

For their "store everything in a varchar column" it's another horrifically procedural UDF called from a CHECK constraint.
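For illustration, a sketch of that UDF too (name hypothetical; a function is needed because SQL Server CHECK constraints can't contain subqueries directly):

```sql
-- Return 0 if any split value appears more than once.
CREATE FUNCTION dbo.fn_HasNoDuplicateSports (@csv VARCHAR(200))
RETURNS BIT
AS
BEGIN
    IF EXISTS (
        SELECT value
        FROM   STRING_SPLIT(@csv, ',')
        GROUP BY value
        HAVING COUNT(*) > 1
    )
        RETURN 0;
    RETURN 1;
END;
GO

ALTER TABLE members
ADD CONSTRAINT CK_Members_NoDuplicateSports
CHECK (dbo.fn_HasNoDuplicateSports(favourite_sports) = 1);
```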

Even if the varchar variant performed better for certain values of "performs better" (unlikely, since you need to rip strings apart, and T-SQL's string manipulation functions are notoriously weak - see above re: regex), how do they propose to keep the data meaningful rather than nonsense?

Writing the procedural variants that can also cope with nonsense is an even more challenging endeavour.


In case it's not clear from the above - I am a big fan of Declarative Referential Integrity (DRI). Stating what you want, rather than focussing on mechanisms, is a huge part of why SQL appeals to me. You construct the right DRI and know that your data is always correct (or, at least, as you expect it to be).


¹ "The application will always do this correctly" isn't a good answer. If you manage to build an application and related database in which nobody ever writes some direct SQL to fix something, I guess you'll be the first.

But in most circumstances, there's always more than one application, and even if the other application is a direct SQL client only employed by developers, you're already beyond being able to trust that the application will always act correctly. And bugs in applications are far more likely than bugs in SQL database engines' implementations of constraints, which have been exercised far more than any individual application's attempt to enforce them.


² Let alone the far more likely query - finding all members who are associated with a particular sport. A second index on member_sports makes this a trivial query³. No index helps the "it's somewhere in this string" solution: you're looking at a full table scan with no indexing opportunities.


³ Any index that has sport_id first should be able to satisfy such a query.
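For example (index and table names illustrative):

```sql
-- With sport_id leading, "members for a given sport" becomes an
-- index seek; member_id in the key makes the index covering.
CREATE INDEX IX_Member_Sports_Sport
ON member_sports (sport_id, member_id);

SELECT member_id
FROM   member_sports
WHERE  sport_id = 4;   -- e.g. swimming
```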

Damien_The_Unbeliever