
We have 10 years of archived sports data, spread across separate databases.

I'm trying to consolidate all of the data into a single database. Since we'll be handling 10X the number of records, I'm making schema redesign changes now to avoid a potential performance hit.

One change entails breaking up the team roster table into two tables: a players table that stores fixed data (playerID, firstName, lastName, birthDate, etc.), and a new roster table that stores variable, per-season data about a player (yearInSchool, jerseyNumber, position, height, weight, etc.). This will allow us to, among other things, create 4-year career aggregate views of player stats.
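
To make the split concrete, here's a rough sketch of the two tables (column types are guesses, and the season column on roster is an assumption to tie each row to a year):

    CREATE TABLE players (
        playerID   INT UNSIGNED NOT NULL AUTO_INCREMENT,
        firstName  VARCHAR(50)  NOT NULL,
        lastName   VARCHAR(50)  NOT NULL,
        birthDate  DATE,
        PRIMARY KEY (playerID)
    ) ENGINE=InnoDB;

    CREATE TABLE roster (
        rosterID     INT UNSIGNED NOT NULL AUTO_INCREMENT,
        playerID     INT UNSIGNED NOT NULL,  -- fixed data lives in players
        season       YEAR NOT NULL,          -- assumed: one roster row per player per season
        yearInSchool TINYINT,
        jerseyNumber TINYINT,
        position     VARCHAR(10),
        height       SMALLINT,               -- e.g. inches
        weight       SMALLINT,               -- e.g. pounds
        PRIMARY KEY (rosterID),
        KEY idx_roster_playerID (playerID),
        FOREIGN KEY (playerID) REFERENCES players (playerID)
    ) ENGINE=InnoDB;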

Fair enough, makes sense, but then when I look at queries that tally, for example, a player's aggregate scoring stats, I have to join both the player and roster tables, in addition to the scoring and schedule tables, to get all the information needed.
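
For example, a career scoring tally ends up looking roughly like this (the scoring and schedule column names here are placeholders, not our actual schema):

    -- Career scoring totals for one player: player + roster + scoring + schedule
    SELECT p.playerID,
           p.firstName,
           p.lastName,
           SUM(s.points)                AS careerPoints,
           COUNT(DISTINCT sched.gameID) AS gamesPlayed
    FROM   players  p
    JOIN   roster   r     ON r.playerID = p.playerID
    JOIN   scoring  s     ON s.rosterID = r.rosterID   -- assumed: scoring rows hang off roster
    JOIN   schedule sched ON sched.gameID = s.gameID   -- assumed: game info lives here
    WHERE  p.playerID = 12345
    GROUP  BY p.playerID, p.firstName, p.lastName;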

Where I'm considering denormalizing is with the player's first and last name. If I store first and last name in the roster table, then I can omit the player table from the equation for stat queries, which I'm assuming will be a big performance win given that the total record count per table will be over 100K (i.e. most query joins will be across tables that each contain at least 100K records, and, for now, up to 300K).
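
Concretely, the denormalized version would be something like this (again, a sketch with the assumed names from above):

    -- Duplicate the name columns onto roster...
    ALTER TABLE roster
        ADD COLUMN firstName VARCHAR(50),
        ADD COLUMN lastName  VARCHAR(50);

    -- ...so the stat query no longer touches players at all
    SELECT r.playerID,
           r.firstName,
           r.lastName,
           SUM(s.points) AS careerPoints
    FROM   roster   r
    JOIN   scoring  s     ON s.rosterID = r.rosterID
    JOIN   schedule sched ON sched.gameID = s.gameID
    WHERE  r.playerID = 12345
    GROUP  BY r.playerID, r.firstName, r.lastName;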

So, where to draw the line with denormalization in this case? I assume duplicating first and last name is OK. Generally I enjoy non-duplication/integrity of data, but I suspect site visitors enjoy performance more!

virtualeyes
  • Hard to say, but somewhere in there you had a query that needed a join, and then you went all denormalize. Remember that a join (on an index of course) is LIGHTNING QUICK; a simple btree on a million rows is ready in under 10 steps, so no worries. You should look for points where you _do_ have worries. As long as your queries can use keys, you've got no problem? – Nanne Apr 26 '12 at 13:28
  • @Nanne I have yet to go all denormalize on that sh*t, am only considering doing so ;-) I agree re: looking for points where I do have worries. Would you say that 10 databases spread across 10 years is a maintainable solution? That is my primary concern right now, consolidating into a single database; secondarily is addressing potential performance hits, but it sounds like MySQL is wonderfully up to the task when dealing with the small-ish record sets I have... – virtualeyes Apr 26 '12 at 13:35
  • If your PlayerID is a primary key and your indexes on the other table are alright, then why do you need denormalization? 300K rows is not an awful lot. – Martin Apr 26 '12 at 13:35
  • sure, player id column is pkey to roster playerID column fkey, that's a given ;-) I was more concerned about multi-table joins where each table has over 100K records. For example, to generate a season composite schedule, we have to join across 8 tables; that's fine when each season is split up under a separate DB, not so sure when 10 seasons are combined into one (see the EXPLAIN sketch after these comments) – virtualeyes Apr 26 '12 at 13:38
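
A quick way to sanity-check the "queries can use keys" point, against the schema sketched above (the season column and the 2011 value are just placeholders):

    -- Verify the player/roster join uses keys rather than scanning
    EXPLAIN
    SELECT r.*, p.firstName, p.lastName
    FROM   roster  r
    JOIN   players p ON p.playerID = r.playerID
    WHERE  r.season = 2011;
    -- Expect eq_ref access on players via its primary key; the roster side
    -- can use an index on season if one exists. A keyed join stays fast
    -- even at 100K-300K rows per table.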

1 Answer


First thought is, are you sure you've exhausted tuning options to get good SELECT performance without denormalising here?

I'm very much with you in the sense of "no sacred cows" and denormalise when necessary, but this sounds like a case where it shouldn't be too hard to get decent performance.

Of course you guys have done your own exploration; if you've ruled that out, then my personal opinion is that it's acceptable, yeah.

One issue - what happens if a player's name changes? Can it do so in your system? Would you use a transaction to update all roster details in a single COMMIT operation? For a historical records DB this could be totally irrelevant, mind you.
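
For example, in the denormalized design a name correction would need something like this (just a sketch, assuming the duplicated name columns live on the roster table):

    -- Keep the players row and every duplicated roster copy consistent
    START TRANSACTION;

    UPDATE players
    SET    lastName = 'Smith-Jones'
    WHERE  playerID = 12345;

    UPDATE roster
    SET    lastName = 'Smith-Jones'
    WHERE  playerID = 12345;

    COMMIT;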

Brian
  • +1, @Brian feedback thus far is YAGNI, stay normalized, MySQL is a beast. Just irks me a bit to make that extra join to get data that currently, in the DB per year setup, lives all in 1 table, not 2. re: name change in denormalized setup, that's the kind of thing you have to account for and another pro-normalization vote. – virtualeyes Apr 26 '12 at 13:43
    if it's for convenience, rather than performance, you could create and use a *view* for your day-to-day usage? E.g. create V_ROSTER_DATA that does the join between T_ROSTER_DATA and T_PLAYER. This can take a little getting used to if you work exclusively with plain old tables, but it's very useful and you can quickly build up a good library of such convenience views – Brian Apr 26 '12 at 14:50
  • yes, have been thinking about taking the view route. One problem with views is that non-DBAs will be utterly puzzled as to where this "magical" query data is coming from. Moving from the LAMP stack to the JVM I have grown fond of static typing and explicit intent; views would move away from that, but they clearly reduce boilerplate, among other things. For the time being I'll stay normalized; the extra join code-wise is no big deal provided there's little-to-no performance hit – virtualeyes Apr 26 '12 at 15:31
  • @Brian, thought this through a bit more and, well, magic can be useful ;-) views solve a difficult problem I've been trying to work around with the statically typed ORM we're using (ScalaQuery), where db-specific functionality/specialization is often not supported. With views I can trick the ORM into selecting columns that I specify freely in the view while retaining static, functional queries with entity-mapped results at the ORM level (a minimal view sketch follows below). – virtualeyes Apr 26 '12 at 18:31
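
A minimal sketch of the view approach discussed above, using the example names from the comment (T_ROSTER_DATA, T_PLAYER, V_ROSTER_DATA) and assuming columns like those in the question:

    CREATE VIEW V_ROSTER_DATA AS
    SELECT r.rosterID,
           r.playerID,
           p.firstName,
           p.lastName,
           r.season,
           r.yearInSchool,
           r.jerseyNumber,
           r.position,
           r.height,
           r.weight
    FROM   T_ROSTER_DATA r
    JOIN   T_PLAYER      p ON p.playerID = r.playerID;

    -- Day-to-day queries (or the ORM mapping) can then treat it like one table:
    SELECT firstName, lastName, jerseyNumber
    FROM   V_ROSTER_DATA
    WHERE  season = 2011;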