2

Question

I am wondering how to design a relational database for NBA games.

There are several similar questions in StackOverflow but no one gives a satisfying answer.

Star Schema database for NBA/Basketball statistics

Database for Basketball teams/games/scores per season

Database Design NBA

Apparently the game data is interesting for basketball fans and tons of websites provide these statistics.
Also, video games for basketball, football, ... should maintain similar data too.

Is there a standard/recommended way to design these databases?

Example

The solution that comes into my mind looks like:

Team and Game

Country (CountryCode pk, CountryName)

City (Country pk fk, CityCode pk, CityName)

Team (TeamCode pk, TeamName, City fk)

Person (Name pk, Birthday pk, Height, Weight, HomeTown fkCity)

Employment (Employee pk fkPerson, JoiningDate pk, TeamCode fk2, DepartureDate, Salary, JerseyNumber)

Game (City pk fk1, Date pk, HostTeam fk2, VisitingTeam fk3)

GamePlayer (Game pk fk, Player pk fkEmployee, Position)

Game events

Shooting (Game pk fk1, TimeOfGame pk, Player fk2, ShootingType, Missed)

Rebound (Game pk fk1, TimeOfGame pk, Player fk2, IsOffensive)

Block (Game pk fk1, TimeOfGame pk, Blocker fk2, Blocked fk3)

Steal (Game pk fk1, TimeOfGame pk, Stealer fk2, Stealed fk3)

Other game events...

Another way is to store only the stats (derived data) for each game.

GameStats (Game pk fk1, Player pk fk2, Minutes, FieldGoal, FieldGoalMissed, ThreePointMade, ThreePointMissed, FreeThrow, FreeThrowMissed, DefensiveRebounds, OffensiveRebounds, Blocks, ...)

Historical Data

One difficulty I met for this data model is that players can change teams, change jersey numbers so we need to store historical data.

The events (Shooting, Rebound, Timeout, Substitution, ...) in a game are historical data too.

Derived Data

Another difficulty I met is to decide whether to store derived data.

Since derived data causes Update Anomaly, we should avoid them. However, I find it difficult to avoid derived data completely.

For example

  1. The game statistics are derived data. In order to avoid them, we have to store the events.
  2. The minutes a player played for a game is derived from the events (Substitution/Timeout/...).
  3. Some statistics are related, Turnover is derivable from Steal/Block/...
  4. For a game event, the game time can be derived from the time of day and other events (Substitution/Timeout/...).

Shall we make a compromise and store derived data for simplicity?
Does a truly normalized database contain derived data?


Edit 1 - Neutral Stadium

To account for neutral stadiums where no team is the home team, we can use the following design.

Stadium (City pk, Name pk, Capacity)

Game (Stadium pk fk, Date pk)

TeamRole (TeamRole pk) ['Home', 'Away', 'Neutral']

GameTeam (Game pk fk, Team pk fk, TeamRole pk fk)

Community
  • 1
  • 1
dzhu
  • 793
  • 1
  • 9
  • 17
  • 1
    Normalization is about keeping a relational schema DRY. Design needs to capture all the details you're interested in. Since trades and free agency are a fact of life, I'd say that any schema that does not include the element of time is leaving out a key feature of the problem. – duffymo Jul 11 '15 at 14:07
  • I'm currently building an NBA app that 'scrapes' from nba.com - it can be found at http://www.github.com/jemagee/git. It's still a work in progress but you can see my schema. Tables for play by play and shot location are not done as of yet because I'm still learning how to write complex queries - but you should be able to look at the schema file and figure out my models (I don't know how to draw the models myself) – MageeWorld Jul 05 '16 at 22:57

1 Answers1

4

It's impossible to give a complete answer without knowing all the details, but I can give you some pointers that will make your life easier.

In your description, tables Country, City and Team seems fine, but I don't understand why you need the City fk in the Person's table (unless it's city of birth, but then the column name should make that clear).

Employment

Now, about the Employment table - My suggestion is a little different:
First, you need to decide if you want to keep only the players or all game related stuff members of the team (such as team coach) in that table.
I would suggest the second option, meaning that you need to keep another table for job types (i.e player, coach, assistance coach etc.) and have an fk to it in the employment table.

Historical data

As for the historical data problem - there are some rules you want to take into consideration: for example, if your player is currently employed in a team, he probably can't be employed in another team at the same time, However, he can be a part of his countries national team while employed in a regular team. To cope with that, you will have to create some business logic rules in the forms of check constraints and perhaps some triggers to keep data integrity.
Coping with the teams change is quite easy, since you have the player's joining date and departure date for the team - all you need is to make sure that there is no overlapping between employments for any player in regular teams (remember the national team is an exception).

Changes while the player is employed in a team

To cope with the changes in jersey number, or whatever data that might be changed while the player is employed by the team, I suggest adding a table for employment details, that will be connected to the employment table, have a start date and end date and keep all the data that might change while the player is employed in a team. Again, you will need to make sure records do not overlap (date wise) for each employment id.

Active players

The active players table should be calculated per game, so it should have the game id as an fk. I suggest a simply table with gameId, teamId and playerId, while the primary key is gameId+PlayerId (I've left the teamId out of the primary key since it's another layer of protection in case the employment data is messed up and a player is considered employed in both teams on the date of the game)

Games

The Game table should have a primary key of game date + home team + away team. Note that games might be held on neutral stadiums, so you would want to test for games where the home team and away team are flipped before inserting new game data in your table. Also, you might want to keep a stadium table that will have an fk to the city and keep the stadium id in the games table instead of the city id.

Game statistics

You should strive to keep a single table to describe all events in the game. This table should of course have an fk to the games table and an fk to the teams table. From your question I understand that the statistics is done on the player level, so you also want to keep an fk to the players table.
Add a table for statistics details description and an fk to that table in the games statistics table. Basically, it should look like this:

StatisticsDetails (Detail_Id, Detail_Name)
and should hold data such as shoot, pass, rebound etc.

In you game statistics table you might need to log events that involves several players. there are a few ways to do it, I suggest to keep it as simple as possible: keep PlayerID1, PlayerId2, PlayerId3 in the table, and have all of them nullable except the first one.

Zohar Peled
  • 79,642
  • 10
  • 69
  • 121
  • Thanks for your answer. It's very insightful. a) City in Person is intended to represent HomeCity. b) Adding a JobType table is a good idea. c) The table name ActivePlayer is not clear. It's better called Roster. Some players may be included in the roster but not involved in a game. – dzhu Jul 14 '15 at 04:41
  • d) I think either Date + HomeTeam or Date + AwayTeam can identify a game, **is Date + HomeTeam + AwayTeam superfluous**? Good idea to use a fk to Stadium instead of City. – dzhu Jul 14 '15 at 04:45
  • e) For game statistics, **is your *StatisticsDetails* too general**? There would be a lot of **NULL**s. Also, PlayerId1, PlayerId2, PlayerId3 seem like **Repeating Groups**. – dzhu Jul 14 '15 at 04:47
  • f) I didn't thought of *neutral stadiums*, but it's a valid case. For neutral stadiums, there are no home or away teams. That means **both teams are equally important** - none of them is special. However, in relational databases, **different columns serve different roles**. ... – dzhu Jul 14 '15 at 05:05
  • For example, *(Col1: 'Spurs', Col2: 'Heat')* and *(Col1: 'Heat', Col2: 'Spurs')* are two different tuples, no matter what Col1 and Col2 mean. Return to the neutral stadium case, if 'Spurs' plays against 'Heat', we can store *('Spurs', 'Heat')* or *('Heat', 'Spurs')* as *(HomeTeam, AwayTeam)*. Both are ok but we can't store both. Shall we use a **constraint** to check that? It seems like **Repeating Groups** too. – dzhu Jul 14 '15 at 05:07
  • The Comments are probably not the best platform for this discussion, but I'll try to answer anyway: **a)** don't call it city, call it hometown. you'll thank me 2 years from now for that. **b)** note that different employment roles means different data to keep. i.e a coach has no jersey. **c)** all players that are active in the team on the date of the game should be in the Roster table. (I've done something similar for soccer, my table was called GameTeamPlayer). the point is to take a snapshot of the team stuff members on the day of the game. – Zohar Peled Jul 14 '15 at 05:22
  • **d)** a game is defined by both teams and a date. I know that in soccer a team is only allowed to play a single game on any specified date, and I'm guessing NBA rules also don't allow 2 games in the same date for any team, but keep in mind neutral stadiums... **e)** again, this comes from my experience with soccer. Keeping the statistics in a flat table (with default values or nulls in columns that are not used) provides a faster and simpler select statements. You can normalize it if you want, but it will make your life harder, and with no real benefit (storage is so cheep these days...) – Zohar Peled Jul 14 '15 at 05:27
  • **f)** neutral stadiums are valid, but the number of games held in neutral stadiums is almost negligible. I didn't find it a sufficient reason to change the *HomeTeamId* and *AwayTeamId* columns to something else. In my case, I was getting the games data from different providers, so I have an import stored procedure that checks both teams and the date, and when I had an existing record with reversed teams I just added a bit flag to indicate both records needs a manual check as they are duplicates. On that note, you might want to add a status column to your games table. – Zohar Peled Jul 14 '15 at 05:33
  • Thanks again. I think your design definitely works and is reasonably simple. But I am not sure whether further normalization is a better idea. I prefer normalization, which eliminates redundant data, prevents update anomaly, but results in more smaller tables. – dzhu Jul 16 '15 at 04:53
  • Again, I'm writing from my experience. I was working for a company that collected statistics details for soccer, and having the data table rows matching the grid rows the analysts had in the application was an easy solution. I agree that in most cases, normalization is desired, but if I did normalize the statistics details table I would have had to work a lot harder to create the analysts application, and to create the reports sent to the company's clients, seems to me that the cost of keeping default values for each column that was not needed was minor compared to the cost of development... – Zohar Peled Jul 16 '15 at 05:25
  • btw, If an answer solves your problem you should accept it so that other people know the problem is solved... – Zohar Peled Jul 16 '15 at 05:26