Documentation

This documentation will be expanded during Q1 2008. In the meantime, the notes below tackle some of the main design aspects of SportsDB. And the schematic database diagram and generated table- and field-level documentation found on the Specification Page are also helpful.

As can be seen on the Schematic Diagram, tables that are closely related to one another are boxed together as groups. Those groups that are applicable to any and all sorts of sports run down the left side of the diagram, and those that are specific to a particular sport are presented down the right side. This distinction between "core" and "specific" properties is also made in the companion SportsML standard. Also as with SportsML, even the specific-sport tables and fields are constructed in a consistent manner, so that once software is written for one sport, it is readily adaptable to work for other sports.

Core Groups

Documents Group

In many architectures, a SportsDB gets populated by a piece of middleware that processes a syndicated sports data file, preferably in XML, and even more preferably in SportsML. Of course, this is not the only way that a SportsDB could get filled with information. Custom "CRUD" systems (CReate/Update/Delete) can be built to load data directly into various tables. In such cases, tables in the documents group would likely remain untouched.

But for those cases in which SportsML messages come through and get parsed and loaded into SportsDB, then tables within the documents group hold all the info about each file, including filename, file path, and metadata.

Selected Tables from Documents Group
documents One row per document coming down the feed.
document_contents Stores contents of the abstract (if any), and Path to the full SportsML
teams_documents, persons_documents, affiliations_documents, document_fixtures Stores metadata about each document
events_documents Stores the event-key (unique game ID) for any event listed in each document
latest_revisions Stores the latest doc-id for any given revision-id (allowing you to track the latest version of a document)
Selected Fields in the Documents Table
doc_id The internal string from the document that purports to be its global, unique key
date_time The datetime that this document was published
title A displayable title for search results
language The language that prose within the document was published in.
publisher_id The publisher of the document
document_fixture_id Very important. Used for document type.
db_loading_date_time When the document was loaded into SportsDB

Events Group

The SportsML term for a game or a match -- any competition that has a winner -- is an "event." This group contains tables that describe the event and denote its participants.
Selected Tables from Events Group
events One row per event.
participants_events One row per participant -- could be a team, as with basketball games, or a player, as with a typical golf match.
periods One row per participant per period/inning/quarter. Includes the subscore for that participant (e.g., the number of points scored by the New York Knicks in the 2nd Quarter of a basketball game)
sub_periods One row for each sub-period in the period. For example, the number of points scored by Pete Sampras in the 4th game of the 5th set of a tennis match)
affiliations_events The league(s) this event is a part of.
events_sub_seasons The sub_season (and by extension, season) that this event is a part of.
Selected Fields from Events Table
event_key Unique for any given publisher-key. All event_keys for the entire season are generally mapped-out ahead of time.
site_id The location the event is played at.
start_date_time Time the game is/was scheduled to start
site_alignment Whether the site is the home-team's site, or is a neutral site
event_status Could be pre-event, mid-event, post-event, postponed, cancelled, etc.,
last_update The date_time of the most recent document used to update the properties, scores, and/or stats of this event.

Reference Groups: Person, Team, Affiliation, Season, etc.

(info to come.)

IDs and Keys

The id field

SportsDB allows for id fields to be assigned to most tables. These id's are intended to be database-assigned sequence numbers, as opposed to commonly shared "keys". Hence, if SportsDB user Bill loads a SportsML file into his database with stats on Mike Piazza (player-key piazza01), then his system may generate an id for that player of 903112 (because that's the "next free id value" on his system. However, if Jane loads in the same file into her DB, her system may give the same player an id of 350884.

While this system of "Local id's" may seem to complicate the process of transferring whole database from one computer to another, it's accepted and common DB design practice to use such system-dependent id's to act as foreign keys linking one table to another.

This frees the software from having to worry about generating complex unique codes, or combining separate fields in order to cross-link tables.

In practice, migrating tables to another database is simple, either by keeping the id's as they are, or by generating them anew in the new database tables. They are meant to be local to a specific database instance, and should not be used in public-facing identifiers such as URLs.

The *_key fields

The commonly shared version of player identifiers (e.g., "piazza01"), known in SportsML parlance as the player-key, also gets stored in SportsDB, but should not be used for cross-linking tables.

It's a common headache in sports data publishing that the world hasn't yet agreed upon a single universal and unique identification system for people (preferably assigned at birth, and barcoded across a person's chest :-).

Universal Alias System

SportsDB includes a Universal Alias System for person-keys, team-keys, and other keys to identify when two publishers use different keys to refer to the same player, team, etc. As an example:

Observe these two entries in the "persons" table, offering two different player-keys for George Brett, one from one publisher (Stats, LLC, publisher_id = 1), and one from another (Retrosheet, publisher_id = 2)

idpublisher_idperson_key
98121 (id for stats.com)147
103912 (id for retrosheet.org)gbrett01

A "key_aliases" table would then have two rows corresponding to these two rows in the "persons" table:

idkey_idkey_root_id
58019812450
580210391450

Note that these two rows share a common key_root_id value. That's what tells you that these two rows are providing aliases for the exact same entity. Which type of entity is it? (e.g., is it a person? is it a team? a league?) Look in the "key_roots" table to find out:

idkey_type
450persons

So here we have an example of cross-table polymorphism. Usually, with SportsDB, the *_type and *_id fields are companions in the very same table. The *_type field tells you which table to join with, and the *_id field tells you which row in that table. But in the spirit of reducing internal database redundancy (and reducing an opportunity for error), the key_type field is placed in the key_roots table.

How could SportsDB get populated with key aliases information? People could publish Equivalency Tables that look like this:

key_typestats.comretrosheet.orgsportsnetwork.com... etc etc
player147gbrett01l.mlb.com-p.4125

Note that this isn't a database table... it's just a tab-delimited file that people could maintain using spreadsheets (or they could export it from some other data storage system).

The SportsCodes.org initiative, a companion to SportsDB, aims to be a wiki-esque Player ID Equivalency maintainer of sorts, one that would export updates in this format.

A SportsDB Loading Program could know how to read these Equivalency Tables, and know when to do an UPDATE of rows in "key_aliases" and "key_roots", verses when to do an INSERT of a new row. As far as DELETES go, a SportsDB user would likely just need a "clear things out" function that deletes ALL rows for a particular publisher_key (optionally narrowed down by key_type).

Statistics

  • It's a three-fold polymorphic table join! (or something like that)
  • To find a Player's stats for one game (note: this is pseudo-sql):
    • select stats.stat_repository_type, stats.stat_repository_id from stats where:
      • stats.stat_holder_type = person and stats.stat_holder_id = (that player's person.id)
      • stats.stat_coverage_type = event and stats.stat_coverage_id = (that event's events.id)
    • The resulting stats.stat_repository_type field tells you the table name wherein a row of statistics can be found (e.g., baseball_offensive_stats)
    • The companion stats.stat_repository_id matches an id value in that table (e.g., the baseball_offensive_stats.id field)
    • Now just select * from [$stats.stat_repository_type] where [$stats.stat_repository_type].id = [$stats.stat_repository_id]
  • To find a Team's stats for the regular season of some year:
    • select stats.stat_repository_type, stats.stat_repository_id from stats where:
      • stats.stat_holder_type = team and stats.stat_holder_id = (that team's teams.id)
      • stats.stat_coverage_type = sub_season and stats.stat_coverage_id = (that regular season's sub_seasons.id)
    • The resulting stats.stat_repository_type field tells you the table name wherein a row of statistics can be found (e.g., baseball_offensive_stats)
    • The companion stats.stat_repository_id matches an id value in that table (e.g., the baseball_offensive_stats.id field)
    • Now just select * from [$stats.stat_repository_type] where [$stats.stat_repository_type].id = [$stats.stat_repository_id]
  • To find the teams involved in an event:
    • select teams.team_key display_names_full_name, particpants_events.alignment, participants_events.score from display_names, participants_events,teams where:
      • participants_events.event_id = events.id
      • events.event_key = (event key for game)
      • display_names.entity_id = participants_events.participant_id
      • display_names.entity_type = 'teams'
      • teams.id = participants_events.participant_id
    • The results the full name and team key gives you the full name of each team that participated in that event
    • You also get home or away (participants_events.alignment) and the score for each team (participants_events.score)
    • This query is an example of using a polymophic join. Whenever you are joining on a field like table.*_id you have to also join on table.*_type field as well. If you don't do that then you will be pulling in lots of unrelated data. Try the above query but leave out the display_names.entity_type condition from it

Other Notes

  • Use the baseball_action_substitutions table for the original game lineup (sequence_number == 0 means this is a player's starting position)
  • Concept of: "The Z-Field" for extensibility
    • It's OK to invent a new column for any table in SportsDB. Convention is to prefix that column-name with "z_". There's a requirement that Z-Fields go at the end, but it is recommended that SportsDB systems perform updates only via Named Fields, rather than perform insertions using assumptions of what column is, say, the 3rd one in a table.
  • Follow ISO standards for 2-letter country codes (US, BR, UK)
  • Follow ISO standard for language codes from 3066 (generally 5 characters, as in en-US, pr-BR, en-UK)