Loading ...
Sorry, an error occurred while loading the content.
 

Re: [baseball-databank] Data Error Checking & Documentation

Expand Messages
  • Paul Wendt
    ... Yes. I suggest some classification. At the highest level of generality, there are INTERNAL inconsistencies which may indicate errors in the database, and
    Message 1 of 2 , Jan 9, 2003
      9 Jan 2003, kjokbaseball <kjokbaseball@...> wrote:

      > One good point that Cliff Otto raised to me offline is that, in order
      > to find and correct errors in the data, we probably should develop
      > some type of documentation to list known or found data discrepancies
      > with comments about their current status.

      Yes. I suggest some classification.

      At the highest level of generality, there are INTERNAL inconsistencies
      which may indicate errors in the database, and there are EXTERNAL
      maybe-errors where the database does not match some outside source.
      KJ presented an internal inconsistency in the data. For AL1957 pitcher
      strikeouts sum to one number and batter strikeouts sum to another.

      Perhaps the "internal" class should be isolated: KJ undertakes only the
      internal class and data about all internal instances is maintained in a
      single table. (Then the "internal" class is implicit, or explicit only in
      the name of the table, such as error_check_internal.) If so, then one
      or two fields should be used to classify the instance, one or two to
      locate it (maybe using hierarchical rule to list only one side of the
      inconsistency), one or two to record its status, with at least one Comment
      for freeform explanation of the status.


      Where the database includes an official league total or official team
      total, that total might not be equal to the sum of team or player data.
      That is one class of error ("aggregation sum"?), located in one official
      aggregate and the corresponding disaggregated data. The aggregate is
      sufficient to identify the location, leaving the disaggregate implicit.

      Anyway, pitcher sums and batter sums may differ where they should not.
      Catcher sums and baserunner sums may differ where they should not.
      That is another class of error ("dual sum"?), located in two sets of
      disaggregated data. One side is sufficient to identify the location,
      leaving the other side implicit.


      For illustration, one record in table errorcheck_internal would be
      something like

      ID
      1 [autonumber]
      status
      official [known error in official statistics]
      class
      SumD [dual sums are unequal]
      loc_table
      loc_field
      loc_yr
      1957
      loc_*blahblah*
      [more than three fields of "loc" data --location metadata--
      will be used to identify the data with questionable sum.
      These three represent all of the "loc" fields.]
      comment_ID
      KJOK [identify the person who checked this instance and reported
      what appears in comment, who may not be the author of the record,
      the comment, or the table]
      comment
      Official strikeouts for AL1957 sum to 5800 for batters;
      5814 for pitchers.

      Initially, Kevin would locate all inconsistent dual sums and record each
      in this table with

      ID = autonumber
      status = unknown
      class = SumD
      [complete location data]
      comment_ID =
      comment =

      As "unknown" cases are investigated and reported, two comment fields would
      be filled with data and status "unknown" would be replace by status
      "official" and status "oops" or something like that.


      Do the same for inconsistent aggregation sums and record each with

      class = SumA

      Do the same for other classes of internal inconsistency. (If there are no
      other classes, the table should be named something like sum_check.)

      -- P/\/ \/\/t

      Paul Wendt, Watertown MA, USA <pgw@...>
    Your message has been successfully submitted and would be delivered to recipients shortly.