Re: [baseball-databank] Data Error Checking & Documentation
- 9 Jan 2003, kjokbaseball <kjokbaseball@...> wrote:
> One good point that Cliff Otto raised to me offline is that, in orderYes. I suggest some classification.
> to find and correct errors in the data, we probably should develop
> some type of documentation to list known or found data discrepancies
> with comments about their current status.
At the highest level of generality, there are INTERNAL inconsistencies
which may indicate errors in the database, and there are EXTERNAL
maybe-errors where the database does not match some outside source.
KJ presented an internal inconsistency in the data. For AL1957 pitcher
strikeouts sum to one number and batter strikeouts sum to another.
Perhaps the "internal" class should be isolated: KJ undertakes only the
internal class and data about all internal instances is maintained in a
single table. (Then the "internal" class is implicit, or explicit only in
the name of the table, such as error_check_internal.) If so, then one
or two fields should be used to classify the instance, one or two to
locate it (maybe using hierarchical rule to list only one side of the
inconsistency), one or two to record its status, with at least one Comment
for freeform explanation of the status.
Where the database includes an official league total or official team
total, that total might not be equal to the sum of team or player data.
That is one class of error ("aggregation sum"?), located in one official
aggregate and the corresponding disaggregated data. The aggregate is
sufficient to identify the location, leaving the disaggregate implicit.
Anyway, pitcher sums and batter sums may differ where they should not.
Catcher sums and baserunner sums may differ where they should not.
That is another class of error ("dual sum"?), located in two sets of
disaggregated data. One side is sufficient to identify the location,
leaving the other side implicit.
For illustration, one record in table errorcheck_internal would be
official [known error in official statistics]
SumD [dual sums are unequal]
[more than three fields of "loc" data --location metadata--
will be used to identify the data with questionable sum.
These three represent all of the "loc" fields.]
KJOK [identify the person who checked this instance and reported
what appears in comment, who may not be the author of the record,
the comment, or the table]
Official strikeouts for AL1957 sum to 5800 for batters;
5814 for pitchers.
Initially, Kevin would locate all inconsistent dual sums and record each
in this table with
ID = autonumber
status = unknown
class = SumD
[complete location data]
As "unknown" cases are investigated and reported, two comment fields would
be filled with data and status "unknown" would be replace by status
"official" and status "oops" or something like that.
Do the same for inconsistent aggregation sums and record each with
class = SumA
Do the same for other classes of internal inconsistency. (If there are no
other classes, the table should be named something like sum_check.)
-- P/\/ \/\/t
Paul Wendt, Watertown MA, USA <pgw@...>