Re: [baseball-databank] Big multivariate analysis--questions
- 11 Aug 2006, Robert Ehrlich wrote:
> I am about to engage in an analysis of baseball stats. The database
> that I am using is Lahman-52. will be analyzing 1921 to 2004.
> Is this a decent choice for the data?
Based on what I know of the project, I might have guessed 1939 because
in the Batting table I find 26106 records with null GIDP (ground into
double play) among the 28102 records with yearID < 1939. I suppose that
GIDP belongs in the classification of players by type, but others must
suppose the same about CS, IBB, and SF, for which we have null data
for some time after 1939.
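A count like the one above can be reproduced with a short query. The following is only a sketch: the table and column names (Batting, yearID, GIDP, CS, IBB, SF) are taken from this message, but the rows are invented toy data and the helper name null_counts is hypothetical. Point the connection at your own copy of the database to get real numbers.

```python
import sqlite3

# Toy stand-in for the Batting table; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, "
    "GIDP INTEGER, CS INTEGER, IBB INTEGER, SF INTEGER)"
)
rows = [
    ("a01", 1925, None, None, None, None),
    ("b01", 1938, None, 3, None, None),
    ("c01", 1939, 12, 2, None, None),
    ("d01", 1960, 8, 1, 4, 5),
]
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?, ?, ?, ?)", rows)


def null_counts(conn, column, cutoff_year):
    """Return (null records, total records) among rows with yearID < cutoff_year."""
    # The column name is interpolated into the SQL text, so allow known names only.
    if column not in {"GIDP", "CS", "IBB", "SF"}:
        raise ValueError(f"unexpected column: {column}")
    query = f"SELECT SUM({column} IS NULL), COUNT(*) FROM Batting WHERE yearID < ?"
    nulls, total = conn.execute(query, (cutoff_year,)).fetchone()
    # SUM over zero rows yields NULL, so fall back to 0.
    return (nulls or 0, total)


for col in ("GIDP", "CS", "IBB", "SF"):
    print(col, null_counts(conn, col, 1939))
```

Running the same function with several cutoff years is one way to see where each column stops being mostly null.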
Discussion by this group throughout its history has covered the
crucial and painful matters, let me name them, "Null or Zero?" and
"Elegant Variation in the representation of null data". Those matters
mainly plague older records. Check the archive, perhaps searching for
'null', and you will find some corrections to lahman52.
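One way to look at the "Null or Zero?" question in your own copy is to tabulate, year by year, how many values of a column are null, zero, or positive. This is a minimal sketch with invented values; the function name null_zero_profile is hypothetical, and the (year, value) pairs would come from whatever column you are checking.

```python
from collections import defaultdict


def null_zero_profile(records):
    """Map each year to counts of null, zero, and positive values."""
    profile = defaultdict(lambda: {"null": 0, "zero": 0, "positive": 0})
    for year, value in records:
        if value is None:
            profile[year]["null"] += 1
        elif value == 0:
            profile[year]["zero"] += 1
        else:
            profile[year]["positive"] += 1
    return dict(profile)


# Invented (year, value) pairs for a single stat column.
sample = [
    (1950, None), (1950, None),
    (1953, 0), (1953, 0),
    (1954, 0), (1954, 3), (1954, 5),
]
profile = null_zero_profile(sample)
for year in sorted(profile):
    print(year, profile[year])
```

A year that shows only zeros and no positive values at all is a candidate for zeros standing in for missing data, though that still has to be confirmed against the sources.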
Crucial and painful matters aside, there is the problem that nulls in
the database represent both missing data (unknown) and category
anachronisms such as first base on hit by pitch (HBP) before AA1884 and
NL1887. I don't know when between 1887 and 1950 this problem becomes
> In that DB there are columns for singles, doubles, triples, etc. as
> well as at-bats.
> The sum of such columns (including various ways to be out) is not the
> same as the number of at-bats. However the numbers do not include
> decimal points--so I assume that they represent counts.
Yes, although the lahman database includes some derived "averages" too.
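To make the distinction between stored counts and derived "averages" concrete, here is a small sketch. It assumes a layout in which total hits, doubles, triples, home runs, and at-bats are stored as integer counts; that layout and the function names are assumptions for illustration, so check the column names in your own copy. Nulls are passed through as None rather than treated as zero.

```python
def singles(hits, doubles, triples, home_runs):
    """Singles derived from total hits minus extra-base hits; None if any input is null."""
    parts = (hits, doubles, triples, home_runs)
    if any(p is None for p in parts):
        return None
    return hits - doubles - triples - home_runs


def batting_average(hits, at_bats):
    """Hits per at-bat; None when either count is null or there are no at-bats."""
    if hits is None or at_bats is None or at_bats == 0:
        return None
    return hits / at_bats


print(singles(150, 30, 5, 20))
print(batting_average(150, 500))
print(batting_average(0, 0))
```

Recomputing an average from the counts and comparing it with any stored value is a cheap consistency check before a multivariate analysis.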