Re: [baseball-databank] Big multivariate analysis--questions

  • Paul Wendt
    Message 1 of 21 , Aug 16, 2006
      11 Aug 2006, Robert Ehrlich wrote:

      > I am about to engage in an analysis of baseball stats. The database
      > that I am using is Lahman-52. I will be analyzing 1921 to 2004.
      >
      > Is this a decent choice for the data?

      Based on what I know of the project, I might have guessed 1939 because
      in the Batting table I find 26106 records with null GIDP (ground into
      double play) among the 28102 records with yearID < 1939. I suppose that
      GIDP belongs in the classification of players by type, but others must
      suppose the same about CS, IBB, and SF, for which we have null data
      for some time after 1939.
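
A minimal sketch of how such a count could be reproduced, assuming the Batting table has been loaded into SQLite. The helper name `null_counts` and the toy rows are hypothetical and only illustrate the query; they are not taken from lahman52.

```python
import sqlite3

# Columns this sketch allows; column names cannot be bound as SQL parameters,
# so they are checked against a whitelist before being placed in the query.
ALLOWED_COLUMNS = {"GIDP", "CS", "IBB", "SF", "HBP"}

def null_counts(conn, column, cutoff_year):
    """Return (null_rows, total_rows) for `column` among Batting rows
    with yearID < cutoff_year."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    query = (
        f"SELECT SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), COUNT(*) "
        "FROM Batting WHERE yearID < ?"
    )
    nulls, total = conn.execute(query, (cutoff_year,)).fetchone()
    # SUM over zero rows yields NULL, so fall back to 0.
    return (nulls or 0, total)

# Toy data (hypothetical rows, not real records).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, GIDP INTEGER, CS INTEGER)"
)
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?, ?)",
    [
        ("a", 1921, None, None),
        ("b", 1935, None, 3),
        ("c", 1938, 12, None),
        ("d", 1939, 9, None),
        ("e", 1950, 0, 2),
    ],
)

print(null_counts(conn, "GIDP", 1939))
print(null_counts(conn, "CS", 1951))
```

Running the same function over a range of cutoff years, one column at a time, would show where each statistic stops being mostly null.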

      Discussion by this group throughout its history has covered the
      crucial and painful matters, let me name them, "Null or Zero?" and
      "Elegant Variation in the representation of null data". Those matters
      mainly plague older records. Check the archive, perhaps searching for
      'null', and you will find some corrections to lahman52.

      Crucial and painful matters aside, there is the problem that nulls in
      the database represent both missing data (unknown) and category
      anachronisms such as first base on hit by pitch (HBP) before AA1884 and
      NL1887. I don't know when between 1887 and 1950 this problem becomes
      trivial.
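
To make the distinction concrete, here is a rough sketch that labels a null HBP value as either a category anachronism or missing data, using only the two cutoffs named above (AA 1884 and NL 1887). The function name, the league codes as plain strings, and the three labels are assumptions made for illustration; other leagues and other statistics would need their own cutoffs.

```python
# First season in which a hit batsman was awarded first base,
# per the cutoffs given in this message. Other leagues are not covered.
HBP_FIRST_SEASON = {"AA": 1884, "NL": 1887}

def classify_null_hbp(league, year):
    """Label a null HBP value for the given league and season."""
    first_season = HBP_FIRST_SEASON.get(league)
    if first_season is None:
        return "unclassified"    # no cutoff known to this sketch
    if year < first_season:
        return "not applicable"  # the category did not exist yet
    return "missing"             # the category existed; the count is unknown

print(classify_null_hbp("NL", 1886))
print(classify_null_hbp("NL", 1887))
```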

      > In that DB there are columns for singles, doubles, triples, etc. as
      > well as at-bats.
      >
      > The sum of such columns (including various ways to be out) is not the
      > same as the number of at-bats. However the numbers do not include
      > decimal points--so I assume that they represent counts.

      Yes, although the Lahman database includes some derived "averages" too.

      Paul Wendt