Re: [baseball-databank] Re: Access and BDB

  • Mat Kovach
    Message 1 of 21, May 18 9:23 AM
      P Mondout wrote:

      > How cool would it be if we could come up with simple step by step
      > instructions (for MySQL, SQL Server, and Oracle, for example) for
      > importing Retrosheet and BDB data into these tools along with pointers
      > to similar step by step instructions to installing those databases and
      > their new front-ends themselves?

      I am presently writing an article that shows how to get the BDB
      information into PostgreSQL and load the Retrosheet data into MySQL and
      PostgreSQL. This is for a project I am working on
      (http://fungoes.mek.cc). The article will be Linux/UNIX specific but
      I'm sure some smart Windows person could do the same thing.

      > As to the question of Open Office's usability, I have never used it. I
      > have no reason to think Open Office's DB is anything less than first
      > rate.

      I have not loaded the data into Open Office, but I have used Open Office
      to connect to MySQL and PostgreSQL databases loaded with the data. It
      worked fine for me.

      Mat Kovach
    • Ben Matasar
      Message 2 of 21, May 18 10:40 AM
        > I am presently writing an article that shows how to get the BDB
        > information into PostgreSQL and load the Retrosheet data into MySQL and
        > PostgreSQL. This is for a project I am working on
        > (http://fungoes.mek.cc). The article will be Linux/UNIX specific but
        > I'm sure some smart Windows person could do the same thing.

        I think another option is SQLite[1], which is very fast and perfect
        for small queries. I did the conversion a couple months back, and
        have an SQLite schema on my website:

        http://matasar.org/blog/baseball/sqlite?showcomments=yes

        I highly recommend SQLite for smallish scripting language queries --
        there's almost no configuration necessary.

        1: http://www.sqlite.org/

        Ben
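A minimal sketch of the kind of SQLite load described above, using only Python's standard library. The table layout, column subset, and sample rows are assumptions made up for illustration; they are not the schema published at the link above, and a real BDB CSV file has many more columns.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a BDB batting CSV file (made-up rows).
SAMPLE_CSV = """playerID,yearID,AB,H,HR
playera01,2005,540,192,60
playerb01,2005,584,218,47
playerc01,2005,490,175,5
"""


def load_batting(conn, csv_text):
    """Create a small batting table and bulk-insert the rows from CSV text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batting ("
        "playerID TEXT, yearID INTEGER, AB INTEGER, H INTEGER, HR INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (r["playerID"], int(r["yearID"]), int(r["AB"]), int(r["H"]), int(r["HR"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO batting VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def top_home_runs(conn, limit=2):
    """Return (playerID, HR) pairs ordered by home runs, highest first."""
    cur = conn.execute(
        "SELECT playerID, HR FROM batting ORDER BY HR DESC LIMIT ?", (limit,)
    )
    return cur.fetchall()


# An in-memory database needs no server and no configuration.
conn = sqlite3.connect(":memory:")
loaded = load_batting(conn, SAMPLE_CSV)
leaders = top_home_runs(conn)
```

Pointing `sqlite3.connect` at a file name instead of `":memory:"` would keep the database on disk between scripts.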
      • Dereck L. Dietz
        Message 3 of 21, May 18 2:13 PM
          I'm an Oracle programmer analyst/DBA.  If you'd like to discuss this further feel free to contact me directly.
          ----- Original Message -----
          From: P Mondout
          Sent: Thursday, May 18, 2006 10:35 AM
          Subject: [baseball-databank] Re: Access and BDB

          How cool would it be if we could come up with simple step by step
          instructions (for MySQL, SQL Server, and Oracle, for example) for
          importing Retrosheet and BDB data into these tools along with pointers
          to similar step by step instructions to installing those databases and
          their new front-ends themselves? I know the focus of this group is the
          data itself, and I would never want to suggest that focus should
          change, but I do think we would all be enriched by the resulting flood
          of research if everyone who writes about baseball had a copy of MySQL
          (or one of the others) with a good front-end loaded with BDB and
          Retrosheet data and a bunch of well-documented queries and
          instructions on how to make more. I'm aware of previous discussions
          about producing BDB in other than csv format and am not trying to
          restart that topic. Just seems like a worthwhile long-term goal since
          so many would love to have such a tool - even if they never used the
          database software for anything else - at their fingertips. I suppose
          all of this functionality could be offered by a website without
          requiring the researcher to know about or even have a database.
          Hmmm....

        • Tangotiger
          Message 4 of 21, May 18 6:04 PM
            Someone else mentioned using Access as a
            front-end. This is absolutely true, and yet another
            great usability feature of Access. It's a snap to
            make an ODBC connection to any database through
            Access. What's cool is that Access makes
            views/queries so usable that you can store the data
            in Oracle, and then use Access as the front-end. I
            once even had to move data from one Oracle database to
            another, and Oracle was making it very complicated for
            me. So, I used Access as an intermediary. Access has
            great features, and should be the choice for 80% of
            the people on this list.

            Tom

            -----------------------------------------------
            THE BOOK -- Playing The Percentages In Baseball
            http://www.InsideTheBook.com
            -----------------------------------------------
          • Dereck L. Dietz
            Message 5 of 21, May 18 6:44 PM
              How was Oracle making it complicated to move data from one Oracle database to another?
            • Paul Wendt
              Message 6 of 21, May 19 5:23 PM
                Tangotiger wrote, twice recently:

                > What's cool is that Access makes views/queries so usable,

                If you write 100 queries (and save them),
                how do you keep them straight?


                > I work with Oracle all the time, but I use Access most of the time.

                Is this like "working" with Oracle all the time
                but "using" a mailreader most of the time?

                :-) but I am curious what you mean.

                Paul Wendt
              • Tangotiger
                Message 7 of 21, May 20 6:45 AM
                  --- Paul Wendt <pgw02472@...> wrote:

                  > Tangotiger wrote, twice recently:
                  >
                  > > What's cool is that Access makes views/queries so usable,
                  >
                  > If you write 100 queries (and save them),
                  > how do you keep them straight?

                  Ahhh... Access has a "properties" button, where you
                  can put comments and documentation for each view,
                  which you can then see. That column is also sortable,
                  which is how I can keep my 100 queries all straight.

                  You can also set up separate databases, and instead of
                  "import table", you create an external link to those
                  tables.

                  > > I work with Oracle all the time, but I use Access most of the time.
                  >
                  > Is this like "working" with oracle all the time
                  > but "using" a mailreader most of the time?
                  >
                  > :-) but I am curious what you mean.

                  I work with Oracle outside of baseball, but use
                  Access most of the time for baseball. You seem to
                  think that life = baseball!

                  Tom


                • Paul Wendt
                  Message 8 of 21, May 24 8:31 AM
                    Tangotiger <tangotiger@...> wrote:

                    > > Tangotiger wrote, twice recently:
                    > >
                    > > > What's cool is that Access makes views/queries so usable,
                    > >
                    > > If you write 100 queries (and save them),
                    > > how do you keep them straight?
                    >
                    > Ahhh... Access has a "properties" button, which you
                    > can put comments and documentation for each view,
                    > which you can then see. That column is also sortable,
                    > which is how I can keep my 100 queries all straight.

                    I considered that once long ago; maybe I should consider again.
                    Now I have Query names as two "dimensions" of organization
                    because they are both alphabetically sortable and readable.
                    I rejected Query descriptions, necessarily viewing only a small
                    subset of the names and descriptions in a list, in favor of
                    Query names only, optionally viewing all or many at once.
                    (The descriptions are like "from tmasc, RetroList 2003-01-01.")

                    With Query names and descriptions both alphabetically sortable
                    and readable, there are four dimensions for organization.
                    Of course, three might be used, leaving the fourth without any
                    organizational duties.

                    > You can also set up separate databases, and instead of
                    > "import table", you create an external link to those tables.

                    Thanks. This is clearly useful for some purposes,
                    not sure yet whether any of them are mine.


                    > > I work with Oracle all the time, but I use Access
                    > > most of the time.
                    > >
                    > > Is this like "working" with oracle all the time
                    > > but "using" a mailreader most of the time?
                    > >
                    > > :-) but I am curious what you mean.
                    >
                    > I work with Oracle outside of baseball. But use
                    > Access most of the time for baseball. You seem to
                    > think that life = baseball !

                    actually, "work" = baseball

                    That "most of the time" = baseball
                    we can take for granted!

                    Paul
                  • Tangotiger
                    Message 9 of 21, May 24 5:49 PM
                      --- Paul Wendt <pgw02472@...> wrote:
                      > > You can also set up separate databases, and instead of
                      > > "import table", you create an external link to those tables.
                      >
                      > Thanks. This is clearly useful for some purposes,
                      > not sure yet whether any of them are mine.

                      Just to give you some ideas. When I was working on my
                      project, I tackled many topics and subtopics. So,
                      when I worked on Relievers and Leverage, I had a
                      separate database, and I created a link to some of the
                      tables from my master database. Then, I set up
                      queries in this Relievers_Leverage database, so I could
                      extract what I needed. I added documentation to the
                      queries, so I knew which order to run them in:
                      "1a - Relievers"
                      "1b - Relief Aces"
                      or what have you. Since that column is sortable, as
                      well as the query name, it's a snap to get what I
                      needed. As well, you can do "materialized views"
                      (create table) from the existing table for performance
                      reasons (applies more to Access than anywhere else).

                      Spinning off databases, all linked to the same
                      master, makes everything easier to manage.

                      Hope this helps...

                      Tom
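The workflow above is specific to Access, but a rough analogue can be sketched with SQLite from Python: a topic database attaches the master database rather than importing its tables, then materializes ordered intermediate tables. This swaps in SQLite's ATTACH DATABASE for Access linked tables; the file names, table names, columns, thresholds, and rows below are all made up for illustration.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
master_path = os.path.join(workdir, "master.db")

# Master database: holds the full pitching table (tiny, made-up sample).
master = sqlite3.connect(master_path)
master.execute(
    "CREATE TABLE pitching (playerID TEXT, G INTEGER, GS INTEGER, SV INTEGER)"
)
master.executemany(
    "INSERT INTO pitching VALUES (?, ?, ?, ?)",
    [
        ("starter01", 33, 33, 0),
        ("closer01", 65, 0, 40),
        ("setup01", 70, 0, 3),
        ("swing01", 40, 5, 2),
    ],
)
master.commit()
master.close()

# Topic database: attach the master instead of copying its tables in.
topic = sqlite3.connect(os.path.join(workdir, "relievers_leverage.db"))
topic.execute("ATTACH DATABASE ? AS masterdb", (master_path,))

# Step "1a": materialize pitchers with no starts as a local table.
topic.execute(
    "CREATE TABLE q1a_relievers AS "
    "SELECT playerID, G, SV FROM masterdb.pitching WHERE GS = 0"
)
# Step "1b": build on step 1a; the 30-save cutoff is an arbitrary example.
topic.execute(
    "CREATE TABLE q1b_relief_aces AS "
    "SELECT playerID, SV FROM q1a_relievers WHERE SV >= 30"
)
topic.commit()

relievers = topic.execute(
    "SELECT playerID FROM q1a_relievers ORDER BY playerID"
).fetchall()
aces = topic.execute("SELECT playerID, SV FROM q1b_relief_aces").fetchall()
topic.close()
```

Prefixing the table names with the step number keeps them sorted in run order, much like the "1a", "1b" descriptions mentioned above.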


                    • Robert Ehrlich
                      Message 10 of 21, Aug 11, 2006
                        I am about to engage in an analysis of baseball stats. The database
                        that I am using is Lahman-52. We will be analyzing 1921 to 2004.

                        Is this a decent choice for the data?

                        In that DB there are columns for singles, doubles, triples, etc. as well
                        as at-bats.

                        The sum of such columns (including various ways to be out) is not the
                        same as the number of at-bats. However the numbers do not include
                        decimal points--so I assume that they represent counts.

                        Should I divide those columns by the at-bats to generate a string of
                        numbers that will total to the batting average? That is, a set of
                        columns that will sum to unity and a subset (hits, etc.) that will sum
                        to the batting average?

                        We are planning to eliminate all records where the number of games was
                        less than 20.

                        Depending on the complexity of the output, we may eliminate pitchers
                        from the input so as not to waste a degree of freedom.

                        Lastly we would like to have the names of potential collaborators in
                        interpreting and writing up this data.

                        Our analytical procedure is something called "Polytopic Vector Analysis"
                        (PVA) and we have already done a few trial runs with interesting results.

                        Bob Ehrlich
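On the normalization question above, a small sketch may help. It assumes the batting table stores total hits (H) alongside doubles, triples, and home runs, so singles are derived by subtraction; the function name and the sample line are made up. One likely reason the columns do not add up to at-bats is that walks, hit-by-pitches, and sacrifices are not counted as at-bats, while strikeouts are only one of several kinds of out.

```python
def at_bat_shares(ab, h, doubles, triples, hr):
    """Split at-bats into per-at-bat shares that sum to 1.

    The four hit shares sum to the batting average (H / AB); the "out"
    share covers every at-bat that did not end in a hit.
    """
    if ab <= 0:
        raise ValueError("at-bats must be positive")
    singles = h - doubles - triples - hr
    outs = ab - h
    return {
        "1B": singles / ab,
        "2B": doubles / ab,
        "3B": triples / ab,
        "HR": hr / ab,
        "out": outs / ab,
    }


# Made-up season line: 500 AB, 150 H, 30 doubles, 5 triples, 20 home runs.
line = at_bat_shares(ab=500, h=150, doubles=30, triples=5, hr=20)
batting_average = sum(line[k] for k in ("1B", "2B", "3B", "HR"))
total_share = sum(line.values())
```

Dividing by plate appearances instead of at-bats would let walks and sacrifices enter the vector as well, at the cost of the hit shares no longer summing to the batting average.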
                      • Tangotiger
                        Message 11 of 21, Aug 15, 2006
                          What is it that you are trying to accomplish? What is
                          the question that you are seeking an answer for?

                          Tom

                        • Robert Ehrlich
                          Message 12 of 21, Aug 15, 2006
                            Question:

                            Can the data vectors for each player be used to create a classification
                            of ball players?

                            First the number of player-types has to be determined from the data
                            matrix.

                            Then, once that is determined, the data vector for each player type is
                            calculated, and the proportionate contribution of each type to each
                            player is calculated.

                            In a prior run, for example, we determined that Ralph Kiner (!!!) was the
                            best hitter in postwar baseball, just a little bit better than Ted
                            Williams or Joe DiMaggio. Many of the now-notorious current hitters
                            suddenly become a "new" type of player at some point in their careers.

                            If we just take batting stats alone each player-type is represented by a
                            set of hitting statistics. Each player is represented by the
                            contribution of each type to his hitting stats. We can then graph his
                            performance over time. Another interesting finding in the last run was
                            the uniqueness of Willie Mays with respect to triples.

                            I helped design this software (I'm a kind of statistician) and use it
                            several times a week for more boring applications in chemistry and
                            environmental science.
                            Thought we would try another whack at baseball stats for fun.
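To make the idea of a "proportionate contribution of each type to each player" concrete, here is a toy sketch. It is not PVA itself: it assumes just two player types whose profiles are already known, whereas the procedure described above determines the number of types and their profiles from the data. The profiles, names, and numbers are invented for illustration.

```python
def mixing_weight(player, type_a, type_b):
    """Least-squares share of type_a in a player's profile.

    Assumes player is approximately w * type_a + (1 - w) * type_b and
    returns w clipped to the range [0, 1].
    """
    diff = [a - b for a, b in zip(type_a, type_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("the two type profiles are identical")
    w = sum((p - b) * d for p, b, d in zip(player, type_b, diff)) / denom
    return min(1.0, max(0.0, w))


# Made-up per-at-bat profiles: (1B, 2B, 3B, HR, out); each sums to 1.
slugger = (0.12, 0.05, 0.00, 0.08, 0.75)
contact = (0.24, 0.04, 0.02, 0.00, 0.70)

# A hypothetical player built as 25% slugger and 75% contact hitter.
player = tuple(0.25 * s + 0.75 * c for s, c in zip(slugger, contact))
slugger_share = mixing_weight(player, slugger, contact)
```

Computing such a share season by season would give the kind of per-player series that can be graphed over time.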

                          • Tangotiger
                            Message 13 of 21, Aug 15, 2006
                              If by classification you mean a profile or style of
                              player, that's a good project. I look forward to
                              seeing what you have. If you want some ideas as to
                              how to make the classification, you can do things
                              like:
                              (BB+K)/PA
                              3b/(2b+3b)
                              sb/(1b*.8+bb*.6)

                              They each represent something specific. In the 2006
                              Hardball Times Annual, they have profiles based on GB,
                              FB, LD, etc, tendencies. This data is now available
                              at Fangraphs.com, back to 2002.

                              However, the "proportionate" contribution is a long
                              and tired road, long-traveled. I suggest reading what
                              is out there, rather than reinventing the wheel. I'd
                              start (and stop) with Linear Weights.

                              Tom
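The three ratios suggested above can be computed from a season line roughly as follows. This is a sketch under stated assumptions: plate appearances are approximated as AB + BB + HBP + SH + SF, singles are derived as H minus extra-base hits, and the function name, dictionary keys, and sample numbers are made up.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def profile_ratios(ab, h, doubles, triples, hr, bb, so, sb, hbp=0, sh=0, sf=0):
    """Compute the three style ratios from one season's counting stats."""
    pa = ab + bb + hbp + sh + sf  # approximation of plate appearances
    singles = h - doubles - triples - hr
    return {
        # (BB+K)/PA: how often a plate appearance ends without a ball in play
        "bb_k_per_pa": safe_ratio(bb + so, pa),
        # 3b/(2b+3b): share of doubles-or-triples that were triples
        "triple_share": safe_ratio(triples, doubles + triples),
        # sb/(1b*.8+bb*.6): steals relative to weighted times on first base
        "sb_per_opportunity": safe_ratio(sb, singles * 0.8 + bb * 0.6),
    }


# Made-up season line for illustration.
ratios = profile_ratios(
    ab=500, h=150, doubles=30, triples=10, hr=20, bb=50, so=100, sb=20
)
```

Returning None for a zero denominator keeps players with, say, no doubles or triples from raising an error when the function is run over a whole table.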

                              --- Robert Ehrlich <bobehrlich@...>
                              wrote:

                              > Question:
                              >
                              > Can the data vectors for each player be used to
                              > create a classification
                              > of ball players?
                              >
                              > First the number of player-types has to be
                              > determined from the data
                              > matrix.
                              >
                              > Then once determined, the data vector for each
                              > player type is calculated
                              > ad the proportionate contribution of each type to
                              > each player is
                              > calculated.
                              >
                              > In a prior run for example we determined that Ralph
                              > Kiner (!!!) was the
                              > best hitter in post war baseball just a little bit
                              > better than Ted
                              > Williams or Joe DiMaggio. Many of the now-notorious
                              > current hitters
                              > become suddenly a "new" type of player at some point
                              > in their careers.
                              >
                              > If we just take batting stats alone each player-type
                              > is represented by a
                              > set of hitting statistics. Each player is
                              > represented by the
                              > contribution of each type to his hitting stats. We
                              > can then graph his
                              > performance over time. Another interesting finding
                              > in the last run was
                              > the uniqueness of Willy Mays with respect to
                              > triples.
                              >
                              > I helped design this software (I'm a kind of
                              > statistician) and use it
                              > several times a week for more boring applications in
                              > chemistry and
                              > environmental science.
                              > Though we would try another whack at baseball stats
                              > for fun.
                              >
                              > Tangotiger wrote:
                              >
                              > > What is it that you are trying to accomplish? What is
                              > > the question that you are seeking an answer for?
                              > >
                              > > Tom
                              > >
                              > > --- Robert Ehrlich <bobehrlich@...> wrote:
                              > >
                              > > >
                              > > > I am about to engage in an analysis of baseball stats. The database
                              > > > that I am using is Lahman-52. We will be analyzing 1921 to 2004.
                              > > >
                              > > > Is this a decent choice for the data?
                              > > >
                              > > > In that DB there are columns for singles, doubles, triples, etc. as
                              > > > well as at-bats.
                              > > >
                              > > > The sum of such columns (including various ways to be out) is not the
                              > > > same as the number of at-bats. However the numbers do not include
                              > > > decimal points--so I assume that they represent counts.
                              > > >
                              > > > Should I divide those columns by the at-bats to generate a string of
                              > > > numbers that will total to the batting average? That is, a set of
                              > > > columns that will sum to unity and a subset (hits, etc.) that will sum
                              > > > to the batting average?
                              > > >
                              > > > We are planning to eliminate all records where the number of games was
                              > > > less than 20.
                              > > >
                              > > > Depending on the complexity of the output, we may eliminate pitchers
                              > > > from the input so as not to waste a degree of freedom.
                              > > >
                              > > > Lastly we would like to have the names of potential collaborators in
                              > > > interpreting and writing up this data.
                              > > >
                              > > > Our analytical procedure is something called "Polytopic Vector Analysis"
                              > > > (PVA) and we have already done a few trial runs with interesting results.
                              > > >
                              > > > Bob Ehrlich
                              > > >
                              > > >
                              > > >
                              > >
                              > > -----------------------------------------------
                              > > THE BOOK -- Playing The Percentages In Baseball
                              > > http://www.InsideTheBook.com
                              > >
                              > > -----------------------------------------------
                              >
                              >


                              -----------------------------------------------
                              THE BOOK -- Playing The Percentages In Baseball
                              http://www.InsideTheBook.com
                              -----------------------------------------------
                            • Paul Wendt
                              ... Of course. ... As Tom Tango suggested, that reveals mixed motives. I believe that it must get in the way of classifying players --which you might do by
                              Message 14 of 21 , Aug 16, 2006
                                --- In baseball-databank@yahoogroups.com, Robert Ehrlich
                                <bobehrlich@...> wrote:
                                >
                                > Question:
                                > Can the data vectors for each player be used to create a
                                > classification of ball players?

                                Of course.

                                > First the number of player-types has to be determined from the data
                                > matrix.
                                >
                                > Then once determined, the data vector for each player type is
                                > calculated and the proportionate contribution of each type to each
                                > player is calculated.

                                As Tom Tango suggested, that reveals mixed motives. I believe that it
                                must get in the way of classifying players --which you might do by
                                defining distance between player vectors/matrices.
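
One way to read "distance between player vectors" is sketched below. This is only an illustration, not the PVA procedure: it assumes plain Euclidean distance on per-at-bat rates, and the three season lines are made-up numbers rather than records from the database.

```python
import math


def rate_vector(counts, denominator):
    """Turn raw event counts into per-opportunity rates."""
    return [c / denominator for c in counts]


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length rate vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical lines: (singles, doubles, triples, home runs) in 600 at bats.
player_a = rate_vector([120, 30, 5, 25], 600)
player_b = rate_vector([150, 20, 10, 5], 600)
player_c = rate_vector([118, 31, 4, 27], 600)

# A small distance suggests similar hitters under this (assumed) metric.
print("a vs b:", euclidean_distance(player_a, player_b))
print("a vs c:", euclidean_distance(player_a, player_c))
```

Here player_c comes out much closer to player_a than player_b does. Whether unweighted Euclidean distance is the right metric is an open choice; scaling each column first would change the answer.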

                                . . .
                                > I helped design this software (I'm a kind of statistician) and use
                                > it several times a week for more boring applications in chemistry
                                > and environmental science.
                                > Thought we would try another whack at baseball stats for fun.

                                I guess that you are inclined to denominate by at bats, ignoring bases
                                on balls and hits by pitch, sacrifice bunts and flies, because popular
                                baseball uses at bats. You should at least try denominating by plate
                                appearances. I daresay at least 90% of this audience will scoff at
                                definition of player-types by data matrices that exclude bases on
                                balls and hits by pitch, if not lesser plate appearances.
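
A minimal sketch of the two denominators, assuming the usual approximation PA = AB + BB + HBP + SH + SF (catcher's interference ignored) and the Batting-table column names AB, BB, HBP, SH, SF, H and HR. The season line is invented, and treating a missing value as zero is an assumption that the older records may not justify.

```python
PA_COLUMNS = ("AB", "BB", "HBP", "SH", "SF")


def plate_appearances(row):
    """Approximate PA as AB + BB + HBP + SH + SF.

    Assumption: None (null) is counted as zero here.
    """
    return sum(row.get(col) or 0 for col in PA_COLUMNS)


def rates(row, events, denominator):
    """Per-opportunity rate for each named event column."""
    return {e: (row.get(e) or 0) / denominator for e in events}


# Hypothetical season line, not taken from the database.
season = {"AB": 500, "H": 150, "HR": 30, "BB": 80, "HBP": 5, "SH": 2, "SF": 3}

pa = plate_appearances(season)
per_ab = rates(season, ("H", "HR"), season["AB"])
per_pa = rates(season, ("H", "HR", "BB", "HBP"), pa)

print("PA:", pa)
print("per at bat:", per_ab)
print("per plate appearance:", per_pa)
```

Denominating by at bats leaves the 80 walks and 5 hit-by-pitches out of the vector altogether; denominating by plate appearances lets them appear as rates alongside hits and home runs.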

                                In my opinion, it makes sense to include fielding data in the
                                analysis. Why not let the statistical(?) data analysis determine how
                                the distance between games played at second and third base interacts
                                with the distance between two-base and three-base hits?

                                Paul Wendt

                                P.S. I don't suppose you have the data to define kinds of
                                statisticians --only kinds of molecules and perhaps ballplayers.
                              • Paul Wendt
                                ... I guess that you are inclined to denominate by at bats, ignoring bases on balls and hits by pitch, sacrifice bunts and flies, because popular baseball uses
                                Message 15 of 21 , Aug 16, 2006
                                  Moments ago, I wrote:
                                  >>
                                  I guess that you are inclined to denominate by at bats, ignoring bases
                                  on balls and hits by pitch, sacrifice bunts and flies, because popular
                                  baseball uses at bats. You should at least try denominating by plate
                                  appearances. I daresay at least 90% of this audience will scoff at
                                  definition of player-types by data matrices that exclude bases on
                                  balls and hits by pitch, if not lesser plate appearances.
                                  <<

                                  As more grist for the same mill, you should see the article by Jim
                                  Albert in a recent By The Numbers, newsletter of the SABR Statistical
                                  Analysis Cmte. Albert considers several rates that may be interpreted
                                  sequentially. The difference between plate appearances and at bats
                                  makes the first stage; strikeouts per atbat the second; home run rate
                                  per atbat-minus-strikeout the third.

                                  Anyway, consider whether that analytical method permits testing the
                                  significance of that first stage. Does it make sense to begin, as you
                                  are disposed to do, by simply throwing out the plate appearances that
                                  are not atbats?

                                  Paul Wendt
                                • Robert Ehrlich
                                  Paul: Thanks for the advice--I will include the other stats. The hitting example was a feeble attempt to describe what I am doing. The procedure is robust in
                                  Message 16 of 21 , Aug 16, 2006
                                    Paul:

                                    Thanks for the advice--I will include the other stats. The hitting
                                    example was a feeble attempt to describe what I am doing.

                                    The procedure is robust in that it doesn't "care" if some of the columns
                                    are highly correlated. The home run rate / at bat etc. will definitely
                                    be in the data matrix and the other relationships should come out in
                                    the wash. Will keep my eyes open.

                                    Appreciate the heads up. Will try to reach Jim Albert. Is By The
                                    Numbers accessible from the web?

                                    Bob Ehrlich

                                    Paul Wendt wrote:

                                    > Moments ago, I wrote:
                                    > >>
                                    > I guess that you are inclined to denominate by at bats, ignoring bases
                                    > on balls and hits by pitch, sacrifice bunts and flies, because popular
                                    > baseball uses at bats. You should at least try denominating by plate
                                    > appearances. I daresay at least 90% of this audience will scoff at
                                    > definition of player-types by data matrices that exclude bases on
                                    > balls and hits by pitch, if not lesser plate appearances.
                                    > <<
                                    >
                                    > As more grist for the same mill, you should see the article by Jim
                                    > Albert in a recent By The Numbers, newsletter of the SABR Statistical
                                    > Analysis Cmte. Albert considers several rates that may be interpreted
                                    > sequentially. The difference between plate appearances and at bats
                                    > makes the first stage; strikeouts per atbat the second; home run rate
                                    > per atbat-minus-strikeout the third.
                                    >
                                    > Anyway, consider whether that analytical method permits testing the
                                    > significance of that first stage. Does it make sense to begin, as you
                                    > are disposed to do, by simply throwing out the plate appearances that
                                    > are not atbats?
                                    >
                                    > Paul Wendt
                                    >
                                    >
                                  • Tangotiger
                                    ... Voros first described this process when he developed DIPS. He would break up the stat line into binary components: HBP, no HBP. Of the no HBP, walk or no
                                    Message 17 of 21 , Aug 16, 2006
                                      --- Paul Wendt <pgw@...> wrote:
                                      > As more grist for the same mill, you should see the
                                      > article by Jim
                                      > Albert in a recent By The Numbers, newsletter of the
                                      > SABR Statistical
                                      > Analysis Cmte. Albert considers several rates that
                                      > may be interpreted
                                      > sequentially. The difference between plate
                                      > appearances and at bats
                                      > makes the first stage; strikeouts per atbat the
                                      > second; home run rate
                                      > per atbat-minus-strikeout the third.

                                      Voros first described this process when he developed
                                      DIPS. He would break up the stat line into binary
                                      components: HBP, no HBP. Of the no HBP, walk or no
                                      walk. Of the no HBP, no BB, K or no K, and so on.

                                      This process, or one like it, is a great way to try to
                                      get to a player's skillset.
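
The binary breakdown described above can be sketched roughly as follows. The order HBP, then BB, then K follows the description; carrying it on to HR as a fourth stage is an assumption about what "and so on" covers, and the counts are invented.

```python
def sequential_rates(plate_appearances, stages):
    """Compute each rate against the opportunities left after earlier stages.

    stages is an ordered list of (name, count) pairs. Returns the rates and
    the number of plate appearances not claimed by any stage.
    """
    remaining = plate_appearances
    result = {}
    for name, count in stages:
        if count > remaining:
            raise ValueError(f"{name} count exceeds remaining opportunities")
        result[name] = count / remaining if remaining else 0.0
        remaining -= count
    return result, remaining


# Hypothetical season: 600 PA with 5 HBP, 70 BB, 100 K, 25 HR.
stages = [("HBP", 5), ("BB", 70), ("K", 100), ("HR", 25)]
result, left_over = sequential_rates(600, stages)

for name, rate in result.items():
    print(f"{name}: {rate:.4f}")
print("plate appearances left for the next split:", left_over)
```

Each rate is conditional on everything before it, so the walk rate is 70 out of the 595 plate appearances that were not a hit by pitch, the strikeout rate is 100 out of the 525 that were neither, and so on.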

                                      Tom

                                      -----------------------------------------------
                                      THE BOOK -- Playing The Percentages In Baseball
                                      http://www.InsideTheBook.com
                                      -----------------------------------------------
                                    • Paul Wendt
                                      Kristin Campbell is a colleague, I suppose? ... The particular article by Jim Albert, advocating four rates understood sequentially,
                                      Message 18 of 21 , Aug 16, 2006
                                        Kristin Campbell <kcampbell53@...> is a colleague, I suppose?

                                        Rob Ehrlich wrote:
                                        > Thanks for the advice--I will include the other stats. The hitting
                                        > example was a feeble attempt to describe what I am doing.
                                        >
                                        > The procedure is robust in that it doesn't "care" if some of the columns
                                        > are highly correlated. The home run rate / at bat etc. will definitely
                                        > be in the data matrix and the other relationships should come out in
                                        > the wash. Will keep my eyes open.
                                        >
                                        > Appreciate the heads up. Will try to reach Jim Albert. Is By The
                                        > Numbers accessible from the web?

                                        The particular article by Jim Albert, advocating four rates understood
                                        sequentially, was recently published in BTN. Phil Birnbaum maintains a
                                        web archive covering his tenure as editor. See philbirnbaum.com


                                        Tom Tango wrote and I elliptically endorsed:
                                        >>
                                        However, the "proportionate" contribution is a long and tired road,
                                        long-traveled. I suggest reading what is out there, rather than
                                        reinventing the wheel. I'd start (and stop) with Linear Weights.
                                        <<

                                        Upon rereading the original:
                                        That comment may rest on a misinterpretation of the contribution
                                        concept, which may be the contribution of ideal types to actual players.
                                        For example, Mike Hargrove is 39% Walkmeister, 3% Slugger, 1%
                                        Twinkletoes, etc, where Walkmeister, Slugger, Twinkletoes etc, are the
                                        ideal types --perhaps literally a few condensation points in the space.
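
To make that reading concrete, here is a sketch of the forward direction only: given ideal types and a player's proportions, rebuild his expected rate line as a weighted blend. It is not the PVA unmixing itself, which has to estimate both the types and the proportions from the data. The type profiles and weights are hypothetical, and the weights are assumed to be non-negative and to sum to one.

```python
def blend(type_profiles, weights):
    """Weighted average of ideal-type rate profiles.

    Assumption: weights are non-negative and sum to 1.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    stats = next(iter(type_profiles.values())).keys()
    return {
        s: sum(weights[t] * type_profiles[t][s] for t in weights)
        for s in stats
    }


# Hypothetical per-plate-appearance rates for three ideal types.
type_profiles = {
    "Walkmeister": {"BB": 0.20, "HR": 0.02, "3B": 0.005},
    "Slugger": {"BB": 0.08, "HR": 0.07, "3B": 0.003},
    "Twinkletoes": {"BB": 0.06, "HR": 0.01, "3B": 0.020},
}

# Hypothetical proportions for one player.
weights = {"Walkmeister": 0.5, "Slugger": 0.3, "Twinkletoes": 0.2}

player = blend(type_profiles, weights)
for stat, rate in player.items():
    print(f"{stat}: {rate:.4f}")
```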

                                        I endorse the suggestion to spend little time identifying
                                        "contributions" to team runs scored, runs saved, and win-loss decisions.
                                        Postpone that for a year.

                                        Paul Wendt
                                      • Paul Wendt
                                        ... Based on what I know of the project, I might have guessed 1939 because in the Batting table I find 26106 records with null GIDP (ground into double play)
                                        Message 19 of 21 , Aug 16, 2006
                                          11 Aug 2006, Robert Ehrlich wrote:

                                          > I am about to engage in an analysis of baseball stats. The database
                                          > that I am using is Lahman-52. We will be analyzing 1921 to 2004.
                                          >
                                          > Is this a decent choice for the data?

                                          Based on what I know of the project, I might have guessed 1939 because
                                          in the Batting table I find 26106 records with null GIDP (ground into
                                          double play) among the 28102 records with yearID < 1939. I suppose that
                                          GIDP belongs in the classification of players by type, but others must
                                          suppose the same about CS, IBB, and SF, for which we have null data for
                                          some time after 1939.
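
A count like the one above can be reproduced along these lines. The sketch uses Python's sqlite3 module with an in-memory table and a handful of invented rows, so it only shows the shape of the query; the table and column names (Batting, yearID, GIDP) follow the ones mentioned here, and the cutoff year is a parameter to move around.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, AB INTEGER, GIDP INTEGER)"
)

# Invented rows: None becomes SQL NULL (unknown), which is not the same as 0.
rows = [
    ("a01", 1925, 400, None),
    ("b01", 1935, 350, 7),
    ("c01", 1938, 500, None),
    ("d01", 1950, 450, 12),
    ("e01", 1960, 300, 0),
]
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?, ?)", rows)

# COUNT(*) counts every row; COUNT(GIDP) skips rows where GIDP is NULL.
total, null_gidp = conn.execute(
    "SELECT COUNT(*), COUNT(*) - COUNT(GIDP) FROM Batting WHERE yearID < ?",
    (1939,),
).fetchone()

print(f"{null_gidp} of {total} records before 1939 have null GIDP")
conn.close()
```

Running the same query for CS, IBB and SF at several cutoff years would show where each column becomes usable.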

                                          Discussion by this group throughout its history has covered the
                                          crucial and painful matters, let me name them, "Null or Zero?" and
                                          "Elegant Variation in the representation of null data". Those matters
                                          mainly plague older records. Check the archive, perhaps searching for
                                          'null', and you will find some corrections to lahman52.

                                          Crucial and painful matters aside, there is the problem that nulls in
                                          the database represent both missing data (unknown) and category
                                          anachronisms such as first base on hit by pitch (HBP) before AA1884 and
                                          NL1887. I don't know when between 1887 and 1950 this problem becomes
                                          trivial.

                                          > In that DB there are columns for singles, doubles, triples, etc. as
                                          > well as at-bats.
                                          >
                                          > The sum of such columns (including various ways to be out) is not the
                                          > same as the number of at-bats. However the numbers do not include
                                          > decimal points--so I assume that they represent counts.

                                          Yes, although the Lahman database includes some derived "averages" too.

                                          Paul Wendt