Re: [baseball-databank] Re: Access and BDB

  • John Walsh
    Message 1 of 21 , May 18, 2006
      On 5/18/06, P Mondout <awesome80s@...> wrote:

      How cool would it be if we could come up with simple step by step
      instructions (for MySQL, SQL Server, and Oracle, for example) for
      importing Retrosheet and BDB data into these tools along with pointers
      to similar step by step instructions to installing those databases and
      their new front-ends themselves? I know the focus of this group is the
      data itself, and I would never want to suggest that focus should
      change, but I do think we would all be enriched by the resulting flood
      of research if everyone who writes about baseball had a copy of MySQL
      (or one of the others) with a good front-end loaded with BDB and
      Retrosheet data and a bunch of well-documented queries and
      instructions on how to make more.


      Much of what you are asking for here is available in the book "Baseball Hacks" by Joseph Adler, which has already been mentioned. I don't think it contains step-by-step instructions for installing MySQL (for example), but those are easy enough to find at www.mysql.com.

      Adler's book does contain quite a few queries (with explanations) for retrieving information from both the Retrosheet and BDB data. I already knew some SQL, but I learned a lot from those examples.

       -John Walsh



    • Mat Kovach
      Message 2 of 21 , May 18, 2006
        P Mondout wrote:

        > How cool would it be if we could come up with simple step by step
        > instructions (for MySQL, SQL Server, and Oracle, for example) for
        > importing Retrosheet and BDB data into these tools along with pointers
        > to similar step by step instructions to installing those databases and
        > their new front-ends themselves?

        I am presently writing an article that shows how to get the BDB
        information into PostgreSQL and load the Retrosheet data into MySQL and
        PostgreSQL. This is for a project I am working on
        (http://fungoes.mek.cc). The article will be Linux/UNIX specific but
        I'm sure some smart Windows person could do the same thing.

        > As to the question of Open Office's usability, I have never used it. I
        > have no reason to think Open Office's DB is anything less than first
        > rate.

        I have not loaded the data into Open Office, but I have used Open Office
        to connect to MySQL and PostgreSQL databases loaded with the data. It
        worked fine for me.

        Mat Kovach
      • Ben Matasar
        Message 3 of 21 , May 18, 2006
          > I am presently writing an article that shows how to get the BDB
          > information into PostgreSQL and load the Retrosheet data into MySQL and
          > PostgreSQL. This is for a project I am working on
          > (http://fungoes.mek.cc). The article will be Linux/UNIX specific but
          > I'm sure some smart Windows person could do the same thing.

          I think another option is SQLite[1], which is very fast and perfect
          for small queries. I did the conversion a couple months back, and
          have an SQLite schema on my website:

          http://matasar.org/blog/baseball/sqlite?showcomments=yes

          I highly recommend SQLite for smallish scripting language queries --
          there's almost no configuration necessary.

          1: http://www.sqlite.org/

          Ben
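As a rough illustration of how little setup SQLite needs, here is a minimal Python sketch that loads CSV data into an SQLite table using only the standard library. The sample rows, the table name `batting`, and the column names are assumptions made up for this example; they are not the schema posted at the link above, and a real BDB file would be opened from disk instead of from an in-memory string.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for a BDB-style batting CSV.
# The column names are assumptions, not the actual BDB schema.
SAMPLE_CSV = """playerID,yearID,AB,H
examp01,2005,500,150
examp02,2005,400,100
"""


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# ":memory:" keeps the example self-contained; pass a file path to persist.
conn = sqlite3.connect(":memory:")
load_csv(conn, "batting", io.StringIO(SAMPLE_CSV))

# Values arrive as text, so cast before doing arithmetic.
rows = conn.execute(
    "SELECT playerID, CAST(H AS REAL) / CAST(AB AS REAL) "
    "FROM batting ORDER BY playerID"
).fetchall()
print(rows)
```

For a real file, `load_csv(conn, "batting", open("Batting.csv", newline=""))` would be the equivalent call, with `Batting.csv` again being a placeholder name.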
        • Dereck L. Dietz
          Message 4 of 21 , May 18, 2006
            I'm an Oracle programmer analyst/DBA.  If you'd like to discuss this further feel free to contact me directly.
            ----- Original Message -----
            From: P Mondout
            Sent: Thursday, May 18, 2006 10:35 AM
            Subject: [baseball-databank] Re: Access and BDB

            How cool would it be if we could come up with simple step by step
            instructions (for MySQL, SQL Server, and Oracle, for example) for
            importing Retrosheet and BDB data into these tools along with pointers
            to similar step by step instructions to installing those databases and
            their new front-ends themselves? I know the focus of this group is the
            data itself, and I would never want to suggest that focus should
            change, but I do think we would all be enriched by the resulting flood
            of research if everyone who writes about baseball had a copy of MySQL
            (or one of the others) with a good front-end loaded with BDB and
            Retrosheet data and a bunch of well-documented queries and
            instructions on how to make more. I'm aware of previous discussions
            about producing BDB in other than csv format and am not trying to
            restart that topic. Just seems like a worthwhile long-term goal since
            so many would love to have such a tool - even if they never used the
            database software for anything else - at their fingertips. I suppose
            all of this functionality could be offered by a website without
            requiring the researcher to know about or even have a database.
            Hmmm....

          • Tangotiger
            Message 5 of 21 , May 18, 2006
              Someone else mentioned using Access as a
              front-end. This is absolutely true, and yet another
              great usability feature of Access. It's a snap to
              make an ODBC connection to any database through
              Access. What's cool is that Access makes
              views/queries so usable, that you can store the data
              in Oracle, and then use Access as the front-end. I
              once even had to move data from one Oracle database to
              another, and Oracle was making it very complicated for
              me. So, I used Access as an intermediary. Access has
              great features, and should be the choice for 80% of
              the people on this list.

              Tom

              -----------------------------------------------
              THE BOOK -- Playing The Percentages In Baseball
              http://www.InsideTheBook.com
            • Dereck L. Dietz
              Message 6 of 21 , May 18, 2006
                How was Oracle making it complicated to move data from one Oracle database to another?
              • Paul Wendt
                Message 7 of 21 , May 19, 2006
                  Tangotiger wrote, twice recently:

                  > What's cool is that Access makes views/queries so usable,

                  If you write 100 queries (and save them),
                  how do you keep them straight?


                  > I work with Oracle all the time, but I use Access most of the time.

                  Is this like "working" with oracle all the time
                  but "using" a mailreader most of the time?

                  :-) but I am curious what you mean.

                  Paul Wendt
                • Tangotiger
                  Message 8 of 21 , May 20, 2006
                    --- Paul Wendt <pgw02472@...> wrote:

                    > Tangotiger wrote, twice recently:
                    >
                    > > What's cool is that Access makes views/queries so
                    > usable,
                    >
                    > If you write 100 queries (and save them),
                    > how do you keep them straight?

                    Ahhh... Access has a "properties" button, where you
                    can put comments and documentation for each view,
                    which you can then see. That column is also sortable,
                    which is how I can keep my 100 queries all straight.

                    You can also set up separate databases, and instead of
                    "import table", you create an external link to those
                    tables.

                    >
                    >
                    > > I work with Oracle all the time, but I use Access
                    > most of the time.
                    >
                    > Is this like "working" with oracle all the time
                    > but "using" a mailreader most of the time?
                    >
                    > :-) but I am curious what you mean.
                    >

                    I work with Oracle outside of baseball. But use
                    Access most of the time for baseball. You seem to
                    think that life = baseball !

                    Tom


                  • Paul Wendt
                    Message 9 of 21 , May 24, 2006
                      Tangotiger <tangotiger@...> wrote:

                      > > Tangotiger wrote, twice recently:
                      > >
                      > > > What's cool is that Access makes views/queries so usable,
                      > >
                      > > If you write 100 queries (and save them),
                      > > how do you keep them straight?
                      >
                      > Ahhh... Access has a "properties" button, which you
                      > can put comments and documentation for each view,
                      > which you can then see. That column is also sortable,
                      > which is how I can keep my 100 queries all straight.

                      I considered that once long ago; maybe I should consider again.
                      Now I have Query names as two "dimensions" of organization
                      because they are both alphabetically sortable and readable.
                      I rejected Query descriptions, necessarily viewing only a small
                      subset of the names and descriptions in a list, in favor of
                      Query names only, optionally viewing all or many at once.
                      (The descriptions are like "from tmasc, RetroList 2003-01-01.")

                      With Query names and descriptions both alphabetically sortable
                      and readable, there are four dimensions for organization.
                      Of course, three might be used, leaving the fourth without any
                      organizational duties.

                      > You can also set up separate databases, and instead of
                      > "import table", you create an external link to those tables.

                      Thanks. This is clearly useful for some purposes,
                      not sure yet whether any of them are mine.


                      > > I work with Oracle all the time, but I use Access
                      > > most of the time.
                      > >
                      > > Is this like "working" with oracle all the time
                      > > but "using" a mailreader most of the time?
                      > >
                      > > :-) but I am curious what you mean.
                      >
                      > I work with Oracle outside of baseball. But use
                      > Access most of the time for baseball. You seem to
                      > think that life = baseball !

                      actually, "work" = baseball

                      That "most of the time" = baseball
                      we can take for granted!

                      Paul
                    • Tangotiger
                      Message 10 of 21 , May 24, 2006
                        --- Paul Wendt <pgw02472@...> wrote:
                        > > You can also set up separate databases, and
                        > instead of
                        > > "import table", you create an external link to
                        > those tables.
                        >
                        > Thanks. This is clearly useful for some purposes,
                        > not sure yet whether any of them are mine.
                        >

                        Just to give you some ideas. When I was working on my
                        project, I tackled many topics and subtopics. So,
                        when I worked on Relievers and Leverage, I had a
                        separate database, and I created links to some of the
                        tables in my master database. Then, I set up
                        queries in this Relievers_Leverage database, so I could
                        extract what I needed. I added documentation to the
                        queries, so I knew which order to run them in:
                        "1a - Relievers"
                        "1b - Relief Aces"
                        or what have you. Since that column is sortable, as
                        well as the query name, it was a snap to get what I
                        needed. You can also do "materialized views"
                        (create table) from an existing table for performance
                        reasons (this applies more to Access than anywhere else).

                        By spinning off databases, all linked to the same
                        master, it makes it easier to manage.

                        Hope this helps...

                        Tom
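The "materialized view" idea above, saving a query's result as its own table so later queries read the smaller table, can be sketched outside Access as well. The example below uses Python's built-in SQLite driver purely because it needs no setup; the table names, column names, and rows are invented for illustration and are not taken from the Relievers_Leverage database described in the message. In Access itself the equivalent would be a make-table query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "master" table; names and rows are made up for this sketch.
conn.executescript(
    """
    CREATE TABLE appearances (player TEXT, role TEXT, innings REAL);
    INSERT INTO appearances VALUES
        ('Pitcher A', 'RP', 70.0),
        ('Pitcher B', 'SP', 200.0),
        ('Pitcher C', 'RP', 65.5);
    """
)

# Step "1a - Relievers": store the query result as a real table, so that
# later steps query the small table instead of re-scanning the master.
conn.execute(
    """
    CREATE TABLE relievers AS
    SELECT player, innings FROM appearances WHERE role = 'RP'
    """
)

relievers = conn.execute(
    "SELECT player FROM relievers ORDER BY player"
).fetchall()
print(relievers)
```

The trade-off is that the stored table does not update when the master table changes, so it has to be rebuilt after the source data is refreshed.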


                      • Robert Ehrlich
                        Message 11 of 21 , Aug 11, 2006
                          I am about to engage in an analysis of baseball stats. The database
                          that I am using is Lahman-52. I will be analyzing 1921 to 2004.

                          Is this a decent choice for the data?

                          In that DB there are columns for singles, doubles, triples, etc. as well
                          as at-bats.

                          The sum of such columns (including various ways to be out) is not the
                          same as the number of at-bats. However the numbers do not include
                          decimal points--so I assume that they represent counts.

                          Should I divide those columns by the at-bats to generate a string of
                          numbers that will total to the batting average? That is, a set of
                          columns that will sum to unity and a subset (hits, etc.) that will sum
                          to the batting average?

                          We are planning to eliminate all records where the number of games was
                          fewer than 20.

                          Depending on the complexity of the output, we may eliminate pitchers
                          from the input so as not to waste a degree of freedom.

                          Lastly we would like to have the names of potential collaborators in
                          interpreting and writing up this data.

                          Our analytical procedure is something called "Polytopic Vector Analysis"
                          (PVA), and we have already done a few trial runs with interesting results.

                          Bob Ehrlich
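One way to read the normalization question above: if every at-bat is assigned to exactly one outcome, the outcome counts divided by at-bats sum to one, and the hit outcomes alone sum to the batting average. The sketch below uses made-up numbers, and the split into singles, doubles, triples, home runs, and "other" (every at-bat that did not end in a hit) is an assumption for illustration, not the column layout of the Lahman database. Walks, hit-by-pitches, and sacrifices are not at-bats, which is one possible reason the raw columns do not add up to at-bats.

```python
def outcome_shares(ab, singles, doubles, triples, home_runs):
    """Return each at-bat outcome as a fraction of at-bats."""
    if ab <= 0:
        raise ValueError("at-bats must be positive")
    hits = singles + doubles + triples + home_runs
    counts = {
        "1B": singles,
        "2B": doubles,
        "3B": triples,
        "HR": home_runs,
        "other": ab - hits,  # at-bats that did not end in a hit
    }
    return {name: count / ab for name, count in counts.items()}


# Made-up season line: 150 hits in 500 at-bats.
shares = outcome_shares(ab=500, singles=100, doubles=30, triples=5, home_runs=15)

batting_average = sum(v for name, v in shares.items() if name != "other")
total = sum(shares.values())
print(shares, batting_average, total)
```

Here the hit shares add up to the batting average and all five shares add up to one, apart from floating-point rounding.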
                        • Tangotiger
                          Message 12 of 21 , Aug 15, 2006
                            What is it that you are trying to accomplish? What is
                            the question that you are seeking an answer for?

                            Tom

                          • Robert Ehrlich
                            Message 13 of 21 , Aug 15, 2006
                              Question:

                              Can the data vectors for each player be used to create a classification
                              of ball players?

                              First the number of player-types has to be determined from the data
                              matrix.

                              Then, once that is determined, the data vector for each player type is calculated
                              and the proportionate contribution of each type to each player is
                              calculated.

                              In a prior run, for example, we determined that Ralph Kiner (!!!) was the
                              best hitter in postwar baseball, just a little bit better than Ted
                              Williams or Joe DiMaggio. Many of the now-notorious current hitters
                              suddenly become a "new" type of player at some point in their careers.

                              If we just take batting stats alone each player-type is represented by a
                              set of hitting statistics. Each player is represented by the
                              contribution of each type to his hitting stats. We can then graph his
                              performance over time. Another interesting finding in the last run was
                              the uniqueness of Willie Mays with respect to triples.

                              I helped design this software (I'm a kind of statistician) and use it
                              several times a week for more boring applications in chemistry and
                              environmental science.
                              Thought we would try another whack at baseball stats for fun.

                              Tangotiger wrote:

                              > What is it that you are trying to accomplish? What is
                              > the question that you are seeking an answer for?
                            • Tangotiger
                              Message 14 of 21 , Aug 15, 2006
                                If by classification you mean a profile or style of
                                player, that's a good project. I look forward to
                                seeing what you have. If you want some ideas as to
                                how to make the classification, you can do things
                                like:
                                (BB+K)/PA
                                3B/(2B+3B)
                                SB/(1B*.8+BB*.6)

                                They each represent something specific. In the 2006
                                Hardball Times Annual, they have profiles based on GB,
                                FB, LD, etc, tendencies. This data is now available
                                at Fangraphs.com, back to 2002.

                                However, the "proportionate" contribution is a long
                                and tired road, long-traveled. I suggest reading what
                                is out there, rather than reinventing the wheel. I'd
                                start (and stop) with Linear Weights.

                                Tom
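The three ratios listed above can be computed directly from a season line. The following is a minimal sketch with made-up inputs; the function name, the dictionary keys, and the choice to return `None` when a denominator is zero are all assumptions for illustration, and the labels in the comments are only loose descriptions, not terms from the message.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def player_profile(pa, bb, k, singles, doubles, triples, sb):
    """Compute the three profile ratios from raw counting stats."""
    return {
        # (BB+K)/PA: share of plate appearances ending in a walk or strikeout
        "bb_k_per_pa": safe_ratio(bb + k, pa),
        # 3B/(2B+3B): share of doubles-plus-triples that were triples
        "triple_share": safe_ratio(triples, doubles + triples),
        # SB/(1B*.8+BB*.6): steals relative to weighted singles and walks
        "sb_rate": safe_ratio(sb, singles * 0.8 + bb * 0.6),
    }


# Made-up season line for illustration only.
profile = player_profile(
    pa=600, bb=60, k=90, singles=100, doubles=30, triples=10, sb=20
)
print(profile)
```

Each ratio is unitless, so players with very different playing time can be compared on the same scale, although small samples will still be noisy.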

                                --- Robert Ehrlich <bobehrlich@...>
                                wrote:

                                > Question:
                                >
                                > Can the data vectors for each player be used to
                                > create a classification
                                > of ball players?
                                >
                                > First the number of player-types has to be
                                > determined from the data
                                > matrix.
                                >
                                > Then once determined, the data vector for each
                                > player type is calculated
                                > and the proportionate contribution of each type to
                                > each player is
                                > calculated.
                                >
                                > In a prior run for example we determined that Ralph
                                > Kiner (!!!) was the
                                > best hitter in post war baseball just a little bit
                                > better than Ted
                                > Williams or Joe DiMaggio. Many of the now-notorious
                                > current hitters
                                > become suddenly a "new" type of player at some point
                                > in their careers.
                                >
                                > If we just take batting stats alone each player-type
                                > is represented by a
                                > set of hitting statistics. Each player is
                                > represented by the
                                > contribution of each type to his hitting stats. We
                                > can then graph his
                                > performance over time. Another interesting finding
                                > in the last run was
                                > the uniqueness of Willie Mays with respect to
                                > triples.
                                >
                                > I helped design this software (I'm a kind of
                                > statistician) and use it
                                > several times a week for more boring applications in
                                > chemistry and
                                > environmental science.
                                > Thought we would try another whack at baseball stats
                                > for fun.
                                >
                                > Tangotiger wrote:
                                >
                                > > What is it that you are trying to accomplish? What
                                > is
                                > > the question that you are seeking an answer for?
                                > >
                                > > Tom
                                > >
                                > > --- Robert Ehrlich <bobehrlich@...
                                > > <mailto:bobehrlich%40residuumenergy.com>>
                                > > wrote:
                                > >
                                > > >
                                > > > I am about to engage in an analysis of baseball
                                > > > stats. The database
                                > > > that I am using is Lahman-52. We will be analyzing
                                > > > 1921 to 2004.
                                > > >
                                > > > Is this a decent choice for the data?
                                > > >
                                > > > In that DB there are columns for singles,
                                > doubles,
                                > > > triples, etc. as well
                                > > > as at-bats.
                                > > >
                                > > > The sum of such columns (including various ways
                                > to
                                > > > be out) is not the
                                > > > same as the number of at-bats. However the
                                > numbers
                                > > > do not include
                                > > > decimal points--so I assume that they represent
                                > > > counts.
                                > > >
                                > > > Should I divide those columns by the at-bats to
                                > > > generate a string of
                                > > > numbers that will total to the batting average?
                                > > > That is, a set of
                                > > > columns that will sum to unity and a subset
                                > (hits,
                                > > > etc.) that will sum
                                > > > to the batting average?
                                > > >
                                > > > We are planning to eliminate all records where
                                > the
                                > > > number of games was
                                > > > less than 20.
                                > > >
                                > > > Depending on the complexity of the output, we
                                > may
                                > > > eliminate pitchers
                                > > > from the input so as not to waste a degree of
                                > > > freedom.
                                > > >
                                > > > Lastly we would like to have the names of
                                > potential
                                > > > collaborators in
                                > > > interpreting and writing up this data.
                                > > >
                                > > > Our analytical procedure is something called
                                > > > "Polytopic Vector Analysis"
                                > > > (PVA) and we have already done a few trial
                                > runs
                                > > > with interesting results.
                                > > >
                                > > > Bob Ehrlich

                                -----------------------------------------------
                                THE BOOK -- Playing The Percentages In Baseball
                                http://www.InsideTheBook.com
                              • Paul Wendt
                                Message 15 of 21 , Aug 16, 2006
                                  --- In baseball-databank@yahoogroups.com, Robert Ehrlich
                                  <bobehrlich@...> wrote:
                                  >
                                  > Question:
                                  > Can the data vectors for each player be used to create a
                                  > classification of ball players?

                                  Of course.

                                  > First the number of player-types has to be determined from the data
                                  > matrix.
                                  >
                                  > Then once determined, the data vector for each player type is
                                  > calculated and the proportionate contribution of each type to each
                                  > player is calculated.

                                  As Tom Tango suggested, that reveals mixed motives. I believe that it
                                  must get in the way of classifying players --which you might do by
                                  defining distance between player vectors/matrices.
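One way to read "distance between player vectors" is plain Euclidean distance between per-player rate vectors. The sketch below assumes that reading; the player names and numbers are invented, and the choice of Euclidean distance (rather than some scaled or weighted metric) is an assumption, not something stated in the thread.

```python
import math
from typing import Dict, List, Tuple


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Straight-line distance between two equal-length rate vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_player(name: str,
                   players: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the other player closest to `name`, with the distance."""
    target = players[name]
    candidates = [
        (other, euclidean_distance(target, vec))
        for other, vec in players.items()
        if other != name
    ]
    return min(candidates, key=lambda pair: pair[1])


# Invented rate vectors, e.g. (walk rate, home run rate, strikeout rate).
players = {
    "Player A": [0.10, 0.03, 0.15],
    "Player B": [0.11, 0.03, 0.16],
    "Player C": [0.05, 0.08, 0.25],
}
closest, dist = nearest_player("Player A", players)
print(closest, round(dist, 4))
```

In practice the columns would probably need to be put on a common scale first, since rates with large variance would otherwise dominate the distance.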

                                  . . .
                                  > I helped design this software (I'm a kind of statistician) and use
                                  > it several times a week for more boring applications in chemistry
                                  > and environmental science.
                                  > Thought we would try another whack at baseball stats for fun.

                                  I guess that you are inclined to denominate by at bats, ignoring bases
                                  on balls and hits by pitch, sacrifice bunts and flies, because popular
                                  baseball uses at bats. You should at least try denominating by plate
                                  appearances. I daresay at least 90% of this audience will scoff at
                                  definition of player-types by data matrices that exclude bases on
                                  balls and hits by pitch, if not lesser plate appearances.
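Denominating by plate appearances instead of at bats can be sketched as follows. The function and key names are hypothetical. Plate appearances are approximated as AB + BB + HBP + SH + SF, which ignores rare events such as catcher's interference, and treating missing (null) counts as zero is an assumption the caller has to make deliberately.

```python
from typing import Dict, Optional


def plate_appearances(ab: int, bb: int = 0, hbp: int = 0,
                      sh: int = 0, sf: int = 0) -> int:
    """Approximate PA as at bats plus walks, HBP, sacrifice bunts and flies."""
    return ab + bb + hbp + sh + sf


def rates_per_pa(counts: Dict[str, int], pa: int) -> Dict[str, Optional[float]]:
    """Divide each counting stat by PA; None when PA is zero."""
    return {name: (value / pa if pa else None) for name, value in counts.items()}


# Made-up season, for illustration only.
pa = plate_appearances(ab=500, bb=60, hbp=5, sh=3, sf=4)
rates = rates_per_pa({"h": 150, "bb": 60, "hbp": 5, "hr": 30}, pa)
print(pa, rates)
```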

                                  In my opinion, it makes sense to include fielding data in the
                                  analysis. Why not let the statistical(?) data analysis determine how
                                  the distance between games played at second and third base interacts
                                  with the distance between two-base and three-base hits?

                                  Paul Wendt

                                  P.S. I don't suppose you have the data to define kinds of
                                  statisticians --only kinds of molecules and perhaps ballplayers.
                                • Paul Wendt
                                  Message 16 of 21 , Aug 16, 2006
                                    Moments ago, I wrote:
                                    >>
                                    I guess that you are inclined to denominate by at bats, ignoring bases
                                    on balls and hits by pitch, sacrifice bunts and flies, because popular
                                    baseball uses at bats. You should at least try denominating by plate
                                    appearances. I daresay at least 90% of this audience will scoff at
                                    definition of player-types by data matrices that exclude bases on
                                    balls and hits by pitch, if not lesser plate appearances.
                                    <<

                                    As more grist for the same mill, you should see the article by Jim
                                    Albert in a recent By The Numbers, newsletter of the SABR Statistical
                                    Analysis Cmte. Albert considers several rates that may be interpreted
                                    sequentially. The difference between plate appearances and at bats
                                    makes the first stage; strikeouts per atbat the second; home run rate
                                    per atbat-minus-strikeout the third.
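The three stages as summarized above might be computed like this. It follows only the description in this post, not Albert's article itself; the function and key names are hypothetical and the sample numbers are invented.

```python
from typing import Dict, Optional


def sequential_rates(pa: int, ab: int, so: int, hr: int) -> Dict[str, Optional[float]]:
    """Three stage-by-stage rates, each conditioned on the previous stage."""
    def ratio(num: float, den: float) -> Optional[float]:
        return num / den if den else None

    return {
        # Stage 1: share of plate appearances that are not at bats.
        "non_ab_per_pa": ratio(pa - ab, pa),
        # Stage 2: strikeouts per at bat.
        "so_per_ab": ratio(so, ab),
        # Stage 3: home runs per at bat that did not end in a strikeout.
        "hr_per_ab_minus_so": ratio(hr, ab - so),
    }


# Made-up season, for illustration only.
stages = sequential_rates(pa=600, ab=500, so=100, hr=20)
print(stages)
```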

                                    Anyway, consider whether that analytical method permits testing the
                                    significance of that first stage. Does it make sense to begin, as you
                                    are disposed to do, by simply throwing out the plate appearances that
                                    are not atbats?

                                    Paul Wendt
                                  • Robert Ehrlich
                                    Message 17 of 21 , Aug 16, 2006
                                      Paul:

                                      Thanks for the advice--I will include the other stats. The hitting
                                      example was a feeble attempt to describe what I am doing.

                                      The procedure is robust in that it doesn't "care" if some of the columns
                                      are highly correlated. The home run rate / at bat etc. will definitely
                                      be in the data matrix and the other relationships should come out in
                                      the wash. Will keep my eyes open.


                                      Appreciate the heads up. Will try to reach Jim Albert. Is the By the
                                      Numbers accessible from the web?

                                      Bob Ehrlich

                                      Paul Wendt wrote:

                                      > Moments ago, I wrote:
                                      > >>
                                      > I guess that you are inclined to denominate by at bats, ignoring bases
                                      > on balls and hits by pitch, sacrifice bunts and flies, because popular
                                      > baseball uses at bats. You should at least try denominating by plate
                                      > appearances. I daresay at least 90% of this audience will scoff at
                                      > definition of player-types by data matrices that exclude bases on
                                      > balls and hits by pitch, if not lesser plate appearances.
                                      > <<
                                      >
                                      > As more grist for the same mill, you should see the article by Jim
                                      > Albert in a recent By The Numbers, newsletter of the SABR Statistical
                                      > Analysis Cmte. Albert considers several rates that may be interpreted
                                      > sequentially. The difference between plate appearances and at bats
                                      > makes the first stage; strikeouts per atbat the second; home run rate
                                      > per atbat-minus-strikeout the third.
                                      >
                                      > Anyway, consider whether that analytical method permits testing the
                                      > significance of that first stage. Does it make sense to begin, as you
                                      > are disposed to do, by simply throwing out the plate appearances that
                                      > are not atbats?
                                      >
                                      > Paul Wendt
                                      >
                                      >
                                    • Tangotiger
                                      Message 18 of 21 , Aug 16, 2006
                                        --- Paul Wendt <pgw@...> wrote:
                                        > As more grist for the same mill, you should see the
                                        > article by Jim
                                        > Albert in a recent By The Numbers, newsletter of the
                                        > SABR Statistical
                                        > Analysis Cmte. Albert considers several rates that
                                        > may be interpreted
                                        > sequentially. The difference between plate
                                        > appearances and at bats
                                        > makes the first stage; strikeouts per atbat the
                                        > second; home run rate
                                        > per atbat-minus-strikeout the third.

                                        Voros first described this process when he developed
                                        DIPS. He would break up the stat line into binary
                                        components: HBP, no HBP. Of the no HBP, walk or no
                                        walk. Of the no HBP, no BB, K or no K, and so on.

                                        This process, or one like it, is a great way to try to
                                        get to a player's skillset.
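A minimal sketch of the binary breakdown described above: each outcome is measured as a share of whatever plate appearances remain after the earlier outcomes are peeled off. Only the HBP, BB and K steps named in the post are shown; the function name and the sample counts are invented, and nothing here is taken from the DIPS method beyond that ordering.

```python
from fractions import Fraction
from typing import List, Tuple


def binary_split(total: int,
                 outcomes: List[Tuple[str, int]]) -> List[Tuple[str, Fraction]]:
    """Peel outcomes off in order; each rate is count / remaining opportunities."""
    remaining = total
    rates = []
    for name, count in outcomes:
        if count > remaining:
            raise ValueError(f"{name} count exceeds remaining opportunities")
        if remaining == 0:
            break  # nothing left to split
        rates.append((name, Fraction(count, remaining)))
        remaining -= count
    return rates


# Made-up line: 600 PA, 5 HBP, 60 BB, 100 K.
split = binary_split(600, [("HBP", 5), ("BB", 60), ("K", 100)])
for name, rate in split:
    print(name, rate, float(rate))
```

`Fraction` is used so the conditional rates stay exact; plain floats would serve equally well for analysis.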

                                        Tom

                                      • Paul Wendt
                                        Message 19 of 21 , Aug 16, 2006
                                          Kristin Campbell <kcampbell53@...> is a colleague, I suppose?

                                          Rob Ehrlich wrote:
                                          > Thanks for the advice--I will include the other stats. The hitting
                                          > example was a feeble attempt to describe what I am doing.
                                          >
                                          > The procedure is robust in that it doesn't "care" if some of the columns
                                          > are highly correlated. The home run rate / at bat etc. will definitely
                                          > be in the data matrix and the other relationships should come out in
                                          > the wash. Will keep my eyes open.
                                          >
                                          > Appreciate the heads up. Will try to reach Jim Albert. Is the By the
                                          > Numbers accessible from the web?

                                          The particular article by Jim Albert, advocating four rates understood
                                          sequentially, was recently published in BTN. Phil Birnbaum maintains a
                                          web archive covering his tenure as editor. See philbirnbaum.com


                                          Tom Tango wrote and I elliptically endorsed:
                                          >>
                                          However, the "proportionate" contribution is a long and tired road,
                                          long-traveled. I suggest reading what is out there, rather than
                                          reinventing the wheel. I'd start (and stop) with Linear Weights.
                                          <<

                                          Upon rereading the original:
                                          That comment may rest on a misinterpretation of the contribution
                                          concept, which may be the contribution of ideal types to actual players.
                                          For example, Mike Hargrove is 39% Walkmeister, 3% Slugger, 1%
                                          Twinkletoes, etc, where Walkmeister, Slugger, Twinkletoes etc, are the
                                          ideal types --perhaps literally a few condensation points in the space.
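The "contribution of ideal types to actual players" reading can be illustrated as a weighted mixture: a player's rate vector is the sum of each ideal type's vector times that type's share. The sketch below assumes the types and shares are already known, which is the easy direction; it is not PVA, which would have to estimate both from the data. The type vectors and weights are invented and are not the Hargrove figures above.

```python
from typing import Dict, List


def mix_types(weights: Dict[str, float],
              types: Dict[str, List[float]]) -> List[float]:
    """Blend ideal-type rate vectors using shares that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    length = len(next(iter(types.values())))
    mixed = [0.0] * length
    for name, share in weights.items():
        for i, value in enumerate(types[name]):
            mixed[i] += share * value
    return mixed


# Invented ideal types, each as (walk rate, home run rate).
ideal_types = {
    "Walkmeister": [0.20, 0.02],
    "Slugger": [0.08, 0.08],
}
weights = {"Walkmeister": 0.75, "Slugger": 0.25}
player = mix_types(weights, ideal_types)
print(player)
```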

                                          I endorse the suggestion to spend little time identifying
                                          "contributions" to team runs scored, runs saved, and win-loss decisions.
                                          Postpone that for a year.

                                          Paul Wendt
                                        • Paul Wendt
                                          Message 20 of 21 , Aug 16, 2006
                                            11 Aug 2006, Robert Ehrlich wrote:

                                            > I am about to engage in an analysis of baseball stats. The database
                                            > that I am using is Lahman-52. We will be analyzing 1921 to 2004.
                                            >
                                            > Is this a decent choice for the data?

                                            Based on what I know of the project, I might have guessed 1939 because
                                            in the Batting table I find 26106 records with null GIDP (ground into
                                            double play) among the 28102 records with yearID < 1939. I suppose that
                                            GIDP belongs in the classification of players by type, but others must
                                            suppose the same about CS, IBB, and SF, for which we have null data for
                                            some time after 1939.
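A count like the one above can be reproduced with a query of the following shape. The sketch uses Python's built-in `sqlite3` with a tiny in-memory stand-in for the Batting table; the rows are invented, so the numbers it prints are not the 26106 and 28102 quoted above. Only the `Batting`, `yearID` and `GIDP` names come from the post.

```python
import sqlite3

# Toy stand-in for the Batting table; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Batting (playerID TEXT, yearID INTEGER, GIDP INTEGER)")
rows = [
    ("a", 1930, None),  # GIDP not recorded
    ("b", 1935, 7),
    ("c", 1938, None),
    ("d", 1950, 12),
    ("e", 1960, 0),     # a true zero, not a null
]
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?)", rows)

# COUNT(*) counts every row; COUNT(GIDP) skips nulls, so the difference
# is the number of records with null GIDP.
total, null_gidp = conn.execute(
    "SELECT COUNT(*), COUNT(*) - COUNT(GIDP) FROM Batting WHERE yearID < 1939"
).fetchone()

# Zeros are counted separately, since null and zero mean different things.
(zero_gidp,) = conn.execute(
    "SELECT COUNT(*) FROM Batting WHERE GIDP = 0"
).fetchone()

print(total, null_gidp, zero_gidp)
conn.close()
```

Running the same query per column (CS, IBB, SF) and per year would show where each statistic starts being recorded.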

                                            Discussion by this group throughout its history has covered the
                                            crucial and painful matters, let me name them, "Null or Zero?" and
                                            "Elegant Variation in the representation of null data". Those matters
                                            mainly plague older records. Check the archive, perhaps searching for
                                            'null', and you will find some corrections to lahman52.

                                            Crucial and painful matters aside, there is the problem that nulls in
                                            the database represent both missing data (unknown) and category
                                            anachronisms such as first base on hit by pitch (HBP) before AA1884 and
                                            NL1887. I don't know when between 1887 and 1950 this problem becomes
                                            trivial.

                                            > In that DB there are columns for singles, doubles, triples, etc. as
                                            > well as at-bats.
                                            >
                                            > The sum of such columns (including various ways to be out) is not the
                                            > same as the number of at-bats. However the numbers do not include
                                            > decimal points--so I assume that they represent counts.

                                            Yes, although the lahman database includes some derived "averages" too.

                                            Paul Wendt