Proposal: Convert current db back to solely numeric keys and use this as the base db

  • Sean Forman
    Message 1 of 20 , Jan 24, 2006
      I want to get the ball rolling here. I wanted to put this out for
      public discussion to hash out issues and come to a consensus and then
      put the board to use by voting to approve or disapprove this proposal.


      Proposal:
      First off, I apologize for the confusion that my premature release of
      a numeric key db in August 2005 caused. I put aside some time to work
      on the db and wanted to show people what I had and get some feedback.
      I now realize that was not the most constructive thing to do and I
      should have made clear that it was a branch of the current db.

      That said, this proposal is that we should create version 2005.1 using
      the current data (with recent non-controversial errata included) in
      the form of a solely numeric key db along the lines of the August 2005
      db. In fact, I'll propose that the August 2005 db is the framework we
      use (incorporating helpful suggestions) and then move forward on the
      addition and correction of data from there based on that dataset.

      Since we've never done this before, I'm proposing discussion for the
      period of one week and then the board will vote from Tues. Jan. 31 to
      Feb. 3. With at least four yes votes and no no votes signalling the
      proposal passes. I'd love for it to be unanimous. I'm happy to
      discuss on or off-list if anyone is concerned about the proposed
      process.

      Board and non-board members are welcome to discuss this proposal.

      Sean Forman

      --
      Sincerely,
      Sean Forman

      Baseball Stats! http://www.Baseball-Reference.com/
    • dsreyn
      Message 2 of 20 , Jan 24, 2006
        Speaking as a user of the CSV version of the database, my preference
        is to stay with the current format (non-numeric keys) for the release
        versions. I gather that numeric keys may have some advantages for
        those who actually maintain the database, but I'm not sure I see the
        value added for end users. Among other things, I find it very useful
        from time to time to be able to read the "raw" data directly, and
        purely numeric keys will make that much more difficult.

        Doug

      • Tangotiger
        Message 3 of 20 , Jan 24, 2006
          At one point, perhaps still the current point, the
          mission statement of this group was to produce a DB
          for the DB developer, and not the end-user. The DB
          developers would then be charged to produce a
          user-friendly database (like Lahman).

          To that end, all design principles, data sources, and
          requirements need to be sorted out, so that when BDB,
          Release 1 is finally released, it will be a stable
          design, one which can be expanded with ease, with
          minimal changes. Numeric keys, and other design
          considerations would go into here.

          And once that happens, the DB developers will take
          this database and produce a truly remarkable database
          in all forms (web, db, palm, whatever) for the
          end-user, one where the end-user is not going to try
          to figure out how to join tables, or worry about key
          changes, or design changes.

          These two steps have to go hand-in-hand, so that no
          one will be left in the lurch. (Seamless
          transition.)

          I also think that until we get something stable, the
          end-user, right now, would simply prefer to get an
          updated list of data, say batting2005.csv, that he can
          simply import into his MS Access or MySQL database,
          without needing to worry that "things have changed".
          I think Lahman recently said that he prefers an
          "overhaul" every 3 or 4 years, and I agree.

          Tom


        • Keith Hemmelman
          Message 4 of 20 , Jan 24, 2006
            From my standpoint as an 'end user' type of position, I work with the CSV version of the files and bring them into Microsoft Access to create forms, queries, reports, etc. for me to use.  I assume you are referring to the 'public' release (i.e. CSV files) of the data with this change in format.  Thus, my comments are based on this.
             
            I can work with either format.  But my main concern is that once a format is decided upon, it remain in place.  I spent quite a bit of time working on the August version and then stopped and started over with the December version and it appears that perhaps I may start over again.  I'm not totally helpless, but my skills in MS Access are not what I wished they were, so to have to deal with a large change in format each time a new version of the database is released would be an extreme inconvenience for me.  Certainly it's not something I think I can't overcome, but I'd like to get to a point where instead of spending all my waking hours working on the database, I can start using it.
             
            With regards to the August format, as I recall, it seemed to be easy for me to work with, but no more so than the current format.  I did kind of like the format of the August release with that numeric key which seemed to be a consistent primary type key across tables, so if there are advantages for you to use the August style format or if it's just more logical for most folks to work with, then it makes sense to me why the format should change, but hopefully once a decision has been made on a format, it will remain in place.
             
            No matter which format is used though, I appreciate your and everyone else's efforts in making this data available.  For a baseball fan, being able to have access to this data is more than I could have hoped for, so thank you.
             
            Keith Hemmelman
          • John Walsh
            Message 5 of 20 , Jan 25, 2006
              Hi Sean,

              A brief note on how I use the DB, so you understand my viewpoint. I'm also an end-user and not a DB designer/maintainer.  I have been using the DB for a few years now, typically I grab the mysql version from the baseball-databank.org site and simply load the whole thing into mysql. To do my analysis, I run sql queries on the tables, either "directly" using the mysql interface or via perl scripts.

              I've looked a little at the new format and one thing that struck me right away is that what were "quick and dirty" queries on the current format, are not so quick in the new format. For example, let's say I want a list of all 50 HR seasons, with player, team and year information. I can do that easily like this:

              select playerid, teamid, yearid, hr from Batting where hr>=50;

              This will give me the list I want and I can figure out who's who from the playerid. In the new format, to get the same list with the same info I'd have to do a query that includes 4 different tables. That's because the playerid, the year and the team id are not stored in the Batting table, but in the Master, Teams and TeamsFranchises tables, respectively. (Actually, I no longer see a team ID in the new format, only a franchise ID and a team "name", which is not so useful.) The resulting "triple join" query (which I won't try to produce here) is a lot more complicated and much slower than the query above. [Caveat: I'm not a true SQL expert, so if I'm being inaccurate here, I'd love for somebody to contradict me. Send me my 50 HR query for the new format.]
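
For reference, here is a rough sketch of what the four-table version of that query might look like. The August 2005 schema is not reproduced in this thread, so every table and column name below (playerKey, teamKey, franchKey, and so on) is an assumption, and the handful of rows are sample data only. It uses Python's built-in sqlite3 module rather than MySQL so that it runs without a server.

```python
import sqlite3

# Hypothetical numeric-key layout; names are assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Master (playerKey INTEGER PRIMARY KEY, playerID TEXT);
CREATE TABLE TeamsFranchises (franchKey INTEGER PRIMARY KEY, franchID TEXT);
CREATE TABLE Teams (
    teamKey   INTEGER PRIMARY KEY,
    yearID    INTEGER,
    franchKey INTEGER REFERENCES TeamsFranchises(franchKey)
);
CREATE TABLE Batting (
    playerKey INTEGER REFERENCES Master(playerKey),
    teamKey   INTEGER REFERENCES Teams(teamKey),
    HR        INTEGER
);

-- Sample rows for illustration only.
INSERT INTO Master VALUES (1, 'bondsba01'), (2, 'ruthba01'), (3, 'henderi01');
INSERT INTO TeamsFranchises VALUES (1, 'SFG'), (2, 'NYY');
INSERT INTO Teams VALUES (10, 2001, 1), (11, 1927, 2), (12, 1986, 2);
INSERT INTO Batting VALUES (1, 10, 73), (2, 11, 60), (3, 12, 28);
""")

# Batting holds only numeric keys, so the readable IDs come from three joins.
rows = conn.execute("""
    SELECT m.playerID, f.franchID, t.yearID, b.HR
    FROM Batting b
    JOIN Master m          ON m.playerKey = b.playerKey
    JOIN Teams t           ON t.teamKey   = b.teamKey
    JOIN TeamsFranchises f ON f.franchKey = t.franchKey
    WHERE b.HR >= 50
    ORDER BY b.HR DESC
""").fetchall()

for row in rows:
    print(row)
```

The query is longer than the one-table version, which is the point being made above. Whether it is noticeably slower would depend on the indexes and the database engine, so that part is not demonstrated here.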

              Now, Tango is talking about a joint effort where DB designers go nuts with numerical keys and front-end providers make everything right again for the end-users. If that's the case and there are people to do both jobs, then fine. However, if in the end, the end-user is forced to join together 4 tables to get a list of 50 hr seasons, then I think the usefulness of the LahmanDB will have been considerably diminished.

               -John Walsh



            • Sean Forman
              Message 6 of 20 , Jan 25, 2006
                Is this a vote for or against or an abstention?

                I agree that we need to (at some times) take a high level approach,
                but what I'm asking now is: what is the first step in getting from a
                to b?

                I agree that this isn't an end user question and we should support
                them as best we can. The question is from now until release 1
                (opening day? all-star break?) what is the first step to get us on the
                path.

                The numeric key question has been a contentious one all along and I'm
                proposing that we settle that first, produce a prototype and then
                begin ironing out issues from there.

                The developers who are providing services to the end users are going
                to have to understand that we have a usable release already out and
                that between now and the end of the 2006 season there is going to be a
                lot of flux and they rely on a version at their peril.

                sean

                On 1/24/06, Tangotiger <tangotiger@...> wrote:
                > At one point, perhaps still the current point, the
                > mission statement of this group was to produce a DB
                > for the DB developer, and not the end-user. The DB
                > developers would then be charged to produce a
                > user-friendly database (like Lahman).
                >
                > To that end, all design principles, data sources, and
                > requirements need to be sorted out, so that when BDB,
                > Release 1 is finally released, it will be a stable
                > design, one which can be expanded with ease, with
                > minimal changes. Numeric keys, and other design
                > considerations would go into here.
                >
                > And once that happens, the DB developers will take
                > this database and produce a truly remarkable database
                > in all forms (web, db, palm, whatever) for the
                > end-user, one where the end-user is not going to try
                > to figure out how to join tables, or worry about key
                > changes, or design changes.
                >
                > These two steps have to go hand-in-hand, so that no
                > one will be left out in the lurches. (Seamless
                > transition.)
                >
                > I also think that until we get something stable, the
                > end-user, right now, would simply prefer to get an
                > updated list of data, say batting2005.csv, that he can
                > simply import into his MS Access or MySQL database,
                > without needing to worry that "things have changed".
                > I think Lahman recently said that he prefers an
                > "overhaul" every 3 or 4 years, and I agree.
                >
                > Tom


                --
                Sincerely,
                Sean Forman

                Baseball Stats! http://www.Baseball-Reference.com/
              • Joseph Adler
                Message 7 of 20 , Jan 25, 2006
                  Hi Sean,

                  I think there are actually a lot of different possibilities here. I
                  think we can break this down into a few different issues.

                  The Baseball Databank team maintains a master database from which all
                  other files are derived. The first question we have is this: what form
                  should this database take: numeric key, or other (depending on table,
                  some combination of playerID, yearID, teamID, and other fields)?

                  The second question is about formats released to the public. In what
                  formats should the database be released to the public? There is
                  clearly demand for a version of the database as a MySQL dump file and
                  as text files, and there are some users who like each format. It
                  sounds like numeric keys are not great for text file users, but less
                  of an issue for database users.

                  The final question is about incremental updates vs. full dumps. It
                  sounds like there might be some users who would like incremental
                  upgrades to the database between years, probably containing just new
                  records for different tables and maybe some corrected records. Are
                  there users who would like incremental updates?

                  Here's my take on these issues. I think that the core team should
                  work in whichever form they find easiest. I think that a normalized,
                  numeric key database is probably best (and think that it would be
                  even better with some foreign key constraints), because it's easiest
                  to keep the database consistent in this format.
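
The consistency argument can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than MySQL so that it runs anywhere, and the table and column names (Master, Batting, playerKey) are assumptions for illustration, not the actual Databank schema. With a foreign key constraint in place, a batting row that points at a nonexistent player is rejected at insert time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Hypothetical tables; names are assumptions for this sketch.
CREATE TABLE Master (
    playerKey INTEGER PRIMARY KEY,
    playerID  TEXT NOT NULL UNIQUE
);
CREATE TABLE Batting (
    playerKey INTEGER NOT NULL REFERENCES Master(playerKey),
    yearID    INTEGER NOT NULL,
    HR        INTEGER
);
INSERT INTO Master VALUES (1, 'henderi01');
INSERT INTO Batting VALUES (1, 1986, 28);
""")

# Player key 999 does not exist in Master, so this insert should fail.
try:
    conn.execute("INSERT INTO Batting VALUES (999, 2005, 10)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

batting_rows = conn.execute("SELECT COUNT(*) FROM Batting").fetchone()[0]
print(orphan_rejected, batting_rows)
```

The same idea applies to character keys, so this is an argument for constraints as much as for numeric keys. In MySQL, foreign keys are enforced by the InnoDB storage engine but not by MyISAM, so the choice of table type would matter for a dump file.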

                  I also think that it would be helpful to continue the current practice
                  of releasing versions of the database in multiple formats. Perhaps the
                  best choice is to release the database as a set of text files, a MySQL
                  dump of the numeric key database, and a MySQL dump of the traditional
                  format database.

                  I posted a script file recently for building a numeric key database
                  from the current file. It should be straightforward to write a short
                  SQL script to do the opposite. So, if you're open to posting files in
                  multiple formats, I think there is a straightforward way to create
                  these files without creating extra work.
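
As a minimal sketch of the "opposite" direction, the traditional format could be derived from the numeric key tables with a view (or a CREATE TABLE ... AS SELECT for a dump file). This is not the script mentioned above. The names BattingNK, playerKey and teamKey are hypothetical, the sketch assumes the Teams table still carries a character teamID column, and it runs on Python's built-in sqlite3 module rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical numeric-key tables (names are assumptions for this sketch).
CREATE TABLE Master (playerKey INTEGER PRIMARY KEY, playerID TEXT);
CREATE TABLE Teams (teamKey INTEGER PRIMARY KEY, yearID INTEGER, teamID TEXT);
CREATE TABLE BattingNK (playerKey INTEGER, teamKey INTEGER, HR INTEGER);

-- Sample rows for illustration only.
INSERT INTO Master VALUES (1, 'ruthba01'), (2, 'henderi01');
INSERT INTO Teams VALUES (10, 1927, 'NYA'), (11, 1986, 'NYA');
INSERT INTO BattingNK VALUES (1, 10, 60), (2, 11, 28);

-- Traditional-format view: the character keys are joined back in once,
-- so users of the old layout do not have to write the joins themselves.
CREATE VIEW Batting AS
SELECT m.playerID, t.yearID, t.teamID, b.HR
FROM BattingNK b
JOIN Master m ON m.playerKey = b.playerKey
JOIN Teams t  ON t.teamKey   = b.teamKey;
""")

# The one-table query from earlier in the thread runs against the view as-is.
rows = conn.execute(
    "select playerid, teamid, yearid, hr from Batting where hr>=50"
).fetchall()
print(rows)
```

Exporting such a view to CSV or to a second dump file would give text-file users the familiar layout while the master copy stays normalized.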

                  -- Joe

                • tjruane
                  Message 8 of 20 , Jan 25, 2006
                    Sean Forman wrote:

                    > That said, this proposal is that we should create version
                    > 2005.1 using the current data (with recent non-controversial
                    > errata included) in the form of a solely numeric key db along
                    > the lines of the August 2005 db. In fact, I'll propose that
                    > the August 2005 db is the framework we use (incorporating
                    > helpful suggestions) and then move forward on the
                    > addition and correction of data from there based on that
                    > dataset.

                    For what it's worth (and I admit that may not be much), I think
                    converting to strictly numeric keys is a big mistake. I maintain a
                    (sort of) data base for Retrosheet's use and can't imagine how much
                    more difficult and error-prone this would be with numeric keys. Now,
                    I should perhaps emphasize the "sort of" in the previous sentence
                    since, apart from a few college courses a long time ago, I have had
                    little exposure to or interest in classic data base design and view
                    Retrosheet's data from a purely pragmatic view. In short, I want
                    something that is easy to maintain and understand. So my data evolved
                    into a simple set of comma-delimited records. I know what you're
                    thinking: "amateur" and I agree. Nevertheless, it is much easier to
                    correct and add to this data with our current keys than it would be if
                    they conveyed no information at all. Having said that, I am not the
                    one who will be maintaining your data, so my opinion probably
                    shouldn't count for all that much.

                    Tom Ruane
                  • Holmes, Dan
                    Message 9 of 20 , Jan 25, 2006
                      As a developer, I use the database much differently than an end-user. I
                      understand the need for the database to be truly relational and the need
                      for numeric keys. I suggest we move in that direction, as soon as
                      possible.

                      Although I understand that the end-user is the ultimate audience here, I
                      think that the integrity and performance of the database is most
                      critical. If it's between making things easier for those who actually
                      maintain the DB and take it and create an end-product (website, mobile
                      data, XML, spreadsheets, etc.) and making things easier for the
                      end-user, it should be the former, not the latter.

                      The DB is FREE - and in my opinion, the least the end-user can do is
                      learn a little about querying the DB in order to use it. A little
                      knowledge on how to use MS Access or SQL isn't a bad thing, after all.

                      To add more to the discussion: I have my own "version" of the DB that I
                      maintain. At the end of each season, I get the updated stats and import
                      them into my own framework, which contains queries and tables that I've
                      developed to answer questions I'm interested in. This is largely based
                      on the raw data available free to everyone. It would improve the
                      performance of the DB to have numeric keys and a truly relational DB.
                      I've actually done some of that myself in my version of the DB. I've
                      contributed all of my own raw data (hitting streaks and milestones, for
                      example) to the common pool. People are free to download that, if they
                      wish. Every year I import the new data and my queries are fine, with
                      some tweaking here and there. I don't think I should demand that the
                      group provide the data to me in a format that makes this updating
                      process easier. I think the onus is on me to use the data and develop
                      with it in ways I want.

                      That's my more than two cents.

                      FYI - some of the queries and data I have developed are available at
                      http://www.thebaseballpage.com/players/
                      Just enter a player name in the Player Page search box or click on a
                      player name from that page. Choose his Player Stats link from that page,
                      and you'll see how I've used the DB. A direct link, for example:
                      http://www.thebaseballpage.com/players/stats/henderi01.php shows Rickey
                      Henderson's career stats.

                      Dan Holmes
                    • KJOK
                      Message 10 of 20 , Jan 25, 2006
                        Another 2 cents...

                        At this point, I always like to point out that only addressing the
                        groups "end-users" and "developers" is a bit simplistic. There is
                        at least a 3rd group I would call "data contributors", who
                        contribute corrections, new data, new research, etc. There is
                        probably even a 4th group, an end-user/developer hybrid that I'll
                        call "analysts" for simplicity, that simply want to be able to use
                        the data themselves to do their own queries instead of relying on
                        someone creating a website or tool to provide data displays, but are
                        not necessarily interested in creating such tools or websites
                        themselves.

                        As I consider myself a member of groups #3 and #4, I haven't
                        been very excited about the move to numeric keys, because as others
                        have pointed out it makes maintenance a little more difficult, and
                        sometimes makes writing queries more cumbersome. It may also make
                        it slightly more difficult to 'cross-link' to other data sets, such
                        as Retrosheet for an example.

                        However, I think the key thought is this posted comment:

                        "I think that a normalized, numeric key database is probably best
                        (and think that it would be even better with some foreign key
                        constraints), because it's easiest to keep the database consistent
                        in this format."

                        I believe I understand enough about normalization to realize that
                        the benefits outlined in the above comment are going to outweigh the
                        minor inconveniences of not having the more 'user-friendly' keys.
                        There are ways around not having the "old" keys. For players, for
                        example, one could simply use the "Lastname", "firstname", and "DOB"
                        fields whenever doing maintenance or creating new data.

                        I WOULD strongly request that for some tables, such as Teams or
                        Leagues, we not "throw away" what are keys today, such as "SLN"
                        for the St. Louis Cardinals or "AA" for the American Association, but
                        instead change those fields from keys to "abbreviation" or something
                        similar. In those cases I think it's worth keeping the
                        current key alongside the new numeric key.
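
To make the idea of keeping the old key as an "abbreviation" column concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (teams, batting, team_id, abbreviation, hr) and the sample row values are assumptions made up for illustration; they are not the actual BDB schema or real statistics. It also shows the kind of foreign key constraint mentioned in the quoted comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled

# Hypothetical schema: numeric key plus the old key kept as a plain column.
conn.executescript("""
CREATE TABLE teams (
    team_id      INTEGER PRIMARY KEY,   -- new numeric key
    abbreviation TEXT NOT NULL UNIQUE,  -- former key, e.g. 'SLN'
    name         TEXT NOT NULL
);
CREATE TABLE batting (
    player_id INTEGER NOT NULL,
    year      INTEGER NOT NULL,
    team_id   INTEGER NOT NULL REFERENCES teams(team_id),
    hr        INTEGER NOT NULL,
    PRIMARY KEY (player_id, year, team_id)
);
""")

# Placeholder rows, not real data.
conn.execute("INSERT INTO teams VALUES (1, 'SLN', 'St. Louis Cardinals')")
conn.execute("INSERT INTO batting VALUES (100, 2005, 1, 41)")

# Analysts can still filter by the familiar abbreviation.
row = conn.execute(
    "SELECT b.hr FROM batting b JOIN teams t ON t.team_id = b.team_id "
    "WHERE t.abbreviation = ?",
    ("SLN",),
).fetchone()

# A batting row that points at a nonexistent team is rejected.
try:
    conn.execute("INSERT INTO batting VALUES (100, 2005, 99, 3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The join costs one extra line per query, which is the "minor inconvenience" side of the trade; the constraint is what keeps the tables consistent.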


                        To try to wrap up, what I see as the concerns about moving to
                        numeric keys are the ability of End-Users, Data Contributors, and
                        Analysts to do things such as:


                        1. Maintain the existing data, and easily incorporate changes,
                        corrections, and new information.

                        2. Be able to manipulate and analyze the data with queries that are
                        not too complex for non-developers to write.

                        3. Tie together other data sets such as Retrosheet, or Minor League
                        data, or new research, with the BDB dataset, such as if we have
                        Satchel Paige's Negro League stats, for example, to be able to
                        combine them with his major league stats into one player "page".


                        As I envision the new version, I believe we can move to numeric keys
                        AND still be able to address these concerns.
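
One way concern 3 might be handled under numeric keys is a separate cross-reference table that maps each numeric player key to the identifiers other data sets use. The sketch below is only an illustration: the table names (players, external_ids), the lookup helper, and the identifier "placeholder01" are all hypothetical, not real Retrosheet or BDB values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: one row per player, many external ids per player.
conn.executescript("""
CREATE TABLE players (
    player_id  INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL
);
CREATE TABLE external_ids (
    player_id INTEGER NOT NULL REFERENCES players(player_id),
    source    TEXT NOT NULL,   -- e.g. 'retrosheet'
    ext_id    TEXT NOT NULL,   -- identifier used by that source
    PRIMARY KEY (source, ext_id)
);
""")

conn.execute("INSERT INTO players VALUES (1, 'Paige', 'Satchel')")
# 'placeholder01' is a made-up identifier, not a real Retrosheet id.
conn.execute("INSERT INTO external_ids VALUES (1, 'retrosheet', 'placeholder01')")


def lookup(source, ext_id):
    """Return (player_id, last_name) for an external id, or None if unknown."""
    return conn.execute(
        "SELECT p.player_id, p.last_name FROM external_ids e "
        "JOIN players p ON p.player_id = e.player_id "
        "WHERE e.source = ? AND e.ext_id = ?",
        (source, ext_id),
    ).fetchone()
```

With a table like this, stats from another source can be attached to the same numeric player key without the key itself having to carry any meaning.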

                        THANKS,
                        Kevin Johnson


                      • Tangotiger
                        Message 11 of 20 , Jan 25, 2006
                          The question of whether to go to numeric keys or not
                          is not really the issue. This is going to happen
                          eventually. It's not if, but when (1 year or 10
                          years).

                          The real question is how to get from "a" (current
                          database) to "z" (the "good" database). With our
                          current setup (volunteer), we've got to get to step b,
                          to step c, etc. Numeric keys is one of the steps.
                          There's a whole set of steps to follow. Those have
                          yet to be identified.

                          As KJOK was quick to remind us, there are actually
                          other groups involved, notably the data providers. We
                          need to (eventually) integrate with (or absorb) them.

                          The other question being asked (or that should be asked) is
                          when to roll out these releases. In my opinion, until
                          you have requirements, analysis, and design, you can't
                          do an implementation. This means we should maintain the
                          status quo (just provide incremental data updates to
                          the existing data structure).

                          Tom


                        • parked43
                          Message 12 of 20 , Jan 25, 2006
                            I'm an end user but am familiar with developing small databases. One
                            DB I use holds data from a baseball simulation game that I play. I
                            use numeric keys for player, team and league data. In the case of the
                            databank, I prefer the non-numeric keys but I really don't care what
                            are used for keys. But if the database is switched to numeric keys,
                            then don't change them with each new release. For instance, if Mickey
                            Mantle is player 11050 this year, don't make him player 11100
                            next year because fifty of baseball's rookies had names that were
                            alphabetically ahead of Mantle. Start the rookies with the next
                            number in line.
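
The rule above can be sketched in a few lines of Python. The function name assign_ids, the use of a name string as the lookup key, and every id other than Mantle's 11050 are assumptions made up for this illustration.

```python
def assign_ids(existing, new_names):
    """Return a copy of `existing` (name -> id) with ids appended for new names.

    Existing ids are never changed; new names get the next unused numbers.
    """
    ids = dict(existing)
    next_id = max(ids.values(), default=0) + 1
    for name in sorted(new_names):
        if name not in ids:
            ids[name] = next_id
            next_id += 1
    return ids


# Hypothetical ids from one release; only Mantle's number comes from the post.
ids_this_year = {"Mantle, Mickey": 11050, "Example, Veteran": 11051}

# A rookie whose name sorts ahead of Mantle still gets the next number in line.
ids_next_year = assign_ids(ids_this_year, ["Aardvark, Rookie"])
```

Here Mantle stays at 11050 and the rookie becomes 11052, so anything that stored last year's keys keeps working.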

                            Don Parke...

                          • Keith Hemmelman
                            Message 13 of 20 , Jan 26, 2006
                              I don't disagree that the purpose of this group should be to serve the high-end developer. I will say, though, that the description of the group on the Yahoo page didn't say that when I signed up. It says:
                              ===========
                              Description
                              An effort to accumulate and redistribute baseball data in a
                              convenient and easy to use form.
                              ===========
                              Regardless of this, I feel the design of the database needs to be tuned to those who do all the work, so it is easier and more efficient for them. That just makes the most sense.

                              Perhaps simply releasing the data corrections in a CSV file in the current format is a good solution for the immediate term. A new format can then be released when ready, with the understanding that it will be used from that point forward since it is more efficient to work with.

                              I can work with either the current format or the numeric key format. And if it came to the point that I couldn't work with any format, it's not the end of the world since, as you said, there are others who provide additional formats of this data. And even if that didn't occur, I'll still survive.

                              Keith Hemmelman

                              -----Original Message-----
                              From: baseball-databank@yahoogroups.com [mailto:baseball-databank@yahoogroups.com]On Behalf Of Tangotiger
                              Sent: Tuesday, January 24, 2006 9:58 PM
                              To: baseball-databank@yahoogroups.com
                              Subject: Re: [baseball-databank] Proposal: Convert current db back to solely numeric keys and use this as the base db

                              At one point, perhaps still the current point, the
                              mission statement of this group was to produce a DB
                              for the DB developer, and not the end-user.  The DB
                              developers would then be charged to produce a
                              user-friendly database (like Lahman).

                              To that end, all design principles, data sources, and
                              requirements need to be sorted out, so that when BDB,
                              Release 1 is finally released, it will be a stable
                              design, one which can be expanded with ease, with
                              minimal changes.  Numeric keys, and other design
                              considerations would go into here.

                              And once that happens, the DB developers will take
                              this database and produce a truly remarkable database
                              in all forms (web, db, palm, whatever) for the
                              end-user, one where the end-user is not going to try
                              to figure out how to join tables, or worry about key
                              changes, or design changes.

                              These two steps have to go hand-in-hand, so that no
                              one will be left in the lurch. (Seamless
                              transition.)

                              I also think that until we get something stable, the
                              end-user, right now, would simply prefer to get an
                              updated list of data, say batting2005.csv, that he can
                              simply import into his MS Access or MySQL database,
                              without needing to worry that "things have changed".
                              I think Lahman recently said that he prefers an
                              "overhaul" every 3 or 4 years, and I agree.

                              Tom
                            • Sean Forman
                              Message 14 of 20 , Jan 26, 2006
                                Tango,

                                What do you see as step (b)?

                                This is a very diverse, wide-ranging group that has never met in person or talked on the phone. I think expecting a detailed spec is too much to ask. While we want to minimize redundant work, or work that is later undone, there is no possible way to avoid it entirely. I think we should figure out which high-impact moves we can make now and then go from there. Incremental steps, not large-scale design.

                                If we are going to eventually go to numeric keys, then I think numeric keys are the first thing we should do to get this boat (a 15th century exploration) out of the harbor.

                                sean

                                --
                                Sincerely,
                                Sean Forman

                                Baseball Stats!   http://www.Baseball-Reference.com/
                            • Charles Creasy
                              Message 15 of 20 , Jan 26, 2006
                                I'm unsure of the method used for assigning the keys, but I believe
                                Yahoo Sports, ESPN, CNNSI, and a few other sites all use the same numeric
                                key for each player. If a change is made, assigning keys based on that
                                algorithm would make including data from those places that much easier.

                              • Tangotiger
                                Message 16 of 20 , Jan 26, 2006
                                  Step (b), or "by March 1", should be to implement the
                                  design I posted here, on or around Jan 31, 2003. That
                                  one includes the Parks Database from KJOK, and removes
                                  redundant fields. (I didn't download the current BDB,
                                  but I assume that KJOK's Parks DB is not part of it.)

                                  As Ruane and KJOK have each noted, there are good
                                  reasons to have a non-numeric key, chief among them
                                  maintainability. As long as we don't have an
                                  application that controls how the data is input (which
                                  would allow us to go to all-numeric keys), starting
                                  with numeric-only keys will introduce errors in the
                                  long run.

                                  Step b, in essence, is to gather the data we have into
                                  a 3NF form.
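
As a rough illustration of what removing a redundant field looks like when normalizing, here is a small sketch with Python's sqlite3 module. It is not the Jan 2003 design referred to above; the table names (batting_flat, teams, batting), the columns, and the sample rows are invented for the example, and it keeps the non-numeric team key in line with the maintainability point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flat table: team_name is repeated on every row and depends only on team_abbr.
CREATE TABLE batting_flat (
    player_id INTEGER, year INTEGER, team_abbr TEXT, team_name TEXT, games INTEGER
);
-- Placeholder rows, not real data.
INSERT INTO batting_flat VALUES (1, 2004, 'SLN', 'St. Louis Cardinals', 10);
INSERT INTO batting_flat VALUES (2, 2004, 'SLN', 'St. Louis Cardinals', 20);

-- Normalized: each team name is stored exactly once.
CREATE TABLE teams (team_abbr TEXT PRIMARY KEY, team_name TEXT NOT NULL);
INSERT INTO teams SELECT DISTINCT team_abbr, team_name FROM batting_flat;

CREATE TABLE batting (
    player_id INTEGER,
    year      INTEGER,
    team_abbr TEXT REFERENCES teams(team_abbr),
    games     INTEGER,
    PRIMARY KEY (player_id, year, team_abbr)
);
INSERT INTO batting SELECT player_id, year, team_abbr, games FROM batting_flat;
""")

team_count = conn.execute("SELECT COUNT(*) FROM teams").fetchone()[0]
batting_count = conn.execute("SELECT COUNT(*) FROM batting").fetchone()[0]
```

After the split, a correction to a team name touches one row in teams instead of every batting row that mentions it.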

                                  Tom


                                  --- Sean Forman <sean-forman@...>
                                  wrote:

                                  > Tango,
                                  >
                                  > What do you see as step (b)?
                                  >
                                  > This is a very diverse wide ranging group that has
                                  > never met in person and
                                  > have never talked on the phone. I think expecting a
                                  > detailed spec is too
                                  > much to ask. I think that while we want to minimize
                                  > redundant work or work
                                  > that is later undone, there is no possible way to
                                  > avoid it. I think we
                                  > should figure out what are the high impact moves we
                                  > can make now and then go
                                  > from there. Incremental steps not large-scale
                                  > design.
                                  >
                                  > If we are going to eventually go to numeric keys
                                  > then I think numeric keys
                                  > are the first thing we should do to get this boat (a
                                  > 15th century
                                  > expolation) out of the harbor.
                                  >
                                  > sean
                                  >
                                  > On 1/25/06, Tangotiger <tangotiger@...> wrote:
                                  > >
                                  > > The question of whether to go to numeric keys or
                                  > not
                                  > > is not really the issue. This is going to happen
                                  > > eventually. It's not if, but when (1 year or 10
                                  > > years).
                                  > >
                                  > > The real question is how to get from "a" (current
                                  > > database) to "z" (the "good" database). With our
                                  > > current setup (volunteer), we've got to get to
                                  > step b,
                                  > > to step c, etc. Numeric keys is one of the steps.
                                  > > There's a whole set of steps to follow. Those
                                  > have
                                  > > yet to be identified.
                                  > >
                                  > > As KJOK was quick to remind us, there are actually
                                  > > other groups involved, notably the data providers.
                                  > We
                                  > > need to (eventually) integrate with (or absorb)
                                  > them.
                                  > >
                                  > > The other question being asked (or should be
                                  > asked) is
                                  > > when to rollout these releases. In my opinion,
                                  > until
                                  > > you have requirements, analysis, and design, you
                                  > can't
                                  > > do an implementation. This means, we should
                                  > maintain
                                  > > status quo (just provide incremental data updates
                                  > to
                                  > > the existing data structure).
                                  > >
                                  > > Tom
                                  > >
                                  > >
                                  > --
                                  > Sincerely,
                                  > Sean Forman
                                  >
                                  > Baseball Stats! http://www.Baseball-Reference.com/
                                  >


                                • Paul Wendt
                                  Message 17 of 20 , Jan 27, 2006
                                    Keith Hemmelman <khemmelman@p...> wrote:
                                    >>
                                    I don't disagree that the purpose of this group should be to the high
                                    end developer.
                                    <<

                                    That isn't true and I missed it if anyone said it.

                                    > I will say that the description of the group at the Yahoo page
                                    > doesn't say that though when I signed up. It says:
                                    > ===========
                                    > Description
                                    > An effort to accumulate and redistribute baseball data in a
                                    > convenient and easy to use form.
                                    > ===========

                                    That's right. 'BDB-Design' and 'BDB-Board' are distinct egroups.
                                    'baseball-databank' needs to serve the accumulation and correction of
                                    data. We have a problem if this discussion overwhelms those functions
                                    and maybe a fatal problem if this discussion chases those activities away.

                                    Paul Wendt
                                  • Michael Westbay
                                    Message 18 of 20 , Jan 27, 2006
                                      As Tangotiger stated, "it's not if, but when." Many concerns were
                                      brought up regarding the move to numeric keys, but the most important
                                      one to me would be the maintainability question.

                                      1. Do we have the administrative front end (web site) in place to
                                      maintain the database with all necessary views and/or cross references?
                                      (These would be for the current format, but their existence would mean
                                      easy porting to the numeric format.)

                                      2. Do we have scripts/applications in place to handle receiving data
                                      with alphanumeric Lahman IDs and insert/update them with the numeric
                                      Lahman IDs?

                                      3. If "no" to #2, are these planned?

                                      4. All players already have a numeric ID in the LahmanID field. I
                                      would assume that this will not change.

                                      There has been a lot of discussion about the difference between the
                                      administrative and distributed databases, but very little about how the
                                      administrative version will be handled. These questions are geared
                                      toward focusing on the maintainability of the administrative version,
                                      and whether or not *now* is the time to make the move.

                                      If too many of the above questions are answered "no," then the question
                                      becomes:

                                      5. Do we just want to dive in head first and hope the pool isn't too
                                      shallow?

                                      --
                                      Michael Westbay
                                      Writer/System Administrator
                                      http://JapaneseBaseball.com
                                      Public Key: http://www.japanesebaseball.com/keys/westbaystars.gpgkey
                                    • Sean Forman
                                      Message 19 of 20 , Jan 29, 2006
                                        On 1/27/06, Michael Westbay <westbaystars@...> wrote:
                                        > As Tangotiger stated, "it's not if, but when." Many concerns were
                                        > brought up regarding the move to numeric keys, but the most important
                                        > one to me would be the maintainability question.
                                        >
                                        > 1. Do we have the administrative front end (web site) in place to
                                        > maintain the database with all necessary views and/or cross references?
                                        > (These would be for the current format, but their existence would mean
                                        > easy porting to the numeric format.)


                                        I have set up a PostgreSQL database for the board members with usernames and passwords, but few asked for their passwords, and fewer still gave me feedback on whether it was working for them or whether we should try MySQL or something else. I have made some headway incorporating triggers, and I believe we could use an audit table to track changes and/or I could set up the update log features.
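
To make the trigger-plus-audit-table idea concrete, here is a minimal sketch. It is not the setup on the board's PostgreSQL server: it uses Python's built-in sqlite3 module so that it runs anywhere, and the table, column, and trigger names (master, master_audit, birth_year, master_birth_year_audit) are hypothetical. PostgreSQL would express the same idea with a trigger function, and its syntax differs.

```python
import sqlite3

# Hypothetical schema for illustration only; not the real BDB layout.
SCHEMA = """
CREATE TABLE master (
    lahman_id  INTEGER PRIMARY KEY,
    name_last  TEXT NOT NULL,
    birth_year INTEGER
);

CREATE TABLE master_audit (
    audit_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    lahman_id   INTEGER NOT NULL,
    column_name TEXT NOT NULL,
    old_value   TEXT,
    new_value   TEXT,
    changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Record every real change to birth_year; skip updates that change nothing.
CREATE TRIGGER master_birth_year_audit
AFTER UPDATE OF birth_year ON master
WHEN OLD.birth_year IS NOT NEW.birth_year
BEGIN
    INSERT INTO master_audit (lahman_id, column_name, old_value, new_value)
    VALUES (OLD.lahman_id, 'birth_year', OLD.birth_year, NEW.birth_year);
END;
"""


def build_demo_db():
    """Create an in-memory database with the table, audit table, and trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def apply_correction(conn, lahman_id, birth_year):
    """Apply a data correction; the trigger writes the audit row."""
    conn.execute(
        "UPDATE master SET birth_year = ? WHERE lahman_id = ?",
        (birth_year, lahman_id),
    )


def audit_rows(conn):
    """Return the audit trail in the order the changes were made."""
    return conn.execute(
        "SELECT lahman_id, column_name, old_value, new_value "
        "FROM master_audit ORDER BY audit_id"
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    # Made-up player row, used only to exercise the trigger.
    conn.execute(
        "INSERT INTO master (lahman_id, name_last, birth_year) "
        "VALUES (1, 'Example', 1900)"
    )
    apply_correction(conn, 1, 1901)
    for row in audit_rows(conn):
        print(row)
```

Because the trigger lives in the database, every correction is logged no matter which client or script makes it. Values are stored as text in the audit table so that one table can track columns of different types.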
                                         

                                        > 2. Do we have scripts/applications in place to handle receiving data
                                        > with alphanumeric Lahman IDs and insert/update them with the numeric
                                        > Lahman IDs?
                                        >
                                        > 3. If "no" to #2, are these planned?


                                        Not to be glib, but that really won't take any time at all. I already have something that converts names to alphanumeric Lahman IDs, which is a harder problem to solve. I could pretty easily do the same thing to convert alphanumeric keys to numeric keys.

                                        http://www.baseball-reference.com/friv/link_players.cgi
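
A rough sketch of what such an alphanumeric-to-numeric translation step could look like follows. This is not the code behind the link above. It assumes the master table can be exported as CSV with columns named lahmanID and playerID (the column names are an assumption), and the helper names load_key_map and translate_rows are made up for illustration, as are the sample IDs.

```python
import csv
import io


def load_key_map(master_csv):
    """Build {alphanumeric ID: numeric ID} from a CSV export of the master table.

    Assumes columns named 'playerID' and 'lahmanID' (hypothetical names).
    """
    reader = csv.DictReader(master_csv)
    return {row["playerID"]: int(row["lahmanID"]) for row in reader}


def translate_rows(rows, key_map):
    """Swap the alphanumeric key for the numeric one in each incoming row.

    Returns (translated, unmatched) so unknown IDs can be reviewed by hand
    instead of being silently dropped or inserted under a wrong key.
    """
    translated, unmatched = [], []
    for row in rows:
        numeric_id = key_map.get(row["playerID"])
        if numeric_id is None:
            unmatched.append(row)
            continue
        new_row = {k: v for k, v in row.items() if k != "playerID"}
        new_row["lahmanID"] = numeric_id
        translated.append(new_row)
    return translated, unmatched


if __name__ == "__main__":
    # Made-up IDs and stats, for demonstration only.
    master = io.StringIO("lahmanID,playerID\n1,doejo01\n2,roeja01\n")
    key_map = load_key_map(master)
    incoming = [
        {"playerID": "roeja01", "HR": "3"},
        {"playerID": "nobody01", "HR": "0"},
    ]
    good, bad = translate_rows(incoming, key_map)
    print("translated:", good)
    print("needs review:", bad)
```

Contributors could keep submitting data under the alphanumeric IDs they already use. The only manual work would be resolving the rows that land in the unmatched list.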
                                         

                                        > 4. All players already have a numeric ID in the LahmanID field. I
                                        > would assume that this will not change.



                                        Of course not.
                                         

                                        > There has been a lot of discussion about the difference between the
                                        > administrative and distributed databases, but very little about how the
                                        > administrative version will be handled. These questions are geared
                                        > toward focusing on the maintainability of the administrative version,
                                        > and whether or not *now* is the time to make the move.
                                        >
                                        > If too many of the above questions are answered "no," then the question
                                        > becomes:


                                        My hope when proposing a board structure was to distribute some of the work.  I hoped that other people who seem to have good ideas would step up and do some work on the project and remove me as the project bottleneck.  The role of the board, as I see it, is to lead the project through doing some work.

                                         If you feel those things above need to be done, how do you propose that they get done?

                                        Sincerely,        
                                        Sean Forman

                                        Baseball Stats!   http://www.Baseball-Reference.com/
                                      • Sean Forman
                                        Message 20 of 20 , Jan 29, 2006
                                          Charles,

                                          Thank you for the suggestion. Those keys are based on STATS
                                          incorporated's data feed. I would be uncomfortable using another
                                          company's ID system for our db. I've seen cases where they have
                                          changed over time.

                                          On 1/26/06, Charles Creasy <chazcreasy1@...> wrote:
                                          > I'm unsure on the method used for assigning the keys used, but I believe
                                          > yahoo sports, espn, cnnsi, and a few other sites all have the same numeric
                                          > key for each player. If a change is made, assigning keys based on that
                                          > algorithm would make including data from those places that much easier.
                                          >


                                          Sincerely,
                                          Sean Forman

                                          Baseball Stats! http://www.Baseball-Reference.com/