Loading ...
Sorry, an error occurred while loading the content.
 

Re: [baseball-databank] A grabbag of thoughts : Player ID's

Expand Messages
  • Sean Forman
    ... Sorry, I haven t been responding to notes on the board. The messages exploded just as I was leaving for SABR. Here are my thoughts on the player ID s.
    Message 1 of 13 , Jul 2, 2002
      > Have we come to any sort of decision on the player ID's? The reason I ask
      > is because Craig Tomarkin and Jim Albright have posted a lot of Japanese
      > data to their website. It would be nice to link this in, but the players
      > need to be ID'ed. I'd be happy to approach them about the process (I
      > wouldn't know how to tell if a name was shared by two different players),
      > but we should have our ducks in a row first, I think. At the same time,
      > I'd rather not wait too long to approach them, since as their DB grows,
      > it'll be a more arduous process to merge.

      > Regards,
      > Derek Adair

      Sorry, I haven't been responding to notes on the board. The messages
      exploded just as I was leaving for SABR. Here are my thoughts on the
      player ID's. I'll add more for the team and game ID's later.

      ------------------------------------------------------------------------
      >> By TangoTiger
      > I propose that each category of "people" gets its own set
      > of unique ids, and we create x-ref tables for every known matches.
      > This might get out of hand as you'd have a majors/minors,
      > minors/umpires, majors/managers, etc, etc, etc combinations.

      > I think we should try to avoid key changes as much as possible. If I
      > worked for Howe Sports Data I would say do the singlekey thing. But
      > that's cause I'd get paid to spend alot of time to get the data
      > right. In our case, we'd make so many mistakes that we'd end up with
      > many key changes. And this is never a good idea.



      I agree with this. I think the Master table would then have something
      like columns. Each of the categories below would be 5-2-2 (or if
      working with existing sets whatever their set uses) with the
      first five or two alphabetic characters of their last or first name
      used. (meaning Paul O'Neill is oneilpa01 and C.C. Sabathia is
      sabatcc01). Each category could also end in a suffix to differentiate
      each category's ID type.

      lahman_ID - which might be an auto-number column used to provide a
      unique key for all entries, or could be the current ID's
      used. We might want to make this a shorter or longer key
      in order to differentiate it from the rest. 3-1-4 or 7-3-2
      player_ID
      manager_ID
      coach_ID
      minors_ID
      HOF_ID
      ump_ID
      japan_ID
      executive_ID
      broadcast_ID
      negrolg_ID
      college_ID

      <Other systems>
      retrosheet_ID
      TB_ID
      lahman_4.0_ID
      lahman_4.5_ID

      followed by their biographical data: name, birth,death,height, weight,
      bats, throws

      This method would have a couple of nice benefits.

      The project could branch more easily. Let's say someone starts doing
      minor league data for the last twenty years and another person does
      executive data for the last twenty years and there are ten John Smiths
      in the minors and five John Smiths as executives. Each person could
      number them as they see fit and then the merge would involve matching
      up their tables. Then if we find out that the fifth minor league John
      Smith and the second exec John Smith are the same guy, fixing that is
      a breeze. Literally, just two sql commands.

      Adding a new dataset (like negro league data) would be much, much
      easier.

      Some queries would be very easy. Players who were also managers, e.g.

      Disadvanatages would include table size, lack of a consistent ID
      across tables for people like Clark Griffiths, John Ward, etc.

      Queries would become more complicated in some ways.

      Much more complicated.


      ------------------------
      USING Year of Birth in person ID's
      ------------------------

      I don't like this idea very much. I know that the odds are unlikely
      of a repeated key, but it definitely is possible. We came close with
      two Jeff D'Amico's or two Freddy Garica's. While sequential ordering
      is a pain, I think it is the worst solution, except for all the
      others.

      ------------------------------------
      Adding in Japanese data
      ------------------------------------
      >> Sean Lahman wrote
      > As to the broader issue, I'm not sure that it makes sense to
      incorporate the
      > Japanese data. Compatibility would be great, but it would represent a
      huge
      > increase in size. I'm not sure that there is enough interest to
      warrant this
      > move. It makes more sense for this to be an optional add-on. The same
      holds
      > for the minor league data, negro leagues, women, etc.


      Agreed. I envision a slim, basic database with a lot of tables that can
      be downloaded and added in by hand a la carte.


      Sincerely,
      Sean Forman

      Baseball Stats! http://www.Baseball-Reference.com/
      Baseball Analysis! http://www.BaseballPrimer.com/
    • Derek Adair
      ... Perhaps these should be different tables - an IDReference table and a bio data table. Birth and death are each six fields, and there s a lot of name data
      Message 2 of 13 , Jul 2, 2002
        On Tue, 2 Jul 2002, Sean Forman wrote:

        > lahman_ID - which might be an auto-number column used to provide a
        > unique key for all entries, or could be the current ID's
        > used. We might want to make this a shorter or longer key
        > in order to differentiate it from the rest. 3-1-4 or 7-3-2
        > player_ID
        > manager_ID
        > coach_ID
        > minors_ID
        > HOF_ID
        > ump_ID
        > japan_ID
        > executive_ID
        > broadcast_ID
        > negrolg_ID
        > college_ID
        >
        > <Other systems>
        > retrosheet_ID
        > TB_ID
        > lahman_4.0_ID
        > lahman_4.5_ID
        >
        > followed by their biographical data: name, birth,death,height, weight,
        > bats, throws

        Perhaps these should be different tables - an IDReference table and a bio
        data table. Birth and death are each six fields, and there's a lot of
        "name data" to be had as well (birth names, nicknames, etc.)

        I think the auto-increment is workable as long as its scope is limited.
        The above design seems to meet that need. (Except of course when I start
        talking about separate bio/ID tables).

        Regards,
        Derek Adair
        dadair@...
      • Sean Forman
        ... I don t think we want to split those. The only one we might want to split are bats throws, height weight since those are essentially player data values. I
        Message 3 of 13 , Jul 2, 2002
          Derek Adair wrote:
          > On Tue, 2 Jul 2002, Sean Forman wrote:
          >
          >
          >>lahman_ID - which might be an auto-number column used to provide a
          >> unique key for all entries, or could be the current ID's
          >> used. We might want to make this a shorter or longer key
          >> in order to differentiate it from the rest. 3-1-4 or 7-3-2
          >>player_ID
          >>manager_ID
          >>coach_ID
          >>minors_ID
          >>HOF_ID
          >>ump_ID
          >>japan_ID
          >>executive_ID
          >>broadcast_ID
          >>negrolg_ID
          >>college_ID
          >>
          >> <Other systems>
          >>retrosheet_ID
          >>TB_ID
          >>lahman_4.0_ID
          >>lahman_4.5_ID
          >>
          >>followed by their biographical data: name, birth,death,height, weight,
          >>bats, throws
          >
          >
          > Perhaps these should be different tables - an IDReference table and a bio
          > data table. Birth and death are each six fields, and there's a lot of
          > "name data" to be had as well (birth names, nicknames, etc.)
          >
          > I think the auto-increment is workable as long as its scope is limited.
          > The above design seems to meet that need. (Except of course when I start
          > talking about separate bio/ID tables).
          >
          > Regards,
          > Derek Adair
          > dadair@...


          I don't think we want to split those. The only one we might want to
          split are bats throws, height weight since those are essentially player
          data values.

          I think we want the name, birth, death in the main master table because
          every person in that table has that data even if we don't know what it
          is. Even though they don't have bats/throws, I think that should go in
          there just for ease of use.

          (also not only players will have bats/throws, but also college, minors,
          japanese, etc. will too).

          Sincerely,
          Sean Forman

          Baseball Stats! http://www.Baseball-Reference.com/
          Baseball Analysis! http://www.BaseballPrimer.com/
        • elbowruth
          ... Before I get much farther into a project, I d like to know what is planned for umpires. Do any of you already have a database on them? I have a rudimentary
          Message 4 of 13 , Jul 4, 2002
            --- In baseball-databank@y..., Sean Forman <sean-forman@b...> wrote:

            > I agree with this. I think the Master table would then have something
            > like columns.[...]
            > lahman_ID - which might be an auto-number column used to provide a
            > unique key for all entries, or could be the current ID's
            > used. We might want to make this a shorter or longer key
            > in order to differentiate it from the rest. 3-1-4 or 7-3-2
            > player_ID
            > manager_ID
            > coach_ID
            > minors_ID
            > HOF_ID
            > ump_ID
            > japan_ID
            > executive_ID
            > broadcast_ID
            > negrolg_ID
            > college_ID
            >
            > <Other systems>
            > retrosheet_ID
            > TB_ID
            > lahman_4.0_ID
            > lahman_4.5_ID
            >
            > followed by their biographical data: name, birth,death,height, weight,
            > bats, throws

            Before I get much farther into a project, I'd like to know what is
            planned for umpires. Do any of you already have a database on them?

            I have a rudimentary database of umpires taken from TBB that I started
            when Phyllis was chairing the Umpire and Rules Committee. Work on it
            has been sporadic, but after talking to John Schwartz in Boston, who
            is keeping his research on lined paper, I am convinced that the data
            must be put into electronic form. Since I have started it, I may as
            well continue. My goal is to add game assignments for every umpire
            using Retrosheet data as well as Schwartz's.

            I do have one comment and one question about ID's:

            1. When doing a database for which there is on-going research and the
            potential of discovering new umpires/players, using a sequential
            numbering system for duplicates is not optimal because assignments
            might get out of chronological order (aesthetically inferior, if
            nothing else). This is an argument in favor of a system I have seen
            discussed of incorporating the birthdate into the ID.

            2. On the other hand, if you don't have birthdate information
            available, it is difficult to incorporate it (and I don't for
            umpires). What are the rules for the ID format that I should use?
            First five letters of last name; first two of first name, two-digit
            sequential number? Remove apostrophes? Pad short last names?

            Clifford Otto
          • tom tom
            ... When designing a database, the optimal design is to create a key that is numeric and means absolutely nothing to anyone other than other tables in the
            Message 5 of 13 , Jul 5, 2002
              --- elbowruth <baseball318@...> wrote:
              ... using
              > a sequential
              > numbering system for duplicates is not optimal
              > because assignments
              > might get out of chronological order (aesthetically
              > inferior, if
              > nothing else). This is an argument in favor of a
              > system I have seen
              > discussed of incorporating the birthdate into the
              > ID.

              When designing a database, the optimal design is to
              create a key that is numeric and means absolutely
              nothing to anyone other than other tables in the
              database. Whether you choose to have sequentially
              assigned numbers, or random numbers or whatever is not
              important. There is no need whatsovever, from a
              design viewpoint, to have things in "chronological
              order" (whether that means by date of birth, or debut
              date or whatever).

              So, why do we sometimes create keys that actually mean
              something, like having partial names or birthdate or
              other identifiers? For the most part, it's for
              convenience. Databases aren't meant to be accessed by
              people at the table level. They are meant to be
              accessed by applications that follow the rules of the
              database.

              The Lahman database is a user-friendly database
              meaning that certain rules have to be broken in order
              for clarity to reign. Allowing the 5-2-2 scheme for a
              player id is simply far more convenient because there
              is no one authority that has access to everything and
              is willing to spend the time on it.

              We should not try to design a solution for keys other
              than for clarity. The optimal design would have keys
              that are meaningless to the user, but mean everything
              to the database.

              Thanks, Tom

              __________________________________________________
              Do You Yahoo!?
              Sign up for SBC Yahoo! Dial - First Month Free
              http://sbc.yahoo.com
            • elbowruth
              ... [...] When it comes to databases, I muddle along, but I do know that if I want the career stats of Marty Barrett of recent vintage, it is easier to deal
              Message 6 of 13 , Jul 5, 2002
                --- In baseball-databank@y..., tom tom <tmasc@y...> wrote:
                >
                > --- elbowruth <baseball318@h...> wrote:
                > ... using
                > > a sequential
                > > numbering system for duplicates is not optimal
                > > because assignments
                > > might get out of chronological order (aesthetically
                > > inferior, if
                > > nothing else). This is an argument in favor of a
                > > system I have seen
                > > discussed of incorporating the birthdate into the
                > > ID.
                >
                > When designing a database, the optimal design is to
                > create a key that is numeric and means absolutely
                > nothing to anyone other than other tables in the
                > database. Whether you choose to have sequentially
                > assigned numbers, or random numbers or whatever is not
                > important. There is no need whatsovever, from a
                > design viewpoint, to have things in "chronological
                > order" (whether that means by date of birth, or debut
                > date or whatever).
                >
                > So, why do we sometimes create keys that actually mean
                > something, like having partial names or birthdate or
                > other identifiers? For the most part, it's for
                > convenience. Databases aren't meant to be accessed by
                > people at the table level. They are meant to be
                > accessed by applications that follow the rules of the
                > database.
                [...]

                When it comes to databases, I muddle along, but I do know that if I
                want the career stats of Marty Barrett of recent vintage, it is easier
                to deal with barrema02 than with 12179 if I don't want to bother
                writing a query. This is just a personal preference. All I want to do
                is use the same format others are using. For example, is Bill O'Dell
                properly odellbi01 or o'delbi01 in the 5-2-2 system?

                Thanks,
                Clifford Otto
              • tom tom
                ... I was responding to your chronological thought on umpire IDs, and the reason that we want to use numeric keys. I reiterated that for user-friendly
                Message 7 of 13 , Jul 5, 2002
                  --- elbowruth <baseball318@...> wrote:
                  > When it comes to databases, I muddle along, but I do
                  > know that if I
                  > want the career stats of Marty Barrett of recent
                  > vintage, it is easier
                  > to deal with barrema02 than with 12179 if I don't
                  > want to bother
                  > writing a query. This is just a personal preference.
                  > All I want to do
                  > is use the same format others are using. For
                  > example, is Bill O'Dell
                  > properly odellbi01 or o'delbi01 in the 5-2-2 system?
                  >
                  >
                  I was responding to your chronological thought on
                  umpire IDs, and the reason that we want to use numeric
                  keys. I reiterated that for user-friendly databases,
                  clarity is more important, and so, we don't want a
                  numeric key. So, I agree.

                  As for the 5-2-2 system, I don't think we've come to a
                  conclusion. My preference is that we collapse all the
                  apostophes and various characters away first, and then
                  select the 5-2-2 from the remaining letters. So, Paul
                  O'Neill is oneilpa01. I don't want to sit there and
                  figure our if D'Angelo is D'Angelo or Dangelo.

                  Thanks, Tom

                  __________________________________________________
                  Do You Yahoo!?
                  Sign up for SBC Yahoo! Dial - First Month Free
                  http://sbc.yahoo.com
                • Derek Adair
                  ... I think this is what has been tentatively decided on. There s another issue here, though - since Clifford is working on umpiring data, it s necessary to
                  Message 8 of 13 , Jul 5, 2002
                    At 08:59 AM 7/5/02, you wrote:
                    >> All I want to do
                    >> is use the same format others are using. For
                    >> example, is Bill O'Dell
                    >> properly odellbi01 or o'delbi01 in the 5-2-2 system?

                    >As for the 5-2-2 system, I don't think we've come to a
                    >conclusion. My preference is that we collapse all the
                    >apostophes and various characters away first, and then
                    >select the 5-2-2 from the remaining letters. So, Paul
                    >O'Neill is oneilpa01. I don't want to sit there and
                    >figure our if D'Angelo is D'Angelo or Dangelo.

                    I think this is what has been tentatively decided on. There's
                    another issue here, though - since Clifford is working on
                    umpiring data, it's necessary to come to a decision on the
                    suffix issue, and if ID per "involvement category" is the
                    scheme that works best. I happen to think so, and think "u"
                    as a suffix is the obvious choice for the umpiring data.
                    So Charlie Reliford would be relifch01u.

                    I was planning on starting a set of umpire tables; I'm glad
                    someone else with more experience and more data is taking
                    up the charge. This project gets more exciting by the day.

                    Regards,
                    Derek Adair
                    dadair@...
                  • tom tom
                    ... I just want to reiterate my point that we should have aloufe01 ...MLB player aloufe01m ...minor league player aloufe01g ...manager and that we should not
                    Message 9 of 13 , Jul 5, 2002
                      --- Derek Adair <dadair@...> wrote:
                      > I think this is what has been tentatively decided
                      > on. There's
                      > another issue here, though - since Clifford is
                      > working on
                      > umpiring data, it's necessary to come to a decision
                      > on the
                      > suffix issue, and if ID per "involvement category"
                      > is the
                      > scheme that works best. I happen to think so, and
                      > think "u"
                      > as a suffix is the obvious choice for the umpiring
                      > data.
                      > So Charlie Reliford would be relifch01u.
                      >
                      I just want to reiterate my point that we should have
                      aloufe01 ...MLB player
                      aloufe01m ...minor league player
                      aloufe01g ...manager

                      and that we should not have aloufe01 for all three.
                      We don't want to start doing a whole bunch of key
                      changes every time we discover that aloufe01g is
                      actually aloufe01.

                      Instead, we have xref tables like
                      People_Table
                      MLB,Minor,Manager,Umpire,Japanese
                      aloufe01,aloufe01m,aloufe01g,<null>

                      It may take time to discover that all these alous is
                      the same guy, and we may have:
                      People_Table
                      MLB,Minor,Manager,Umpire,Japanese
                      aloufe01,aloufe01m,<null>,<null>
                      <null>,<null>,aloufe01g,<null>

                      Once we discover that it's the same alou, we collapse
                      his two records into one, AND WE DON'T TOUCH ANY OF
                      HIS OTHER TABLES, be it hitting, fielding, etc, etc,
                      etc.

                      This will allow everyone to work on their own
                      "leagues" (minors, jap, umpires, etc), without
                      affecting those people who don't care about them.

                      Thanks, Tom


                      __________________________________________________
                      Do You Yahoo!?
                      Sign up for SBC Yahoo! Dial - First Month Free
                      http://sbc.yahoo.com
                    • Derek Adair
                      ... I think this is what was planned, with the exception of the ID s all being in the Master table. Regards, Derek Adair dadair@iglou.com
                      Message 10 of 13 , Jul 5, 2002
                        On Fri, 5 Jul 2002, tom tom wrote:

                        > Instead, we have xref tables like
                        > People_Table
                        > MLB,Minor,Manager,Umpire,Japanese
                        > aloufe01,aloufe01m,aloufe01g,<null>
                        >
                        > It may take time to discover that all these alous is
                        > the same guy, and we may have:
                        > People_Table
                        > MLB,Minor,Manager,Umpire,Japanese
                        > aloufe01,aloufe01m,<null>,<null>
                        > <null>,<null>,aloufe01g,<null>
                        >
                        > Once we discover that it's the same alou, we collapse
                        > his two records into one, AND WE DON'T TOUCH ANY OF
                        > HIS OTHER TABLES, be it hitting, fielding, etc, etc,
                        > etc.
                        >
                        > This will allow everyone to work on their own
                        > "leagues" (minors, jap, umpires, etc), without
                        > affecting those people who don't care about them.

                        I think this is what was planned, with the exception of the ID's all being
                        in the Master table.

                        Regards,
                        Derek Adair
                        dadair@...
                      • tom tom
                        The IDs can t be in the master table, because you will have null fields in the key fields. This people table will have to have a sequential numeric key,
                        Message 11 of 13 , Jul 5, 2002
                          The IDs can't be in the master table, because you will
                          have null fields in the key fields.

                          This "people table" will have to have a "sequential"
                          numeric key, assigned internally. Or you have this
                          table with no key whatsoever.

                          Tom


                          --- Derek Adair <dadair@...> wrote:
                          > On Fri, 5 Jul 2002, tom tom wrote:
                          >
                          > > Instead, we have xref tables like
                          > > People_Table
                          > > MLB,Minor,Manager,Umpire,Japanese
                          > > aloufe01,aloufe01m,aloufe01g,<null>
                          > >
                          > > It may take time to discover that all these alous
                          > is
                          > > the same guy, and we may have:
                          > > People_Table
                          > > MLB,Minor,Manager,Umpire,Japanese
                          > > aloufe01,aloufe01m,<null>,<null>
                          > > <null>,<null>,aloufe01g,<null>
                          > >
                          > > Once we discover that it's the same alou, we
                          > collapse
                          > > his two records into one, AND WE DON'T TOUCH ANY
                          > OF
                          > > HIS OTHER TABLES, be it hitting, fielding, etc,
                          > etc,
                          > > etc.
                          > >
                          > > This will allow everyone to work on their own
                          > > "leagues" (minors, jap, umpires, etc), without
                          > > affecting those people who don't care about them.
                          >
                          > I think this is what was planned, with the exception
                          > of the ID's all being
                          > in the Master table.
                          >
                          > Regards,
                          > Derek Adair
                          > dadair@...
                          >
                          >


                          __________________________________________________
                          Do You Yahoo!?
                          Sign up for SBC Yahoo! Dial - First Month Free
                          http://sbc.yahoo.com
                        • Derek Adair
                          ... The master table will have the sequential ID. Message #560 in the archives details this proposal (and my suggestion which was pretty much what you re
                          Message 12 of 13 , Jul 5, 2002
                            At 05:09 PM 7/5/02, you wrote:
                            >The IDs can't be in the master table, because you will
                            >have null fields in the key fields.
                            >
                            >This "people table" will have to have a "sequential"
                            >numeric key, assigned internally. Or you have this
                            >table with no key whatsoever.

                            The master table will have the sequential ID. Message #560
                            in the archives details this proposal (and my suggestion
                            which was pretty much what you're saying - I agree with
                            the points Sean F. makes in that thread as to keeping that
                            data together).

                            Regards,
                            Derek
                          • Sean Forman
                            ... The NULL field issue is going to be a tough one because we will have a lot of NULL fields in the Master table once we start to add things like
                            Message 13 of 13 , Jul 7, 2002
                              Derek Adair wrote:
                              > At 05:09 PM 7/5/02, you wrote:
                              >
                              >>The IDs can't be in the master table, because you will
                              >>have null fields in the key fields.
                              >>
                              >>This "people table" will have to have a "sequential"
                              >>numeric key, assigned internally. Or you have this
                              >>table with no key whatsoever.
                              >
                              >
                              > The master table will have the sequential ID. Message #560
                              > in the archives details this proposal (and my suggestion
                              > which was pretty much what you're saying - I agree with
                              > the points Sean F. makes in that thread as to keeping that
                              > data together).
                              >
                              > Regards,
                              > Derek


                              The NULL field issue is going to be a tough one because we will have a
                              lot of NULL fields in the Master table once we start to add things like
                              broadcaster_ID, japan_ID, etc. And NULL fields make lookups tough at
                              least in MySQL since you can't create indices on fields with NULL
                              entries. In practice changing these to empty entries and creating
                              indices on these fields will be the best choices for the users, but I
                              think for the released database we should use NULL instead of other stuff.

                              Just to sum things up to this point.

                              Unless otherwise stated. Everything is 5-2-2 with non-alphanumeric
                              characters removed (and a suffix added for many categories of players).

                              Jin Ho Cho choji01
                              Paul O'Neill oneilpa01
                              D'Angelo Jimenez jimenda01
                              J.T. Snow snowjt01
                              Ty Van Burkelo vanbuty01

                              and so on

                              The Master table will have

                              lahmanID an auto-numbered field that will serve as the primary key
                              (other choice is something like 3-1-4)

                              playerID - no suffix
                              minors - 'm' suffix
                              managerID - 'g'
                              HOFID - 'h'
                              negrolgID - 'n'

                              and so on.


                              Question?

                              Should the Awards table be split into Awards_players, Awards_managers,
                              Awards_execs, or should we add an awardsID? I would do the first,
                              personally.

                              later,
                              sean

                              Baseball Stats! http://www.Baseball-Reference.com/
                              Baseball Analysis! http://www.BaseballPrimer.com/
                            Your message has been successfully submitted and would be delivered to recipients shortly.