
Policy 002 - Scope, Data to include

  • Sean Forman
    Message 1 of 13, Aug 25, 2003
      What data should be included?

      At what level (career, season, game, play-by-play, pitch-by-pitch,
      anything)?

      From what leagues (majors, minors, college, foreign, international,
      etc.)?

      What is the core DB (meaning what appears in the main BDB zip file)?
      -------------------------------------

      The glib answer to the above is "all."

      I feel we should focus first on complete season data. This would
      jibe well with Retrosheet's play-by-play data and would keep us
      from reinventing the wheel.

      With regard to leagues, I would categorically avoid incorporating
      Little League, high school, and Legion data. ;-)

      As for other levels, I think that we can include what becomes available
      through the work of the BDB members.

      Other issues I'm missing?


      --
      Sincerely,
      Sean Forman

      Baseball Stats! http://www.Baseball-Reference.com/
      Baseball Analysis! http://www.BaseballPrimer.com/
    • KJOK
      Message 2 of 13, Aug 25, 2003
        --- In baseball-databank@yahoogroups.com, Sean Forman
        <sean-forman@b...> wrote:
        > What data should be included?
        >
        > At what level (career, season, game, play-by-play, pitch-by-pitch,
        > anything),
        >
        > from what leagues (majors,
        > minors, college, foreign, international, etc.)?
        >
        > What is is the core dB (meaning what appears in the main BDB zip file)?
        > -------------------------------------
        >
        > The glib answer to the above is all.
        >
        > I feel like we should focus first off on complete season data. This
        > would jibe well with retrosheet's play-by-play data and would keep us
        > from recreating the wheel.
        >
        > With regard to leagues, etc. I would uncategorically avoid the
        > incorporation of little league, high school, and legion data. ;-)
        >
        > As for other levels, I think that we can include what becomes available
        > through the work of the BDB members.
        >
        > Other issues I'm missing?
        >
        > --
        > Sincerely,
        > Sean Forman

        I think seasonal summarized player/team data should be the "core"
        product. Career total summarized tables can be a higher
        detail "module" of the database. Game data can be a lower
        detail "module" of the database. Anything lower than game data, such
        as play-by-play or pitch-by-pitch, is probably getting too low detail
        for the scope of this database, and should be left to Retrosheet.

        Situational data, such as LH/RH, Home/Away, etc. SUMMARIZED by
        seasonal totals, should also be included, but perhaps as a separate
        module from the core db.

        LEAGUES - US Major Leagues, Major Japanese Leagues, Major Negro
        Leagues, US Minor Leagues and US Major College Leagues, in roughly
        that order, would seem to be the most logical compilation. Again,
        however, maybe Negro, Japanese, Minor leagues, etc. should be
        considered as separate modules from the core db.
      • Mike Crain
        Message 3 of 13, Aug 25, 2003
          --- KJOK <kjokbaseball@...> wrote:
          > I think seasonal summarized player/team data should be the "core"
          > product. Career total summarized tables can be a higher
          > detail "module" of the database. Game data can be a lower
          > detail "module" of the database. Anything lower than game data, such
          > as play-by-play or pitch-by-pitch, is probably getting too low detail
          > for the scope of this database, and should be left to retrosheet.


          I completely agree.
          >
          > Situational data, such as LH/RH, Home/Away, etc. SUMMARIZED by
          > seasonal totals, should also be included, but perhaps as a separate
          > module from the core db.

          Good Idea.

          >
          > LEAGUES - US Major Leagues, Major Japanese Leagues, Major Negro
          > Leagues, US Minor Leagues and US Major College Leagues, in roughly
          > that order, would seem to be the most logical compilation. Again,
          > however, maybe Negro, Japanese, Minor leagues, etc. should be
          > considered as separate modules from the core db.

          I'm planning Senior League data, so I think separate is sort of good.
          However, the MASTER table would then become an issue. What if I want
          Senior League and Negro League data, but not Minors or Japan? Do I
          have a separate MASTER? I think the MASTER should have IDs for all of
          the above (Major Japanese Leagues, Major Negro Leagues, US Minor
          Leagues, and US College Leagues), but maybe those would be considered
          optional "plug-ins".

        • tmasc@yahoo.com
          Message 4 of 13, Aug 25, 2003
            --- Mike Crain <ucraimx@...> wrote:
            > but maybe those would be considered optional
            > "plug-ins".
            >

            The DB developer (the target audience) will download
            everything and, based on the level of data he wants to
            show (MLB only, say), will create a set of tables
            built only on that data.

            So, if Lean Sahman wants to advertise the complete MLB
            historical database, then he provides his customers
            with only that, and he would not only extract the
            subset of tables that he needs, but also the subset of
            records and subset of fields.

            Tom



            __________________________________
            Do you Yahoo!?
            Yahoo! SiteBuilder - Free, easy-to-use web site design software
            http://sitebuilder.yahoo.com
          • Michael Westbay
            Message 5 of 13, Aug 25, 2003
              KJOK wrote:

              >I think seasonal summarized player/team data should be the "core"
              >product. Career total summarized tables can be a higher
              >detail "module" of the database. Game data can be a lower
              >detail "module" of the database. [...]
              >
              >[...]
              >
              >LEAGUES - US Major Leagues, Major Japanese Leagues, Major Negro
              >Leagues, US Minor Leagues and US Major College Leagues, in roughly
              >that order, would seem to be the most logical compilation. Again,
              >however, maybe Negro, Japanese, Minor leagues, etc. should be
              >considered as separate modules from the core db.
              >

              I get the impression that the word "module" in the first half means
              "table." Is that also what you propose for Japanese, Minor, Negro, etc.
              leagues? I envision all common data within the same tables,
              distributions having the option of extracting data based on league.

              One thing I would like to do is have a player's full professional
              history available on their player profile page on my site. If all the
              data is in a single table, it's much easier to write a query to go
              through it sorted by league, year, term than to try to merge queries
              from a number of different tables.
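
A minimal sketch of the single-table query described above, using Python's built-in sqlite3 module. The table layout, the column names (playerID, league, yearID, term), the league codes, and the player IDs are all assumptions made for illustration; the thread does not define a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical combined batting table: every league shares one layout.
cur.execute(
    "CREATE TABLE Batting ("
    "playerID TEXT, league TEXT, yearID INTEGER, term INTEGER, "
    "AB INTEGER, H INTEGER)"
)
cur.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("player01", "NPB", 1999, 1, 400, 130),
        ("player01", "MLB", 2001, 1, 250, 70),
        ("player01", "MLB", 2001, 2, 200, 60),
        ("player02", "MLB", 2001, 1, 300, 90),
    ],
)

# One query returns the whole professional history of one player,
# sorted by league, year, and term.
career = cur.execute(
    "SELECT league, yearID, term, AB, H FROM Batting "
    "WHERE playerID = ? ORDER BY league, yearID, term",
    ("player01",),
).fetchall()

for row in career:
    print(row)
```

With separate per-league tables, the same profile page would need one query per table plus a merge step in application code.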

              As Crain-san pointed out, the Master table MUST have IDs from all of
              the different leagues, rather than there being a master for each
              league. Having one ID per person will make a given player's career,
              from college through the Minors, through Asia and South America, to
              the Majors and management, much easier to trace. (I hope this doesn't
              become known as Big Brother Database - BBDB.)

              Data that is not under our control (Retrosheet, MLB.com, etc.) would
              have their own IDs and need to be bridged. But the core product of the
              database will consist of data for players (and managers) of all the
              leagues having their data stored together in common tables - with keys
              available to easily extract certain information (such as a single
              league) for modular (separate tables) redistribution.

              --
              Michael Westbay
              Writer/System Administrator
              http://JapaneseBaseball.com
            • Sean Forman
              Message 6 of 13, Aug 25, 2003
                Michael Westbay wrote:
                > KJOK wrote:
                >
                > >I think seasonal summarized player/team data should be the "core"
                > >product. Career total summarized tables can be a higher
                > >detail "module" of the database. Game data can be a lower
                > >detail "module" of the database. [...]
                > >
                > >[...]
                > >
                > >LEAGUES - US Major Leagues, Major Japanese Leagues, Major Negro
                > >Leagues, US Minor Leagues and US Major College Leagues, in roughly
                > >that order, would seem to be the most logical compilation. Again,
                > >however, maybe Negro, Japanese, Minor leagues, etc. should be
                > >considered as separate modules from the core db.
                > >
                >
                > I get the impression that the word "module" in the first half means
                > "table." Is that also what you propose for Japanese, Minor, Negro, etc.
                > leagues? I envision all common data within the same tables,
                > distributions having the option of extracting data based on league.


                This discussion is a bit off topic, but important.

                I don't think that having a single table with all the batting data from
                the minors, majors, Japan, Negro Leagues, colleges, postseason,
                preseason, etc. is doable. I think it would be much easier to have a
                separate table for each type of batting data, and it would be much
                easier on the maintainers as well.

                There would be one master, but it would have keys for each of the
                modules. Combining these into a single uber-batting table would then
                require just a single INSERT..SELECT query for each module.
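
The merge described above might look like the following sketch, written with Python's built-in sqlite3 module. The cut-down column list, the BattingAll table name, the source column, and the sample rows are assumptions for illustration; only the Batting and BattingJapan table names come from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical per-league modules sharing the same (cut-down) layout.
modules = ("Batting", "BattingJapan")
for table in modules:
    cur.execute(
        f"CREATE TABLE {table} "
        "(playerID TEXT, yearID INTEGER, AB INTEGER, H INTEGER)"
    )

cur.execute("INSERT INTO Batting VALUES ('player01', 2001, 500, 150)")
cur.execute("INSERT INTO BattingJapan VALUES ('player01', 1999, 400, 130)")

# The combined table records which module each row came from.
cur.execute(
    "CREATE TABLE BattingAll "
    "(source TEXT, playerID TEXT, yearID INTEGER, AB INTEGER, H INTEGER)"
)

# One INSERT..SELECT per module, as proposed.
for table in modules:
    cur.execute(
        "INSERT INTO BattingAll (source, playerID, yearID, AB, H) "
        f"SELECT ?, playerID, yearID, AB, H FROM {table}",
        (table,),
    )
conn.commit()

merged_rows = cur.execute(
    "SELECT source, playerID, yearID, AB, H FROM BattingAll ORDER BY yearID"
).fetchall()
print(merged_rows)
```

Keeping a source column in the merged table lets a redistributor filter back down to a single league later.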



                > As Crain-san pointed out, the Master table MUST have IDs from all of
                > the different leagues - not having a master for each league. Having one
                > ID for one person will make tracing a given player's career from college
                > through the Minors, through Asia and South America, to the Majors and
                > management much easier to track. (I hope this doesn't become known as
                > Big Brother Database - BBDB.)


                The idea is that we would have

                MASTER, majorsID, managerID, japaneseID, etc. etc.

                As the data gets matched up we would combine rows within the MASTER table.
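
One way to picture "combining rows" is the sketch below, again using Python's sqlite3 module. The masterID and name columns, the combine_rows helper, and the sample IDs are hypothetical; only MASTER, majorsID, and japaneseID are named in this thread, and the real matching process is not specified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical MASTER layout: one row per person, one ID column per module.
cur.execute(
    "CREATE TABLE MASTER ("
    "masterID INTEGER PRIMARY KEY, name TEXT, "
    "majorsID TEXT, japaneseID TEXT)"
)

# Before matching, the same person appears once per data source.
cur.executemany(
    "INSERT INTO MASTER (masterID, name, majorsID, japaneseID) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "Example Player", "examp01", None),
        (2, "Example Player", None, "jp-0042"),
    ],
)


def combine_rows(cur, keep_id, drop_id):
    """Fold the league IDs of row drop_id into row keep_id, then delete it."""
    cur.execute(
        "UPDATE MASTER SET "
        "majorsID = COALESCE(majorsID, "
        "  (SELECT majorsID FROM MASTER WHERE masterID = :drop)), "
        "japaneseID = COALESCE(japaneseID, "
        "  (SELECT japaneseID FROM MASTER WHERE masterID = :drop)) "
        "WHERE masterID = :keep",
        {"keep": keep_id, "drop": drop_id},
    )
    cur.execute("DELETE FROM MASTER WHERE masterID = :drop", {"drop": drop_id})


# Once a human or a script decides rows 1 and 2 are the same person:
combine_rows(cur, keep_id=1, drop_id=2)
conn.commit()

rows = cur.execute(
    "SELECT masterID, name, majorsID, japaneseID FROM MASTER"
).fetchall()
print(rows)
```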


                > Data that is not under our control (Retrosheet, MLB.com, etc.) would
                > have their own IDs and need to be bridged. But the core product of the
                > database will consist of data for players (and managers) of all the
                > leagues having their data stored together in common tables - with keys
                > available to easily extract certain information (such as a single
                > league) for modular (separate tables) redistribution.

                If we have 10 different types of batting data, I'm not sure that it will
                really be under our control. I find the idea of having tables

                Batting
                BattingJapan
                BattingJapanPost
                BattingPost
                BattingCollege
                BattingCollegeWS
                ....
                BattingBarnstorming
                BattingNegroLg

                a much easier thing to incorporate.

                Sincerely,
                Sean Forman

                Baseball Stats! http://www.Baseball-Reference.com/
                Baseball Analysis! http://www.BaseballPrimer.com/
              • KJOK
                Message 7 of 13, Aug 25, 2003
                  Sean has captured the idea I was trying to convey. When I said
                  "module", I was thinking of a group of separate tables that could
                  still be "linked" or xref'd back to the "main" data tables if
                  that's what someone needed.
                • tmasc@yahoo.com
                  Message 8 of 13, Aug 25, 2003
                    --- Sean Forman <sean-forman@...> wrote:
                    > If we have 10 different types of batting data, I'm
                    > not sure that it will
                    > really be under our control. I find the idea of
                    > having tables
                    >
                    > Batting
                    > BattingJapan
                    > BattingJapanPost
                    > BattingPost
                    > BattingCollege
                    > BattingCollegeWS
                    > ....
                    > BattingBarnstorming
                    > BattingNegroLg
                    >
                    > a much easier thing to incorporate.
                    >

                    It seems to me that what we need is a "DB admin" set of
                    tables, laid out as Sean describes, since some of the
                    data, but not all of it, is under our control.

                    Once the DB admin is ready to put out a "release" for
                    production, he runs his scripts to do his matching,
                    etc., and creates a "DB developer" set of tables in
                    which all these Batting tables are merged into one,
                    along with everything else, so that no table
                    structure is repeated and all data is normalized.

                    From that point the DB developers can take these
                    tables, and extract what they want from them, to
                    give/sell to their customers.

                    Tom

                  • Michael Westbay
                    Message 9 of 13, Aug 25, 2003
                      Sean Forman wrote:

                      >If we have 10 different types of batting data, I'm not sure that it will
                      >really be under our control. I find the idea of having tables
                      >
                      >Batting
                      >BattingJapan
                      >BattingJapanPost
                      >BattingPost
                      >BattingCollege
                      >BattingCollegeWS
                      >....
                      >BattingBarnstorming
                      >BattingNegroLg
                      >
                      >a much easier thing to incorporate.
                      >

                      OK. You're proposing that the raw data for each league be held in a
                      type of BDBMaint set of tables, as I suggested for maintaining the
                      Retrosheet, MLB.com, etc. data. What I would like to see is the above
                      Batting table be a combination of all of the tables and placed into the
                      "core" system (with INSERT..SELECT as you suggest) and we add
                      BattingMLB, BattingAAA, etc. to be maintained like the others.

                      >The idea is that we would have
                      >
                      >MASTER, majorsID, managerID, japaneseID, etc. etc.
                      >

                      This idea worries me. Too many IDs. If possible, [...]. No. This
                      discussion isn't slated until later - mid-October, I think.

                      Back on topic, I think that the scope of data has pretty much been
                      agreed upon. Whether the data is stored in different tables or not is a
                      discussion for later, but I think that we all agree that these are the
                      data sets to include in the core database. Correct?

                      --
                      Michael Westbay
                      Writer/System Administrator
                      http://JapaneseBaseball.com
                    • Paul Wendt
                      Message 10 of 13, Aug 25, 2003
                        On Tue, 26 Aug 2003, Michael Westbay wrote:

                        > Back on topic, I think that the scope of data has pretty much been
                        > agreed upon. Whether the data is stored in different tables or not is
                        > a discussion for later, but I think that we all agree that these are
                        > the data sets to include in the core database. Correct?

                        There seems to be agreement to focus on season-level records.

                        Which season-level records? Besides fielding, pitching, batting, and
                        baserunning for players and W-L for field managers, consider:
                          league - scheduled number of games, opening date, closing date
                          team - earned runs allowed, attendance
                          umpire - number of games worked, postseason work
                          player - salary

                        That is a quick list of examples, not a complete list.

                        P/\/ \/\/t
                        Paul Wendt, Watertown MA, USA <pgw@...>
                      • tmasc@yahoo.com
                        Message 11 of 13, Aug 26, 2003
                          I agree with Michael that we should only be discussing
                          what we want (requirements phase), and leave the
                          design out of this discussion.

                          The design issue is perfectly suited for the DB Design
                          committee, and the whole purpose of that is to handle
                          these kinds of issues without bogging down the whole
                          group and the momentum of other issues. The key is to
                          delegate responsibility after requirements have been
                          specified.

                          Tom

                        • Michael Westbay
                          Message 12 of 13, Aug 27, 2003
                            Current questions on scope:

                            >What data should be included?
                            >
                            >At what level (career, season, game, play-by-play, pitch-by-pitch,
                            >anything),
                            >
                            >from what leagues (majors,
                            >minors, college, foreign, international, etc.)?
                            >
                            >What is the core dB (meaning what appears in the main BDB zip file)?
                            >

                            The first three seem to have a consensus. As for question four, how about:

                            1. create_xxxx.sql for major databases, where xxxx is any of
                               { mysql | postgresql | oracle | msaccess | mssql }. I can help
                               with the first two; we will need volunteers for the others, or
                               for any others nominated.
                            2. CSV data file for each table (to be decided on in October) with a
                               common identifier for NULL.
                            3. sed scripts for converting the BDB CSV files to particular DB
                               import formats. (sed is available for Windows, but I don't know
                               how many people still use it there. I know I did back when I used
                               Win95, but I wasn't a typical Windows user. Would a Windows DB
                               administrator have such a tool? Would another tool be better?)
                            4. Shell scripts (batch files) for creating, loading (importing from
                               common CSV format), and exporting (to common CSV format) for CVS
                               update/checkin.
                            5. README.txt file explaining how to create your own copy of the
                               database on each supported DBMS and how to work with CVS for
                               concurrency, and listing contributors. A short tutorial on how to
                               contribute would also be nice, but perhaps in a different file.
                            6. LICENSE.txt file that states the redistribution license (decided
                               on in a different thread) and copyright information.
                            7. FAQ.txt file that answers a lot of common questions.
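
On items 2 and 3, here is a rough sketch of what a conversion step could look like if it were written in Python rather than sed, which would sidestep the question of whether sed is installed on Windows. The marker "NULL" for the common CSV files is an assumption (the thread leaves the identifier undecided), as are the convert_nulls helper and the sample columns. The target token \N is the one that MySQL's LOAD DATA reads as NULL under its default escape settings; other DBMSs would need a different target.

```python
import csv
import io

BDB_NULL = "NULL"    # hypothetical common NULL marker in the BDB CSV files
TARGET_NULL = r"\N"  # token MySQL's LOAD DATA treats as NULL by default


def convert_nulls(src, dst, src_null=BDB_NULL, dst_null=TARGET_NULL):
    """Copy CSV rows from src to dst, rewriting the NULL marker per field."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [dst_null if field == src_null else field for field in row]
        )


# In-memory stand-ins for a BDB CSV file and the converted import file.
source = io.StringIO(
    "playerID,yearID,salary\n"
    "player01,2001,NULL\n"
    "player02,2001,500000\n"
)
converted = io.StringIO()
convert_nulls(source, converted)
print(converted.getvalue())
```

Parsing with the csv module instead of a line-based substitution means a quoted field that happens to contain the text NULL alongside other characters is left alone.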


                            Then there is a question of scope that wasn't asked: What do we do
                            about requests?

                            For example, we've decided on covering data from professional and
                            college leagues, yearly and by stint for each player, getting as
                            detailed as Batter vs. R/L Pitcher. What do we do when some researcher
                            asks how Bonds has hit against the top 10 pitchers for each year in the
                            National League over the past 10 years? Do we send them to the
                            Retrosheet group - since that's where we would probably be getting that
                            information? Do we have an administrator make the query and include it
                            in the next CVS update and future releases?

                            If we had scripts or applications that could crank such information out
                            of the Retrosheet data, I think it'd be fine. But I believe that the
                            purpose of the BDB was to create a database for redistribution. Is it
                            BDB's job to fulfill such requests (such as for distribution X who would
                            like that much detail)? Or is it distribution X's job to generate such
                            data with the hope that they'll donate it back to BDB? Data requests
                            might make a good policy Q & A in the FAQ, even before it becomes
                            frequently asked.

                            I think that that about covers it for scope. Any comments? Additions?
                            Subtractions?

                            --
                            Michael Westbay
                            Writer/System Administrator
                            http://JapaneseBaseball.com
                          • tmasc@yahoo.com
                            Message 13 of 13, Aug 27, 2003
                              --- Michael Westbay <westbaystars@...> wrote:
                              > detailed as Batter vs. R/L Pitcher. What do we do
                              > when some researcher
                              > asks how Bonds has hit against the top 10 pitchers

                              I like the Retrosheet policy: we are librarians.

                              Therefore, the BDB admin group will provide only the
                              raw data for all tables.

                              You'd have some extra things like BDB admin scripts
                              where you might do some of the things you say, but it
                              would not be part of the core requirements.

                              And you'd maybe have volunteers offer BDB developer
                              scripts to extract data in certain formats, etc.

                              I think this makes the most sense in focusing the BDB
                              group to collect data, and the BDB admin group to
                              sort/merge it in a normalized form.

                              Tom
