Data Additions

  • truane@vnet.ibm.com
    Message 1 of 7 , Apr 27, 2004
      Thanks to all who have responded to my notes on what might be useful
      additions to the baseball-databank effort.

      This note is an attempt to summarize what is needed.

      For the seasons for which we have event file coverage, I think we'd
      like the following:

      batting:

      id,year,team,roe

      pitching:

      id,year,team,2b,3b,roe

      Note: I'm not sure what to do about the information for players
      appearing in games for which we are missing play-by-play files. These
      include a few games in the 1963 AL as well as several in the NL
      from 1971 to 1973. Do you want to skip this information for those
      league-seasons? Do you want to skip this information for only
      those players participating in those missing games, or do you want
      to include all players with an indication of the players for whom
      this data is incomplete?

      I also could not provide this information from 1993 to the present.

      For pitchers who both started and relieved in the same season,
      I think we'd also like a complete line of relief stats.

      fielding:

      id,year,team,pos,starts,innings,sb,cs,pb,pickoffs,xi

      sb,cs,pb,pickoffs,xi would only be included for C and P

      Again, I have a similar question about the 1963 AL and 1971-73 NL.
      And, apart from starts, I don't think I can provide this data from
      1993 to the present.

      A few other notes:

      Sean has written:

      > We could also use separate LF/CF/RF data for any years you can
      > provide it.

      And later wrote:

      > I'm referring to the errors, PO, A, Innings data for pre-2000 years. I
      > think you did get the games played for pre-retro years from me.

      One problem with providing this (and it's been a small source of
      confusion for some visitors to Retrosheet's site) is that this data is
      decidedly unofficial and, if provided, will not sum to the official
      OF data. So if you take the RF,CF,LF errors, PO and A data and add
      it together, it will bear little relation to the official error, PO
      and A data for OF.

      He also wrote:

      > A listing of all triple plays you have

      And later wrote:

      > Well, if you could just provide a list of year, team, playerID, TP, then
      > we ought to be able to fold that into the fielding table. I also think
      > a listing of the triple plays in some way would be fun to have in the
      > DB. Date, teams, situation, scoring, players involved.

      I think you ought to contact Steve Boren and see if he would donate
      his list. While I could send you a list of the ones we have from
      the 1960s to 1992, he has a DB containing details on every TP ever
      made.

      Kevin Johnson wrote:

      > Also, batting and pitching data by BALLPARK by year, broken down by
      > home team and away team, and by LH and RH hitters, would be awesome.
      > I already have this data for most "old" retrosheet years, but with
      > the demise of A.S.S. I don't have it for 2003 or for the recently
      > released Retrosheet seasons.

      I could not provide LH/RH data for 2003, but could provide the rest
      of the data for 2003 as well as for the other missing years. What
      league/seasons are you missing?

      Sorry for the delay in getting back to the list about this and, as
      always, thanks for your patience.
      Tom Ruane
    • tangotiger
      Message 2 of 7 , Apr 27, 2004
        --- In baseball-databank@yahoogroups.com, truane@v... wrote:
        > Note: I'm not sure what to do about the information for players
        > appearing in games for which we are missing play-by-play files. These

        Hmmm... this would be a pickle. This would mean for a given season:
        a) some players will have complete records for all fields
        b) some players will have complete records for some fields, and
        incomplete for others

        If you have a flag for each player, the user then has to know (or be
        made aware by some readme file) which fields are incomplete for the
        flagged players.

        I think this data could be supplied as some "add on" csv file, but
        not incorporated into the core files. (This would only affect those
        years that Tom mentioned.)
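A minimal sketch of the "add on" csv idea, assuming the add-on file carries the new column plus a per-player incomplete flag. The column names (`roe`, `incomplete`), the `hr` core column, and the player IDs are hypothetical, chosen only for illustration; nothing here is an agreed BDB layout.

```python
import csv
import io

# Hypothetical core rows and a separate add-on file. All names and values
# below are made up for illustration.
CORE_CSV = """id,year,team,hr
playa01,1963,NYA,10
playb01,1963,BOS,5
"""

ADDON_CSV = """id,year,team,roe,incomplete
playa01,1963,NYA,4,0
playb01,1963,BOS,2,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def row_key(row):
    """Rows are matched on player id, year and team."""
    return (row["id"], row["year"], row["team"])


def attach_addon(core_rows, addon_rows):
    """Left-join add-on fields onto core rows; the core rows are not modified."""
    addon_by_key = {row_key(r): r for r in addon_rows}
    merged = []
    for row in core_rows:
        extra = addon_by_key.get(row_key(row), {})
        merged.append({
            **row,
            "roe": extra.get("roe"),  # None when no add-on row exists
            "incomplete": extra.get("incomplete") == "1",
        })
    return merged


merged = attach_addon(read_rows(CORE_CSV), read_rows(ADDON_CSV))
```

Keeping the join outside the core files means a user who never loads the add-on never sees the partially complete columns.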

        > So if you take the RF,CF,LF errors, PO and A data and add
        > it together, it will bear little relation to the official error, PO
        > and A data for OF.

        I can see this is true for "G", but not the case for GS, Inn, PO, A,
        E. Those fields would be summable from RF/CF/LF into OF. Am I
        misreading you here?


        > Sorry for the delay in getting back to the list about this and, as
        > always, thanks for your patience.

        Anytime someone connected with Retrosheet makes an offer, it's like
        Santa has come to town. I'm looking forward to these additions to
        the BDB.

        Tom
      • KJOK
        Message 3 of 7 , Apr 28, 2004
          --- In baseball-databank@yahoogroups.com, truane@v... wrote:
          > Kevin Johnson wrote:
          >
          > > Also, batting and pitching data by BALLPARK by year, broken down by
          > > home team and away team, and by LH and RH hitters, would be awesome.
          > > I already have this data for most "old" retrosheet years, but with
          > > the demise of A.S.S. I don't have it for 2003 or for the recently
          > > released Retrosheet seasons.
          >
          > I could not provide LH/RH data for 2003, but could provide the rest
          > of the data for 2003 as well as for the other missing years. What
          > league/seasons are you missing?
          >
          > Sorry for the delay in getting back to the list about this and, as
          > always, thanks for your patience.
          > Tom Ruane

          The years we currently have LH/RH, Home/Away batting stats are both
          leagues 1969, 1972-1992, 1999-2002, and AL only 1963, 1967-1968.

          THANKS,
          Kevin
        • tjruane
          Message 4 of 7 , May 1, 2004
            A few days ago, I wrote:

            > So if you take the RF,CF,LF errors, PO and A data and add
            > it together, it will bear little relation to the official error, PO
            > and A data for OF.

            And Tom replied:

            > I can see this is true for "G", but not the case for GS, Inn,
            > PO, A, E. Those fields would be summable from RF/CF/LF into OF.
            > Am I misreading you here?

            I think so. I am assuming you are going to continue using the
            official fielding data for outfielders. I would certainly
            recommend that since in many instances there is no way of telling
            which account of a play is correct (and the benefit of the doubt
            has to go to the official account in these circumstances). As
            a result, Retrosheet's RF/CF/LF data will certainly sum to what
            we would present as OF statistics, but this sum will bear little
            relationship to the official OF data.
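To make the distinction concrete, here is a small sketch that sums per-position outfield lines into a derived OF line and reports where it departs from an official OF line. The field names and all of the numbers are made up for illustration; they are not taken from Retrosheet or the official records.

```python
from collections import Counter

FIELDS = ("po", "a", "e")
OUTFIELD_POSITIONS = ("LF", "CF", "RF")


def sum_outfield(position_lines):
    """Add up PO, A and E across the LF, CF and RF lines for one player-season."""
    total = Counter()
    for line in position_lines:
        if line["pos"] in OUTFIELD_POSITIONS:
            for field in FIELDS:
                total[field] += line[field]
    return {field: total[field] for field in FIELDS}


def differences(derived, official):
    """Return derived minus official for each field where the two disagree."""
    return {
        field: derived[field] - official[field]
        for field in FIELDS
        if derived[field] != official[field]
    }


# Hypothetical per-position lines derived from event files.
lines = [
    {"pos": "LF", "po": 100, "a": 4, "e": 2},
    {"pos": "CF", "po": 50, "a": 1, "e": 0},
    {"pos": "RF", "po": 20, "a": 2, "e": 1},
]
# Hypothetical official OF line for the same player-season.
official_of = {"po": 172, "a": 7, "e": 3}

derived_of = sum_outfield(lines)
diffs = differences(derived_of, official_of)
```

The derived line always equals the sum of its parts by construction; the open question is only how far `diffs` strays from empty.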

            Tom Ruane
          • tmasc@yahoo.com
            Message 5 of 7 , May 1, 2004
              --- tjruane <truane@...> wrote:
              > As a result, Retrosheet's RF/CF/LF data will certainly sum to
              > what we would present as OF statistics, but this sum will
              > bear little relationship to the official OF data.
              >
              > Tom Ruane


              Hmmm... this would imply then that any nonofficial
              breakdown that we present, say splitting data between
              starts and reliefs, or vs LH/RH (can't think of a good
              example), etc would fall under a similar category.

              Therefore, it would be necessary that we continue to
              carry redundant data, because the official data is our
              true checkpoint. Interesting...

              Tom






            • tjruane
              Message 6 of 7 , May 1, 2004
                I wrote:

                > As a result, Retrosheet's RF/CF/LF data will certainly sum to
                > what we would present as OF statistics, but this sum will
                > bear little relationship to the official OF data.

                And Tom replied:

                > Hmmm... this would imply then that any nonofficial
                > breakdown that we present, say splitting data between
                > starts and reliefs, or vs LH/RH (can't think of a good
                > example), etc would fall under a similar category.

                Not at all. There's a light year of difference between fielding
                statistics and the batting/pitching stats. For batting and
                pitching, there is close to 100% agreement between our data and
                the official data in the categories people typically care about.
                Simply put, the official scorers are careful when dealing with
                homers and hits and runs but pay much less attention to such
                things as which fielder actually caught the ball.

                We typically have a handful of discrepancies between the
                Retrosheet data and official data for batting and pitching each
                season. Almost all of these deal with things like batter
                strikeouts, intentional walks and so on. Even in these cases,
                I'm confident that the Retrosheet data is far more accurate than the
                official data. When you derive your statistics from the event
                files, for example, you can't put a strikeout in the caught
                stealing column, or give a GIDP to the wrong batter. Every year
                I proof I run into a few cases where a batter has an impossible
                stat line (1-2 with a strikeout and a GIDP, for example).
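The "impossible stat line" idea can be sketched as a simple consistency check on a single-game batting line. It assumes that a strikeout and a GIDP are each charged as a hitless at-bat and that one at-bat cannot be both; the function name and arguments are hypothetical.

```python
def impossible_line(ab, h, so, gidp):
    """Return True when a single-game batting line cannot have happened.

    Assumption: strikeouts and GIDPs each use up one hitless at-bat, so
    together they cannot exceed the at-bats that did not end in a hit.
    """
    if h > ab:
        return True
    hitless_at_bats = ab - h
    return so + gidp > hitless_at_bats


# The example from the post: 1-for-2 with a strikeout and a GIDP.
flagged = impossible_line(ab=2, h=1, so=1, gidp=1)

# A line that passes the check: 1-for-4 with a strikeout and a GIDP.
plausible = impossible_line(ab=4, h=1, so=1, gidp=1)
```

A check like this only catches lines that are internally inconsistent; a strikeout credited to the wrong batter can still produce two perfectly plausible lines.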

                Such is not the case for defensive statistics. Instead of dozens
                of discrepancies as we have with batting and pitching stats, we
                typically have a hundred or more fielding discrepancies--and
                that's only dealing with defensive games played, errors and
                passed balls. What's more, while I'm confident that the
                overwhelming majority of the games played discrepancies are
                official errors (it's not uncommon for there to be two official
                second basemen in a game, for example, and no shortstop), I think
                that many of the discrepancies in errors and passed balls are
                problems with our data. It is not that uncommon for a person
                scoring a game to forget to mark down a passed ball or a dropped
                foul ball, especially if the miscue did not result in a run.

                As for putouts and assists, I suspect that there are both
                official errors and event file errors in just about every game,
                especially when you get back into the 1970s and 1960s. When I
                proofed the 1963 AL (the only time I even attempted to reconcile
                this data), it was extremely rare to find two scoresheets in
                complete agreement about what happened. When there was agreement,
                it often seemed as if the official scorer was watching a different
                game entirely. One typical mistake is for scorers (both ours and
                the official ones) to confuse the numbers for the right and left
                fielder in a game.

                > Therefore, it would be necessary that we continue to
                > carry redundant data, because the official data is our
                > true checkpoint. Interesting...

                I wasn't under the impression that you were REPLACING any data
                with the stuff that I provide. Rather, I thought I was adding
                data that you didn't previously have. And just keep in mind
                that while official data may be your true checkpoint, that data
                is far from correct. About the best you could say for it is
                that it is official.

                Sorry for the length of this note.
                Tom Ruane
              • Mike Emeigh
                Message 7 of 7 , May 1, 2004
                  Tom Ruane wrote:
                  (snip)
                  >
                  > And just keep in mind
                  > that while official data may be your true checkpoint, that data
                  > is far from correct.

                  and that is *especially* true with respect to fielding data. Fielding data
                  is especially error-prone, because no one bothered to proof it at the end of
                  the season. Even into the '40s and '50s you find examples of team putout and
                  assist totals that do not match the official totals (although they usually
                  manage to get the error totals correct), and there is absolutely no
                  assurance that defensive replacements who didn't bat were properly recorded
                  as having played.
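One way to surface the team-total problem is to sum the individual fielding lines per team-season and compare the result with the published team totals. This is only a sketch: the row layout, the team codes and every number below are hypothetical.

```python
from collections import defaultdict


def team_sums(fielding_rows):
    """Sum individual putouts and assists for each (year, team)."""
    sums = defaultdict(lambda: {"po": 0, "a": 0})
    for row in fielding_rows:
        totals = sums[(row["year"], row["team"])]
        totals["po"] += row["po"]
        totals["a"] += row["a"]
    return dict(sums)


def mismatches(fielding_rows, official_team_totals):
    """Return team-seasons whose summed individual lines differ from the official totals."""
    sums = team_sums(fielding_rows)
    result = {}
    for key, official in official_team_totals.items():
        summed = sums.get(key, {"po": 0, "a": 0})
        if summed != official:
            result[key] = {"summed": summed, "official": official}
    return result


# Hypothetical individual lines and official team totals.
rows = [
    {"year": 1948, "team": "AAA", "po": 300, "a": 20},
    {"year": 1948, "team": "AAA", "po": 250, "a": 310},
    {"year": 1948, "team": "BBB", "po": 400, "a": 150},
]
official = {
    (1948, "AAA"): {"po": 550, "a": 331},
    (1948, "BBB"): {"po": 400, "a": 150},
}

problems = mismatches(rows, official)
```

A mismatch does not say which side is wrong, only that the team-season deserves a look against the box scores.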

                  With the availability of online archives like ProQuest and Paper of Record,
                  we have a chance of addressing some of these problems, at least for the
                  period when putouts and assists were being recorded in the box scores
                  (keeping in mind that newspaper box scores weren't always official, either).

                  Mike Emeigh
                  piratefan1@...