4461 Re: csv? Re: [baseball-databank] Proposed changes to master table

  • F. X. Flinn
    Dec 15, 2013
      CSV output with text delimiters is the gold standard. It allows fields that contain commas and quotes to be captured as single fields.

      Consider a simple example:

      RowIDInteger, Description
      1001,The rain in Spain
      1002,Falls Mainly on the Plain
      1003,Except in Gibraltar, where it falls mainly in the sea.
      1004,Just sayin'

      The comma after Gibraltar will split the description in row 1003 into two fields, so that row has three fields instead of two, which messes up the import.

      But with double quotes as text delimiters in addition to the comma as a field delimiter the problem is avoided:

      1001,"The rain in Spain"
      1002,"Falls Mainly on the Plain"
      1003,"Except in Gibraltar, where it falls mainly in the sea."
      1004,"Just sayin'"
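
A minimal sketch of the difference, assuming Python's standard csv module as the importer (the thread does not name a specific tool, and the helper name parse_line is made up for illustration). It parses the Gibraltar row both ways and prints the number of fields each version yields:

```python
import csv
import io

# The Gibraltar row from the example, without and with text delimiters.
unquoted = "1003,Except in Gibraltar, where it falls mainly in the sea.\n"
quoted = '1003,"Except in Gibraltar, where it falls mainly in the sea."\n'


def parse_line(text):
    """Parse one line of CSV text and return its list of fields."""
    return next(csv.reader(io.StringIO(text)))


broken = parse_line(unquoted)  # the embedded comma is read as a field separator
intact = parse_line(quoted)    # the quotes keep the description in one field

print(len(broken), broken)
print(len(intact), intact)
```

The unquoted line comes back as three fields and the quoted line as two, which matches the two-column header.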

      The same thing happens with tab-delimited files when text fields contain tabs, so that isn't a perfect solution either.
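
As a rough illustration of why quoting helps with either delimiter, here is a sketch that writes and re-reads the same rows as comma-delimited and as tab-delimited text, again assuming Python's csv module. Row 1005 and the helper round_trip are hypothetical, added only to show a field containing a tab and a double quote. csv.QUOTE_NONNUMERIC tells the writer to wrap every non-numeric field in double quotes, and tells the reader to convert unquoted fields to floats:

```python
import csv
import io

rows = [
    [1003, "Except in Gibraltar, where it falls mainly in the sea."],
    # Hypothetical row: a text field holding a tab and a double quote.
    [1005, 'A field with a\ttab and a "quote"'],
]


def round_trip(rows, delimiter):
    """Write rows with every text field quoted, then read them back."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(rows)
    text = buf.getvalue()
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONNUMERIC
    )
    return text, list(reader)


csv_text, csv_rows = round_trip(rows, ",")
tsv_text, tsv_rows = round_trip(rows, "\t")

print(csv_text)
print(csv_rows == rows, tsv_rows == rows)
```

In both cases the rows read back equal the rows written, because the delimiter inside a quoted field is treated as data and an embedded double quote is written doubled.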

      F. X. Flinn
      FXFlinn@gmail | c:802-369-0069

      On Sun, Dec 15, 2013 at 4:48 PM, Sean Lahman <seanlahman@...> wrote:

      On Sun, Dec 15, 2013 at 4:05 PM, F. X. Flinn <fxflinn@...> wrote:


      It should be possible to supply a tab-delimited version or, even better, a strictly compliant csv file with " surrounding text fields.

      Either of these things is possible, of course.  I'll confess that I'm not aware of a csv standard that requires all text fields to be enclosed.  Are there use cases where that creates a problem?  What about csv rather than tsv?

      I want to be accommodating, but I don't want to create multiple variations of file types unless it's necessary. The overriding goal of this project has been to make the data available in the most portable open source format that's available, which people could import/convert for use with any DBMS (MySQL, Oracle, dBase, SAP, etc.) or access using the most popular programming languages (python, php, perl, ruby, R, etc.)

      I have always made the database available in Access format because, generally, the folks who work in Access would struggle with building it themselves.  For more than a dozen years this was the most downloaded format, and it still gets more interest than most of you might expect.  I have made an SQL version available in recent years because it has been so frequently requested, although the majority of people I talk to that work in MySQL or PostgreSQL prefer to import the CSV files.

      But my preference would be to provide a single version that's extremely portable rather than providing downloads in ten different varieties.

      What say you all?  Are there applications where the current csv formats are problematic?


      Sean Lahman
