Loading ...
Sorry, an error occurred while loading the content.

Parsing & massaging data files...

Expand Messages
  • Bill Harriff
    I need to parse a plain text data file that has no separators except for the layout of the data. Here s how each record is laid out IF all fields have entries:
    Message 1 of 1 , Apr 20, 2006
    • 0 Attachment
      I need to parse a plain text data file that has no separators except
      for the layout of the data. Here's how each record is laid out IF all
      fields have entries:
      Line 1, chars 1-12 (last name); chars 13-21 (first name);
      chars 22-26 (mid initial); chars 27-48 (address 1); chars 49-62
      (city); chars 63-64 (State)<return>
      Line 2, chars 1-67 (empty spaces); chars 68-74 (zipcode); chars
      75-87 (phone #); chars 88-97 (dob in mm/dd/yyyy format)<return>
      Line 3, chars 1-12 (SSN).

      If data is missing then the <returns> may occur at different,
      inconsistent points and the line length may be a different
      length that a fully populated record (e.g., see the 1st record
      below for John Sample that is missing 7 or the 10 fields). I do get
      3 lines for each record even if fields are missing.


      Here are 5 sample records with different fields missing from 3 and
      complete data for 2. This is the before view (be sure to turn off word
      wrap):

      SAMPLE JOHN

      (503)321-7654

      BALLSTON ANN 135 SAINT ST. TIGARD OR
      97111
      (503)666-1111111/21/1950
      111-22-3333
      Smith JOHN A 1245 E. Smith DR. Portland OR
      97333
      (503)987-654305/09/1960
      999-88-7777
      JOHNSON PAUL E 375 VANCOUVER ST #3 PORTLAND OR
      97222
      07/05/1983
      666-55-1212
      BUSH HERATIO 747 SKYWAY LANE VANCOUVER WA
      97666
      (701)410-1235
      345-67-8900

      Here is what I want to get it to (notice the substitutions for missing
      data):

      "LastName","FirstName","MI","Address 1","City","ST","Zipcode","Phone
      #","DOB","SSN"
      "SAMPLE","JOHN","z","unknown","unknown","XX","00000","(503)321-7654","00/00/1900","000-00-0000"
      "BALLSTON","ANN","z","135 SAINT
      ST.","TIGARD","OR","97111","(503)666-11111","11/21/1950","111-22-3333"
      "Smith","JOHN","A","1245 E. Smith
      DR.","Portland","OR","97333","(503)987-6543","05/09/1960","999-88-7777"
      "JOHNSON","PAUL","E","375 VANCOUVER ST
      #3","PORTLAND","OR","97222","(000)000-0000","07/05/1983","666-55-1212"
      "BUSH","HERATIO","z","747 SKYWAY
      LANE","VANCOUVER","WA","97666","(701)410-1235","00/00/1900","345-67-8912"

      The office mgr wants dummy values in the field because the new db vendor
      says it's easier to do reports for incorrect/missing values if there are
      standard dummy place holders, hence...
      - "z" for a missing MI
      - "97000" for a missing zipcode
      - "(000)000-0000" for a missing phone #
      - "00/00/1900" for a missing dob
      - "000-00-0000" for a missing ssn

      Note that there are records with 4 different states (OR, WA, CA, ID) and
      some records don't include the state.

      There are 2000 + records & so far I can't get a very clean conversion,
      so I thought I'd ask about using regex to count chars and/or line
      length as well as looking for patterns/missing patterns to determine
      where to put the quotes & delimiters. I did a 500+ list of vendors
      earlier this week & it wound up being a bad mix of auto & manual
      'search & replace' and then manual reviewing of the field count,
      so what should have been a 20 minute job took over 3 hrs.

      I'd like to avoid that this time so I thought that maybe this was a good
      time to tackle regex with some help, of course. Thanks for any help you
      can provide.

      Bill Harriff
    Your message has been successfully submitted and would be delivered to recipients shortly.