Loading ...
Sorry, an error occurred while loading the content.

want to extract fixed length fields -- but occurring in different orders

Expand Messages
  • Don - HtmlFixIt.com
    I have a number of sets of data. They all contain the same data, but in random orders or of different lengths BETWEEN files, but consistent in one file. I am
    Message 1 of 3 , Jun 27, 2009
    • 0 Attachment
      I have a number of sets of data. They all contain the same data, but in
      random orders or of different lengths BETWEEN files, but consistent in
      one file.

      I am trying to adapt a former clip I used.

      I know that explanation of the problem is sometimes part of the problem.

      I could do this in excel. You open excel, import the data of fixed
      length, it draws lines where it thinks they go (it would be close on
      most of these). Of course that takes longer than simply opening the
      file in notetab and clicking my (soon to be) functioning clip ... in
      addition excel plays with data -- specifically my data includes times
      xx:yy.zz and excel tries to treat it as dates unless I specifically
      alter it to text (or custom format). In addition I need column headings
      and because those can appear in any order potentially, I need to do
      those manually. I then need to delete any extra columns created
      (sometimes there is other data I don't want).

      So in pseudo code here is what I see:
      1. open file ... each line has data I want in some order:
      aaa__bbbbbbbbbb___ccccccccccc_ddddd__xxxx_mm:ss.th_xxx
      Each line is the same in terms of length. In the above example I want
      to capture the a's, b's, c's and the time mm:ss.th. I do not need to
      capture the x's. These are running results, so the a's are the place in
      the run. B's are the persons name, c's are the persons club or school, d
      is there level or grade, x's I don't care about and mm:ss.th is their time.

      the next file however may be thus:
      bbbbbbbbbb___ccccccccccc_(ddddd)__mm:ss.th___xxxx__aaa
      for example. The format can vary but the information is always the same.

      2. go through a process of identifying what data is in what columns -- I
      have a current clip I use to do this with a set of this data that
      appears in a set order ... I highlight and copy the columns and then use
      a replace to get rid of extra stuff.
      ^!Info [C]Copy Overall Place Column and Hit Control C
      ^!SetClipboard

      :LoopPlace
      ^!Delay 5
      ^!If "^$GetClipboardSize$" = "0" LoopPlace
      ^!Set %PlaceCount%=^$GetClipboardSize$

      repeated for each element

      Then I get the data to delimited format thus:
      ^!Replace
      "^.{^%PlaceCount%}(.{^%NameCount%})(.{^%SchoolCount%}).{^%GradeCount%}(.{^%TimeCount%}).*"
      >> "$1\t^%D%\t$2\t$3" RAWS

      I am using the entire string in this scenario because I then remove
      extra trailing spaces in the next step and there were not extra spaces
      except at the end.

      3. Now is where I deviate: I have captured the starting spot and length
      of every field I need ... but now I need to get them in the right order
      ... so somehow I need to sort them, capture them and output them ....

      Anybody still with me? I know I probably didn't explain it all that well.
    • Don - HtmlFixIt.com
      In my old example I think I tossed some of that data that I now wish to keep ... so the replace might be a little off ... I was just trying to give the flavor.
      Message 2 of 3 , Jun 27, 2009
      • 0 Attachment
        In my old example I think I tossed some of that data that I now wish to
        keep ... so the replace might be a little off ... I was just trying to
        give the flavor.

        The only other thing to make clear is that in a given set of data the
        lines are the same length, but from data set to data set they may vary.

        I am capturing the start point of each element correctly and the length
        of each element correctly, so now if I can build a regex on the fly, I'm
        golden, but to build the "variable" regex, I need to sort things out it
        seems and then know which element is which.

        Don - HtmlFixIt.com wrote:
        > I have a number of sets of data. They all contain the same data, but in
        > random orders or of different lengths BETWEEN files, but consistent in
        > one file.
        >
        > I am trying to adapt a former clip I used.
        >
        > I know that explanation of the problem is sometimes part of the problem.
        >
        > I could do this in excel. You open excel, import the data of fixed
        > length, it draws lines where it thinks they go (it would be close on
        > most of these). Of course that takes longer than simply opening the
        > file in notetab and clicking my (soon to be) functioning clip ... in
        > addition excel plays with data -- specifically my data includes times
        > xx:yy.zz and excel tries to treat it as dates unless I specifically
        > alter it to text (or custom format). In addition I need column headings
        > and because those can appear in any order potentially, I need to do
        > those manually. I then need to delete any extra columns created
        > (sometimes there is other data I don't want).
        >
        > So in pseudo code here is what I see:
        > 1. open file ... each line has data I want in some order:
        > aaa__bbbbbbbbbb___ccccccccccc_ddddd__xxxx_mm:ss.th_xxx
        > Each line is the same in terms of length. In the above example I want
        > to capture the a's, b's, c's and the time mm:ss.th. I do not need to
        > capture the x's. These are running results, so the a's are the place in
        > the run. B's are the persons name, c's are the persons club or school, d
        > is there level or grade, x's I don't care about and mm:ss.th is their time.
        >
        > the next file however may be thus:
        > bbbbbbbbbb___ccccccccccc_(ddddd)__mm:ss.th___xxxx__aaa
        > for example. The format can vary but the information is always the same.
        >
        > 2. go through a process of identifying what data is in what columns -- I
        > have a current clip I use to do this with a set of this data that
        > appears in a set order ... I highlight and copy the columns and then use
        > a replace to get rid of extra stuff.
        > ^!Info [C]Copy Overall Place Column and Hit Control C
        > ^!SetClipboard
        >
        > :LoopPlace
        > ^!Delay 5
        > ^!If "^$GetClipboardSize$" = "0" LoopPlace
        > ^!Set %PlaceCount%=^$GetClipboardSize$
        >
        > repeated for each element
        >
        > Then I get the data to delimited format thus:
        > ^!Replace
        > "^.{^%PlaceCount%}(.{^%NameCount%})(.{^%SchoolCount%}).{^%GradeCount%}(.{^%TimeCount%}).*"
        > >> "$1\t^%D%\t$2\t$3" RAWS
        >
        > I am using the entire string in this scenario because I then remove
        > extra trailing spaces in the next step and there were not extra spaces
        > except at the end.
        >
        > 3. Now is where I deviate: I have captured the starting spot and length
        > of every field I need ... but now I need to get them in the right order
        > ... so somehow I need to sort them, capture them and output them ....
        >
        > Anybody still with me? I know I probably didn't explain it all that well.
        >
        >
        > ------------------------------------
        >
        > Fookes Software: http://www.fookes.com/
        > NoteTab website: http://www.notetab.com/
        > NoteTab Discussion Lists: http://www.notetab.com/groups.php
        >
        > ***
        > Yahoo! Groups Links
        >
        >
        >
        >
      • Don - HtmlFixIt.com
        Here is where I am at so far: Dataset looks like this for example one -- 1 Penning, Alyssa 11 Christian Hi 18:51.63 2 Anderson, Jennifer
        Message 3 of 3 , Jun 27, 2009
        • 0 Attachment
          Here is where I am at so far:

          Dataset looks like this for example one --
          1 Penning, Alyssa 11 Christian Hi 18:51.63
          2 Anderson, Jennifer 12 Christian Hi 19:41.88

          These are fixed length lines even if email doesn't show them that way!
          I must have a variable width font or something ...

          Here is my clip:
          ^!Set %D%=^?{Race Division=1|2|_3|4|5|6|7|8|9|10}

          ^!Info [C]Copy Place Columns and Hit Control C
          ^!SetClipboard

          :LoopPlace
          ^!Delay 5
          ^!If "^$GetClipboardSize$" = "0" LoopPlace
          ^!Set %PlaceCount%=^$GetClipboardSize$
          ^!Jump Select_Start
          ^!Set %PlaceStart%="^$GetCol$"
          ^!Info [C] from ^%PlaceStart% to ^%PlaceCount%


          ^!Info [C]Copy Name Columns and Hit Control C
          ^!SetClipboard

          :LoopName
          ^!Delay 5
          ^!If "^$GetClipboardSize$" = "0" LoopName
          ^!Set %NameCount%=^$GetClipboardSize$
          ^!Jump Select_Start
          ^!Set %NameStart%="^$GetCol$"
          ^!Info [C] from ^%NameStart% for ^%NameCount%


          ^!Info [C]Copy School Columns and Hit Control C
          ^!SetClipboard

          :LoopSchool
          ^!Delay 5
          ^!If "^$GetClipboardSize$" = "0" LoopSchool
          ^!Set %SchoolCount%=^$GetClipboardSize$
          ^!Jump Select_Start
          ^!Set %SchoolStart%="^$GetCol$"
          ^!Info [C] from ^%SchoolStart% for ^%SchoolCount%


          ^!Info [C]Copy Grade Columns and Hit Control C
          ^!SetClipboard

          :LoopGrade
          ^!Delay 5
          ^!If "^$GetClipboardSize$" = "0" LoopGrade
          ^!Set %GradeCount%=^$GetClipboardSize$
          ^!Jump Select_Start
          ^!Set %GradeStart%="^$GetCol$"
          ^!Info [C] from ^%GradeStart% for ^%GradeCount%


          ^!Info [C]Copy Time Columns and Hit Control C
          ^!SetClipboard

          :LoopTime
          ^!Delay 5
          ^!If "^$GetClipboardSize$" = "0" LoopTime
          ^!Set %TimeCount%=^$GetClipboardSize$
          ^!Jump Select_Start
          ^!Set %TimeStart%="^$GetCol$"
          ^!Info [C] from ^%TimeStart% for ^%TimeCount%

          :BuildColumns
          ^!Info [C] Place starts at ^%PlaceStart% and runs for ^%PlaceCount%
          characters^%NL%Name starts at ^%NameStart% and runs for ^%NameCount%
          spaces^%NL%School starts at ^%SchoolStart% and runs for ^%SchoolCount%
          spaces^%NL%Grade starts at ^%GradeStart% and runs for ^%GradeCount%
          spaces^%NL%Time starts at ^%TimeStart% and runs for ^%TimeCount% spaces


          ^!Goto end

          The output is now thus:

          Place starts at 1 and runs for 3 characters
          Name starts at 5 and runs for 26 spaces
          School starts at 31 and runs for 2 spaces
          Grade starts at 34 and runs for 22 spaces
          Time starts at 56 and runs for 8 spaces

          So I think I want to build this regex on the fly ...

          ^!Replace "^(.{3}).{1}(.{26})(.{2}).{1}(.{22})(.{8}).*" >>
          "^%D%\t$1\t$2\t$3\t$4\t$5" RAWS

          and I strip all trailing and leading spaces and get this output.
          ^!Replace "\t +" >> "\t" RAWS
          ^!Replace " +\t" >> "\t" RAWS

          and I get what I want:
          3 1 Penning, Alyssa 11 Christian Hi 18:51.63
          3 2 Anderson, Jennifer 12 Christian Hi 19:41.88

          In the above regex ... you will see that the 3 in the first element is
          the number of Place characters ... then we need to skip one because 1,
          2, and 3 are place, but Name doesn't start until 4 and then runs 26 spaces.

          So how do I order the fields (they can appear in any order in a given
          set of data ... and how do I build the regex on the fly to react to that?

          My next data set could look like this:
          18:51.63 Penning, Alyssa 11 Christian High 1
          19:41.88 Anderson, Jennifer 12 Christian High 2


          Place starts at 56 and runs for 3 characters
          Name starts at 11 and runs for 27 spaces
          School starts at 41 and runs for 15 spaces
          Grade starts at 38 and runs for 2 spaces
          Time starts at 1 and runs for 8 spaces


          Don - HtmlFixIt.com wrote:
          > In my old example I think I tossed some of that data that I now wish to
          > keep ... so the replace might be a little off ... I was just trying to
          > give the flavor.
          >
          > The only other thing to make clear is that in a given set of data the
          > lines are the same length, but from data set to data set they may vary.
          >
          > I am capturing the start point of each element correctly and the length
          > of each element correctly, so now if I can build a regex on the fly, I'm
          > golden, but to build the "variable" regex, I need to sort things out it
          > seems and then know which element is which.
          >
        Your message has been successfully submitted and would be delivered to recipients shortly.