
RE: [Clip] Need to extract surnames; The DOM vs reg exp

  • Piotr Bienkowski
    Message 1 of 30 , Feb 1, 2001
      Wondered why my message popped up with a few days' delay... :(

      On 25 Jan 2001, at 11:33, Piotr Bienkowski wrote:

      > On 18 Jan 2001, at 11:19, Grant wrote:
      >
      > > If you want to take a look, I posted a DOM way to do this on the
      > > NoteTab html list, subject: "Extracting table data with the DOM". I
      > > have heard 'reg exp' is losing favour because of the power of the
      > > DOM. It certainly is a lot more intuitive than writing a reg exp to
      > > do the same thing. So if you want to check it out, have a look.
      > >
      > >
      > Hi,
      >
      > I take interest in both DOM and regexes. DOM can get you the contents
      > of a tag, but can it check if these contents match a particular
      > pattern?
      >
      > Piotr
      >
    • Gauffin Claes
      Message 2 of 30 , Feb 1, 2001
        Hello Martha,

        You wrote:

        > Thank you all for responding. I made Claes' corrections to Eb's clip
        > but cannot get it to work at all. What am I doing wrong?
        >
        > :loop
        > ; find (and select) the TD tag, that contains ","
        > ^!Find <TD [^/]+>[^,]+,[^<]*</TD> RSI
        > ^!IfError Done
        > ; strip off the html
        > ^!Set %name%=^$StrStripHTML("^$GetSelection$";0)$
        > ; copy the name to (including) the comma
        > ^!Set %Lname%=^$StrCopy(^%name%;1;^$StrPos(",";^%name%;1)$)$
        > ; strip off the comma
        > ^!Set %Lname%=^$StrDeleteRight(^%Lname%;1)$
        > ^!Goto loop
        > :Done
        >

        First:
        Eb's clip is not quite complete. It uses a clever regular expression to
        catch the surnames from your data, but does not deal with what you do
        with the extracted names. One possible completion of his clip could be
        this, which places the extracted names in a new document:

        H="Extract surnames (version 1)"
        ^!ClearVariable %a%
        ^!Jump text_start
        :loop
        ; find (and select) the TD tag, that contains ","
        ^!Find <TD [^/]+>[^,]+,[^<]*</TD> RSI
        ^!IfError Done
        ; strip off the html
        ^!Set %name%=^$StrStripHTML("^$GetSelection$";0)$
        ; copy the name to (including) the comma
        ^!Set %Lname%=^$StrCopy(^%name%;1;^$StrPos(",";^%name%;1)$)$
        ; strip off the comma
        ^!Set %Lname%=^$StrDeleteRight(^%Lname%;1)$
        ^!append %a%=^%Lname%^p
        ^!Goto loop
        :Done
        ^!Toolbar New Document
        ^!Inserttext ^%a%
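
For readers following along without NoteTab, the same find, strip and copy-to-comma steps can be sketched in Python. This is only an illustration, not part of the clip: the sample rows reuse the names quoted in this thread, while the WIDTH attribute, the second column and the name extract_surnames are made up. re.IGNORECASE is used on the assumption that the I option of ^!Find means "ignore case".

```python
import re

# Sample rows built from the names quoted in this thread; the WIDTH attribute
# and the second column are made up for illustration.
HTML = """<TABLE>
<TR><TD WIDTH="50%">Reitz, Ed. G.</TD><TD>row 1</TD></TR>
<TR><TD WIDTH="50%">Reitz, Ida A.</TD><TD>row 2</TD></TR>
<TR><TD WIDTH="50%">Reitz, Edward W.</TD><TD>row 3</TD></TR>
</TABLE>"""

# Same pattern as the ^!Find line in the clip.
CELL = re.compile(r"<TD [^/]+>[^,]+,[^<]*</TD>", re.IGNORECASE)


def extract_surnames(html):
    """Return the text before the comma from every matching TD cell."""
    surnames = []
    for match in CELL.finditer(html):
        # Strip the tags from the selected cell, as ^$StrStripHTML$ does.
        name = re.sub(r"<[^>]+>", "", match.group(0))
        # Keep everything before the first comma.
        surnames.append(name[: name.index(",")])
    return surnames


print("\n".join(extract_surnames(HTML)))
```

Note that the pattern only matches TD tags that carry at least one attribute (it requires a space after "TD"), so a plain <TD> cell is skipped.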


        Second:
        The regular expression approach suffers from being rather slow when
        executing. Therefore I would like to push a bit for the strip-html approach.

        Third:
        There seems to be some confusion on what it is you really want.
        Your mails say "surnames" but in a previous mail you wrote
        >...
        >When what I want is this:
        >
        >Reitz, Ed. G.
        >Reitz, Ida A.
        >Reitz, Edward W.
        >...
        which indicates that you want the full names.
        Eb's clip extracts the surnames.

        The following is a clip using html-strip (therefore quite fast) which
        will extract the full names.
        The result will be sorted. When sorting, you can choose whether
        you want duplicates removed or not.
        This is done by checking or unchecking
        View > Options > Tools > Sort Removes Duplicates
        If you do want just the surnames, this can be done too.

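        H="Extract surnames (version 2)"
        ^!SetHintInfo Working...
        ^!SetScreenUpdate Off
        ^!SetWordWrap OFF
        ; select the whole document and strip the html
        ; (this is the html-strip step; it assumes SHIFT+CTRL+T runs that command)
        ^!Jump TEXT_START
        ^!Select ALL
        ^!Keyboard SHIFT+CTRL+T
        ; move everything after each tab to its own line, marked with "zzzzz"
        ^!Replace "^t" >> "^pzzzzz" TWSAI
        ; sorting pushes the marked lines below the names
        ^!ToolBar Sort Ascending
        ; delete from the first marked line to the end of the document
        ^!Find "zzzzz" TIWS
        ^!Set %r%=^$getrow$
        ^!Jump text_end
        ^!SelectTo ^%r%:1
        ^!Keyboard DELETE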
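
The idea behind this clip (strip the tags, keep only what comes before the first tab on each line, then sort) can be sketched in Python as well. This is a rough illustration under assumptions: it supposes that stripping the HTML leaves table cells separated by tabs, which is what the ^t replacement in the clip relies on, and the names full_names and remove_duplicates are invented here. The remove_duplicates flag stands in for the "Sort Removes Duplicates" option.

```python
import re

# Sample rows built from the names quoted in this thread; the second column
# and the repeated row are made up for illustration.
TABLE = """<TABLE>
<TR><TD>Reitz, Ida A.</TD><TD>row 1</TD></TR>
<TR><TD>Reitz, Ed. G.</TD><TD>row 2</TD></TR>
<TR><TD>Reitz, Edward W.</TD><TD>row 3</TD></TR>
<TR><TD>Reitz, Ida A.</TD><TD>row 4</TD></TR>
</TABLE>"""


def full_names(html, remove_duplicates=False):
    """Return the sorted first-column text of every table row."""
    # Turn cell boundaries into tabs and row ends into line breaks.
    text = re.sub(r"(?i)</t[dh]>\s*<t[dh][^>]*>", "\t", html)
    text = re.sub(r"(?i)</tr>", "\n", text)
    # Drop whatever tags are left.
    text = re.sub(r"<[^>]+>", "", text)
    # Keep only the part of each line before the first tab.
    names = [line.split("\t")[0].strip() for line in text.splitlines()]
    names = [name for name in names if name]
    if remove_duplicates:
        names = set(names)
    return sorted(names)


print("\n".join(full_names(TABLE, remove_duplicates=True)))
```

Because no pattern has to be matched against every cell, this does far less work per row than the regular expression version, which is the point of the strip-html approach.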

        Regards /Claes
      • Grant
        Message 3 of 30 , Feb 1, 2001
          > > If you want to take a look, I posted a DOM way to do this on the
          > > NoteTab html list, subject: "Extracting table data with the DOM". I
          > > have heard 'reg exp' is losing favour because of the power of the
          > > DOM. It certainly is a lot more intuitive than writing a reg exp to
          > > do the same thing. So if you want to check it out, have a look.

          > I take interest in both DOM and regexes. DOM can get you the contents
          > of a tag, but can it check if these contents match a particular
          > pattern?

          No it can't, but using reg exp to extract a table's first-column surnames
          in an html doc is like using a chainsaw to cut butter.
          In comparison it took me about 5 minutes to write that DOM script to
          extract the table's first-column data, because it's the right tool for
          this job.
          The DOM provides an easy way to navigate text marked up with html, xhtml
          or xml, while regular expressions are good at finding patterns in
          unstructured text. They are not competing technologies but complementary.
          Working with the DOM I'm not pattern matching but working directly with
          the document's structured objects: the table's collection of rows and
          the first child of each row, to get the first td column.
          Having extracted the first column, if I want to find all the 'parkers'
          in that extracted data, then using reg ex is handy.
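
A minimal sketch of that division of labour, in Python rather than the DOM script posted on the html list (which is not reproduced here): the DOM walks the rows and takes the first td of each, and a regular expression then filters the extracted text. It assumes well-formed XHTML so that the standard xml.dom.minidom parser can read it. The sample rows, the "Parker, Ann" entry and the function name first_column are made up for illustration.

```python
import re
from xml.dom import minidom

# Well-formed sample markup; "Parker, Ann" and the second column are made up.
XHTML = """<table>
<tr><td>Reitz, Ed. G.</td><td>row 1</td></tr>
<tr><td>Parker, Ann</td><td>row 2</td></tr>
<tr><td>Reitz, Ida A.</td><td>row 3</td></tr>
</table>"""


def first_column(markup):
    """Return the text of the first td in every tr, using the DOM."""
    doc = minidom.parseString(markup)
    cells = []
    for row in doc.getElementsByTagName("tr"):
        tds = row.getElementsByTagName("td")
        if not tds:
            continue
        # Join the text nodes directly inside the first cell.
        text = "".join(
            node.data
            for node in tds[0].childNodes
            if node.nodeType == node.TEXT_NODE
        )
        cells.append(text.strip())
    return cells


names = first_column(XHTML)
# Structure handled by the DOM; the pattern test is left to a regular expression.
parkers = [name for name in names if re.match(r"parker\b", name, re.IGNORECASE)]
print(names)
print(parkers)
```

The DOM part never looks at the text of a cell, and the regular expression never looks at a tag, which is why the two complement each other.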
        • Jody
          Message 4 of 30 , Feb 1, 2001
            Hi Martha,

            >I tried this, too. It stripped the HTML tags but it left
            >everything in a single column. I could take out several of them
            >but not all, by using search and replace. This can't be what you
            >mean because it took me more than a few seconds. Would you
            >please be a little more specific about what I need to do?

            It has been so long now I forget what it was and can't find it.
            I know it worked on whatever you sent in. At the present I do
            not have time for it though. Maybe the others are not working
            for you either because what you are sending in is not the same as
            what you are running the Clip over.

            I just saw you got it another way, so whatever works! :)

            Happy Clip'n!
            Jody

            http://www.notetab.net

            Subscribe, UnSubscribe, Options
            mailto:Ntb-Clips-Subscribe@yahoogroups.com
            mailto:Ntb-Clips-UnSubscribe@yahoogroups.com
            http://www.egroups.com/group/ntb-clips
          • Piotr Bienkowski
            Message 5 of 30 , Feb 3, 2001
              On 2 Feb 2001, at 10:22, Grant wrote:

              > No it can't, but using reg exp to extract a table's first-column
              > surnames in an html doc is like using a chainsaw to cut butter. In
              > comparison it took me about 5 minutes to write that DOM script to
              > extract the table's first-column data, because it's the right tool
              > for this job.

              Righto! Chisels are not for fixing tractors. :)

              Piotr
            • Jody
              Message 6 of 30 , Feb 5, 2001
                Hi Piotr,

                >Wondered why my message popped up with a few days' delay... :(

                It appears a few of them just got spit out.

                > > If you want to take a look, I posted a DOM way to do this on
                > > the NoteTab html list, subject: "Extracting table data with the
                > > DOM". I have heard 'reg exp' is losing favour because of the
                > > power of the DOM.

                Happy Clip'n!
                Jody

              • Luuk.Houwen@t-online.de
                Message 7 of 30 , Feb 5, 2001
                  I would like to count the number of times my program goes through a loop. I
                  tried the following line within the loop, but it does not work. Any ideas
                  about improving it?

                  ^!Set %Counter%=^$Calc(x=x+1)$

                  Luuk