Re: [Clip] Remove all lines not containing something

  • Don
    Message 1 of 30, Jun 28, 2011
      > **** I should have been clearer when I initially posted the question
      > that I was *NOT* looking for a clip but simply a Regex pattern. ****

      I can use a clip for this project, but I wonder if a regex could do it.

      I want to essentially ask for data that I do not want to find in a line
      and if that data is not in the line, delete the line.

      We concluded the other day that negated character classes work one
      character at a time (I think).

      I was trying something like this (not expecting it to quite work obviously):
      ^!Set %DataStuff%=org|net
      ^!Replace ".*[^%DataStuff%].*\r\n" >> "" RAWH

      For the moment assume I want to remove all .com email addresses from a
      list (in the end I want to prompt for the search phrase).

      I was doing it with a stepped clip, but it keeps jumping out of the loop
      half way through a job. I think it may be a keyboard delay issue as I
      use the backspace key.

      Input:
      john@...
      jane@...
      jeff@...
      fred@...

      Output:
      jane@...
      fred@...

      Existing clip:
      ^!Jump Doc_Start
      ^!Set %DataStuff%=^?{Term to Search For, Pipe Separated}
      :Loop
      ^!Select Eol
      ^!If "^$GetSelection$" <> "" Skip_2
      ^!Keyboard DELETE
      ^!Goto Advance

      ^!Find "^%DataStuff%" TIHRS
      ^!IfError Next ELSE JumpLine
      ^!DeleteLine
      ^!If ^$GetRow$ = ^$GetLinecount$ End
      ^!Goto Advance

      :JumpLine
      ^!If ^$GetRow$ = ^$GetLinecount$ Next ELSE Skip_2
      ^!DeleteLine
      ^!Goto End
      ^!Jump +1

      :Advance
      ;end at end of file
      ^!Goto Loop


      This clip works pretty well but derails occasionally -- again I think a
      keyboard issue. It also misses the last line so I need to think that
      through better.
    • Axel Berger
      Message 2 of 30, Jun 28, 2011
        Alec Burgess wrote:
        > The Input string against which the regexp must work is a complete
        > <pathname>\<filename>.<ext> where both path-name and filename *MAY*
        > contain spaces.

        Is there any reason why no one so far has suggested making use of the
        intrinsic functions ^$GetFileName(FileName)$ and ^$GetPath(FileName)$ ?
        With these it's easy to separate the name and only to work on that.

        Axel
      • diodeom
        Message 3 of 30, Jun 28, 2011
          Don wrote:
          >
          > I want to essentially ask for data that I do not want to find in a line
          > and if that data is not in the line, delete the line.
          >

          If you want to remove the lines that don't contain certain terms (e.g. either .org or .net), you could avoid any hassle of actually looking for them by instead just slurping the ones containing the "keeper" strings and pasting them over the selection.

          For example:

          ^!Select All
          ^!SetListDelimiter ^p
          ^$GetDocMatchAll("^.*\.(org|net).*$")$
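For readers without NoteTab, the same "keep the keepers" idea can be sketched in Python. This is only an illustration, not the clip itself: the function name `keep_matching_lines` and the `example.*` addresses are made up here (the addresses in the thread are elided), and the sketch assumes that the lines to keep are the ones containing `.org` or `.net`.

```python
import re

def keep_matching_lines(text: str, keeper_pattern: str) -> str:
    """Return only the lines of `text` that contain a match for `keeper_pattern`."""
    keeper = re.compile(keeper_pattern)
    # Collect the matching lines instead of hunting for the ones to delete.
    kept = [line for line in text.splitlines() if keeper.search(line)]
    return "\n".join(kept)

# Hypothetical sample data; the real addresses are not shown in the thread.
sample = (
    "john@example.com\n"
    "jane@example.org\n"
    "jeff@example.com\n"
    "fred@example.net"
)
print(keep_matching_lines(sample, r"\.(org|net)"))
```

Like the clip, this replaces the whole text with the matching lines rather than deleting the non-matching ones one at a time.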
        • Alec Burgess
          Message 4 of 30, Jun 28, 2011
            Axel: see my previous response ... it's *NOT* for NoteTab and hence,
            those functions are not available. Even if it *was* for NoteTab, those
            functions would require either a loop or line-by-line processing. CORRECT?

            Dollars to Donuts (Euros to Apfelstrudel?) the single regex would be
            faster <grin>

            On 2011-06-28 14:36, Axel Berger wrote:
            > Alec Burgess wrote:
            > > The Input string against which the regexp must work is a complete
            > > <pathname>\<filename>.<ext> where both path-name and filename *MAY*
            > > contain spaces.
            >
            > Is there any reason why noone so far has suggested making use of the
            > intrinsic functions ^$GetFileName(FileName)$ and ^$GetPath(FileName)$ ?
            > With these it's easy to separate the name and only to work on that.
            Actually I think someone did but haven't re-checked the thread.
          • Axel Berger
            Message 5 of 30, Jun 28, 2011
              Alec Burgess wrote:
              > its *NOT* for Notetab

              Got it, and will shut up now.

              Axel
            • Don
              Message 6 of 30, Jun 28, 2011
                On 6/28/2011 2:38 PM, diodeom wrote:
                > ^!Select All
                > ^!SetListDelimiter ^p
                > ^$GetDocMatchAll("^.*\.(org|net).*$")$

                You are brilliant and I thank you! I can't say how many times people on
                this list, you included as often as anyone, except perhaps the dear
                departed Jody and Sheri, have pointed out the obvious way of doing
                something I sit there scratching my head over.

                ^!Set %DataTested%=^?{RegEx Term to Search For, Pipe Separated "or"}
                ^!Select All
                ^!SetListDelimiter ^p
                ^!Set %DataOutput%="^$GetDocMatchAll("^.*(^%DataTested%).*$")$"
                ^!InsertText ^%DataOutput%

                Perfect!
              • flo.gehrke
                Message 7 of 30, Jun 29, 2011
                  --- In ntb-clips@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                  >
                  > Hi Frank/Flo:
                  > Yes this was my original question....
                  > **** I should have been clearer when I initially
                  > posted the question that I was *NOT* looking for
                  > a clip but simply a Regex pattern. ****

                  Alec,

                  No problem -- your posting was perfectly clear.

                  My last message (#21834), however, was a reply to "tf/Frank/acmewebwerks". It didn't address your original question but an issue that was posted by Frank on June 19 (#21816). For an unknown reason, he posted the same message again on June 28 (#21827) -- or was this a "Yahoo trick"?

                  Maybe he could clarify this confusion and answer whether those clips are matching *his* needs or not.

                  Regards,
                  Flo
                • diodeom
                  Message 8 of 30, Jun 29, 2011
                    Alec Burgess wrote:
                    >
                    > (...) Even if it *was* for notetab, those functions would require
                    > either a loop on line-by-line processing. CORRECT?
                    >
                    > Dollars to Donuts (Euros to appfel strudel?) the single regex would
                    > be faster <grin>
                    >

                    I'm not much for pastry, but I wouldn't mind high-rolling on crunchy Reibekuchen. <dreamy smile>

                    I'd propose that the apparent efficiency of PCRE, a streamlined C library versus comparatively slow iterations of highly interpreted Clip lingo has many of us "speed junkies" afflicted by LAS, the loop avoidance syndrome. In my severe case (and within the tunnel vision of my needs) I often see Clips just as a mere convenient interface to the mesmerizing powers of the "proper beast of burden," RegEx. And I ain't much apologetic about it. Only sporadic inklings of broader perspective remind me that, apples to apples, there is nothing slow about loops... when written in C. And even though within the contrastingly "lethargic" realm of Clips a disdain for these indispensable constructs often has pragmatic justification, I'm afraid that LAS makes me occasionally miss out on conceptually more straightforward solutions. Is there a pill for that?

                    To follow your digression "Even if it *was* for notetab," I think one would still need to cycle through a directory of files to rename them one by one, whether the new names were acquired from some RegEx-manipulated list or not. In this context, right along Axel's observation, the core iterated portion of a simple file-renaming clip (that you didn't ask for to begin with, I know, I know) could look as follows:

                    ^!RenameFile "^%f%" "^%p%^$StrReplace(" ";"";^$GetFileName(^%f%)$;0;0)$"

                    ... where %f% stands for complete filename and %p% for its path portion. I believe this line framed in a loop of either of the GetFile methods could quite efficiently and elegantly accomplish the task delegated to MultiRen you were originally helping out with.
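Outside NoteTab, the name/path split that `^$GetFileName$` and `^$GetPath$` provide could be sketched as follows. This is a hedged illustration only: `name_without_spaces` is a hypothetical helper, the sample path is invented, and the function merely computes the new name without renaming anything on disk.

```python
from pathlib import PureWindowsPath

def name_without_spaces(full_name: str) -> str:
    """Strip spaces from the file-name part only; the directory part is untouched."""
    path = PureWindowsPath(full_name)
    # path.name is the last component (file name plus extension).
    return str(path.with_name(path.name.replace(" ", "")))

# Hypothetical path in which both the folder and the file name contain spaces.
print(name_without_spaces(r"C:\My Docs\race results 2011.txt"))
```

Separating the two parts first means no regex has to cope with spaces in the path portion.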
                  • flo.gehrke
                    Message 9 of 30, Jun 29, 2011
                      --- In ntb-clips@yahoogroups.com, "diodeom" <diomir@...> wrote:
                      >
                      > Don wrote:
                      > >
                      > > I want to essentially ask for data that I do not want to find in a line
                      > > and if that data is not in the line, delete the line.
                      > >
                      >
                      > If you want to remove the lines that don't contain certain terms (e.g. either .org or .net), you could avoid any hassle of actually looking for them by instead just slurping the ones containing the "keeper" strings and pasting them over the selection.
                      >
                      > For example:
                      >
                      > ^!Select All
                      > ^!SetListDelimiter ^p
                      > ^$GetDocMatchAll("^.*\.(org|net).*$")$
                      >

                      diodeom,

                      You are absolutely right, of course. But allow me some hair-splitting here ;-)

                      To delete a line "if that data is not in the line (Don)" would be a kind of *negative* definition of search criteria.

                      According to Don's sample, we wouldn't search positively for 'org' or 'net'; instead we would literally keep all lines that do not end in 'com' or 'ru' (and so delete those that do). So, based on your approach, another and possibly more precise solution could be...

                      ^!Select All
                      ^!SetListDelimiter ^p
                      ^$GetDocMatchAll("^.+$(?<!com|ru)")$

                      It also avoids the '.*' that we see in your line...

                      ^$GetDocMatchAll("^.*\.(org|net).*$")$

                      which tests for something that, in Don's list, actually never occurs between the domain and the end of line. I assume this was meant as a trick to work around the error that NT (Pro 6.2) makes with

                      ^$GetDocMatchAll("^.*\.(org|net)$")$

                      In this case, NT stumbles over the '$' inside the parentheses. To avoid any error that might be caused by something that possibly follows the domain we could change your line to...

                      ^$GetDocMatchAll("^.*\.(org|net)^%Dollar%")$

                      BTW, also a simple one-liner might perform well...

                      ^!Replace "^.+(ru|com)(\R|\Z)" >> "" WARS

                      Regards,
                      Flo
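A rough Python counterpart of the lookbehind idea is sketched below, under two stated assumptions: the text uses plain `\n` line endings, and the sample addresses are invented. Python's `re` module insists that a lookbehind has one fixed width, so the PCRE form `(?<!com|ru)` is written here as two separate assertions.

```python
import re

# Match a whole line, then assert at its end that it does not end in
# "com" and does not end in "ru". Two lookbehinds are used because
# Python's `re` rejects alternatives of different lengths in one lookbehind.
NOT_COM_OR_RU = re.compile(r"^.+$(?<!com)(?<!ru)", re.MULTILINE)

def keep_lines_not_ending_in_com_or_ru(text: str) -> str:
    # findall returns the full match for each line because there are no groups.
    return "\n".join(NOT_COM_OR_RU.findall(text))

# Hypothetical sample data.
addresses = (
    "john@example.com\n"
    "jane@example.org\n"
    "ivan@example.ru\n"
    "fred@example.net"
)
print(keep_lines_not_ending_in_com_or_ru(addresses))
```

The criterion here is genuinely negative: nothing in the pattern names the endings that are kept.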
                    • diodeom
                      Message 10 of 30, Jun 29, 2011
                        Flo wrote:
                        >
                        > BTW, also a simple one-liner might perform well...
                        >
                        > ^!Replace "^.+(ru|com)(\R|\Z)" >> "" WARS
                        >

                        Well, of course -- provided that you somehow know all the terms that would mark lines for deletion. I understand the objective to be removal of lines in which certain terms are absent.
                      • Axel Berger
                        Message 11 of 30, Jun 29, 2011
                          diodeom wrote:
                          > has many of us "speed junkies" afflicted by LAS,
                          > the loop avoidance syndrome.

                          Actually I have never noticed loops slowing things down.

                          I have one file that's updated from time to time and now has 1.1 MB in
                          30.000 lines. It is big enough for consecutive ^!Replace commands to
                          become noticeable. The one single thing that really slows clips down is
                          frequent (several thousand in this case) ^!InsertSelect and ^!InsertText
                          commands over a selection.

                          But then I have thought about how to write an editor, just as a mental
                          exercise, and replacing one string by another of different length in the
                          middle of a text at any speed at all tied my brains in a knot.
                          Considering that, NoteTab's speed is quite remarkable as it is.

                          Axel
                        • diodeom
                          Message 12 of 30, Jun 29, 2011
                            I wrote:
                            >
                            > I understand the objective to be removal of lines in which certain
                            > terms are absent.
                            >

                            A funky one-line take could be something like:

                            ^!Replace "^((.*\.(org|net).*)|.++)(\R|\Z)(?(2)\K)" >> "" WARS

                            ... where the "(?(2)\K)" bit is a conditional subpattern that checks whether $2, that is the "(.*\.(org|net).*)" substring, was captured, and if so, resets the whole match with \K to nothing, so the empty replacement leaves this line intact. When $2 isn't captured, the match stays as it is and the line is wiped out by the replacement.
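Python's `re` module has no `\K`, so the trick cannot be copied literally. A hedged approximation of the same decision ("was the keeper alternative the one that matched?") uses a replacement callback instead of a conditional subpattern. The names `LINE` and `drop_lines_without_keeper` and the sample addresses are invented for this sketch.

```python
import re

# Alternative 1 (captured as group 1) matches a line containing .org or .net;
# alternative 2 matches any other non-empty line. The tail consumes the line
# break, or matches at the very end of the text.
LINE = re.compile(r"^(?:(.*\.(?:org|net).*)|.+)(?:\r?\n|\Z)", re.MULTILINE)

def drop_lines_without_keeper(text: str) -> str:
    # If group 1 took part in the match, the line is a keeper and is returned
    # unchanged; otherwise the whole line, break included, becomes "".
    return LINE.sub(lambda m: m.group(0) if m.group(1) is not None else "", text)

# Hypothetical sample data.
mail_list = (
    "john@example.com\n"
    "jane@example.org\n"
    "jeff@example.com\n"
    "fred@example.net"
)
print(drop_lines_without_keeper(mail_list))
```

The callback plays the role of the conditional: it inspects the capture group and decides whether the replacement changes anything.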
                          • diodeom
                            Message 13 of 30, Jun 30, 2011
                              Flo wrote:
                              >
                              > BTW, also a simple one-liner might perform well...
                              >
                              > ^!Replace "^.+(ru|com)(\R|\Z)" >> "" WARS
                              >

                              I have to admit that I didn't consider the condition where the "keeper" terms are always at the line's end -- despite the provided sample data. Your look-behind is a beautifully simple solution for this case.
                            • diodeom
                              Message 14 of 30, Jun 30, 2011
                                I wrote:
                                >
                                > I have to admit that I didn't consider the condition where the "keeper" terms are always at the line's end -- despite the provided sample data. Your look-behind is a beautifully simple solution for this case.
                                >

                                Sorry, Flo -- I quoted the wrong fragment. Here's your pattern I'm referring to, placed in a swap statement:

                                ^!Replace "^.+$(?<!org|net)(\R|\Z)" >> "" WARS
                              • Don
                                Message 15 of 30, Jun 30, 2011
                                  On 6/30/2011 7:49 AM, diodeom wrote:
                                  > I wrote:
                                  >>
                                  >> I have to admit that I didn't consider the condition where the "keeper" terms are always at the line's end -- despite the provided sample data. Your look-behind is a beautifully simple solution for this case.
                                  >>
                                  >
                                  > Sorry, Flo -- I quoted the wrong fragment. Here's your pattern I'm referring to, placed in a swap statement:
                                  >
                                  > ^!Replace "^.+$(?<!org|net)(\R|\Z)" >> "" WARS

                                  To be clearer ;-)

                                  I only provided a sample, it might be much different and I don't know
                                  the data going in -- so I literally only know what I want, not what I
                                  don't want. There are hundreds of lines if not thousands sometimes
                                  where I use this. Often it is for race results of running races where I
                                  want to extract a particular team. So we cannot count on \R immediately
                                  after the search term. I assume \Z is file end? Have to look that one
                                  up. So it may appear ANYWHERE IN THE LINE, not only at the end.

                                  I misplaced Flo's email before I responded to it. I'll dig it out again
                                  later as it did have a lot of good stuff in it.
                                • flo.gehrke
                                  Message 16 of 30, Jun 30, 2011
                                    --- In ntb-clips@yahoogroups.com, "diodeom" <diomir@...> wrote:
                                    >
                                    > Sorry, Flo -- I quoted the wrong fragment. Here's your pattern I'm referring to, placed in a swap statement:
                                    >
                                    > ^!Replace "^.+$(?<!org|net)(\R|\Z)" >> "" WARS

                                    Thanks, diodeom! We have often been asked to explain such patterns to members who are less acquainted with RegEx. So let me append...

                                    ^.+$(?<!org|net)(\R|\Z)

                                    ^ = assertion matching at the start of line
                                    .+ = one or more characters of any type (except NL)
                                    $ = end of line

                                    When arriving at the end of line the RegEx Engine tests...

                                    (?<!org|net) = Negative Lookbehind Assertion matching a position where you do NOT see 'org' or 'net' when looking behind

                                    (\R|\Z) = alternation matching a line break (e.g. CRLF) or the end of the string

                                    In this case, we have genuine negative search criteria in the sense of Don's original question ("Removing lines not containing something"). So to speak, the RegEx is able "to find something that is not there" ;-)

                                    Regards,
                                    Flo
                                  • diodeom
                                    Message 17 of 30, Jun 30, 2011
                                      Flo wrote:
                                      >
                                      > (?<!org|net) = Negative Lookbehind Assertion matching a position where you do NOT see 'org' or 'net' when looking behind
                                      >

                                      Life would be perfect if lookbehinds could accept variable-length patterns...

                                      To "not capture" the term located anywhere in the line (e.g. in "John Doe;john@...;555-555-5555") a lookahead could offer the necessary flexibility:

                                      ^!Replace "^(?!.*(org|net).*$).+(\R|\Z)" >> "" WARS
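The lookahead version translates to Python almost unchanged. The sketch below is an approximation rather than the clip: `\R` is replaced by `\r?\n` because Python's `re` has no `\R`, the constant and function names are invented, and the e-mail addresses in the sample records are hypothetical stand-ins for the elided ones.

```python
import re

# At the start of each line, look ahead: if "org" or "net" occurs anywhere
# on the line, the negative lookahead fails and the line is left alone.
# Otherwise the line and its line break are matched for deletion.
NO_KEEPER_LINE = re.compile(r"^(?!.*(?:org|net)).+(?:\r?\n|\Z)", re.MULTILINE)

def delete_lines_without_org_or_net(text: str) -> str:
    return NO_KEEPER_LINE.sub("", text)

# Hypothetical records with the term in the middle of the line.
records = (
    "John Doe;john@example.org;555-555-5555\n"
    "Jane Roe;jane@example.com;555-555-5555\n"
)
print(delete_lines_without_org_or_net(records))
```

Because the lookahead scans forward from the line start, the keeper term may sit anywhere on the line, which a fixed-width lookbehind cannot express.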
                                    • flo.gehrke
                                      Message 18 of 30, Jun 30, 2011
                                        --- In ntb-clips@yahoogroups.com, Don <don@...> wrote:

                                        > To be clearer ;-)
                                        >
                                        > I only provided a sample, it might be much different and I
                                        > don't know the data going in...

                                        Hi Don,

                                        You gave us two conditions now...

                                        > I want to essentially ask for data that I do not want to find
                                        > in a line... (#21836)

                                        and

                                        > ...so I literally only know what I want, not what I
                                        > don't want. (#21850).

                                        Sorry, that's a little bit inconsistent, isn't it?

                                        Never mind! There's certainly a way to resolve your task -- even if it differs from the sample data in your first message and if your search criteria have to match "anywhere in the line".

                                        You know that it would be helpful to see some more sample data...

                                        Regards,
                                        Flo
                                      • diodeom
                                        Message 19 of 30, Jun 30, 2011
                                          Flo wrote:
                                          >
                                          > There's certainly a way to resolve your task -- even if it differs from the sample data in your first message and if your search criteria have to match "anywhere in the line".
                                          >

                                          Don's justly universal "anywhere in the line" intent is probably most apparent in his outline ".*[^%DataStuff%].*\r\n" (which I followed with the same .* accommodations in each offered solution).
                                        • Don
                                          Message 20 of 30, Jun 30, 2011
                                            > Hi Don,
                                            >
                                            > You gave us two conditions now...
                                            >
                                            >> I want to essentially ask for data that I do not want to find
                                            >> in a line... (#21836)
                                            >
                                            > and
                                            >
                                            >> ...so I literally only know what I want, not what I
                                            >> don't want. (#21850).

                                            You may be right -- this is hard to explain. I guess ... I think ...
                                            maybe ... what I want is to say in the end leave all lines that contain
                                            one or more pieces of data, and delete/remove all other lines.

                                            I tried to come up with an easy example, but didn't mean to limit the
                                            project to that particular set of data. I hope this can be universal.

                                            I think we are getting really close.

                                            In fact the three line clip I posted yesterday or so does what I want
                                            ... I think.

                                            But this negative and positive lookaround stuff is interesting and worth
                                            discussing, and I would like to learn even more about it.

                                            Your look behind works only if it is at the end of the line.

                                            I will give some examples:
                                            http://michigancrosscountry.com/wp-content/uploads/Region-1-1-Boys.txt

                                            Say I want everyone listed from Grand Blanc and Alpena ...

                                            so I use:
                                            ^!Replace "^(?!.*(Grand Blanc|Alpena).*$).+(\R|\Z)" >> "" WARS

                                            1 Omar Kaddurah 12 Grand Blanc 15:37.28 1
                                            6 Drake Carr 12 Grand Blanc 16:18.58 6
                                            8 Zachary Kughn 11 Grand Blanc 16:25.06 8
                                            13 Jalen Payne 12 Grand Blanc 16:40.68 13
                                            22 Scott Baughan 12 Grand Blanc 16:50.48 22
                                            23 Nicholas Lefler 12 Grand Blanc 16:51.02 23
                                            25 Carson Truesdell 10 Grand Blanc 16:56.10 25
                                            33 Ethan Crowell 10 Alpena 17:10.37 33
                                            46 R.J. Centala 9 Alpena 17:22.42 46
                                            50 Jared Labarge 11 Alpena 17:36.46 50
                                            51 Travis LaCross 11 Alpena 17:37.60 51
                                            53 Jacob Benson 10 Alpena 17:40.54 53
                                            69 Alexander Guzman 11 Alpena 18:14.48 69
                                            71 Nathan LaBarge 12 Alpena 18:17.02 71



                                            1 Grand Blanc 50 1 6 8 13 22 23 25
                                            10 Alpena 233 33 46 50 51 53 69 71


                                            Answer is I believe correct.

                                            So now I make it a two liner:
                                            ^!Set %DataTested%=^?{RegEx Term to Search For, Pipe Separated "or"}
                                            ^!Replace "^(?!.*(^%DataTested%).*$).+(\R|\Z)" >> "" WARS

                                            Now what if Grand Blanc is the first or last thing on a line ....?

                                            Seems to still work:
                                            1 Omar Kaddurah 12 Grand Blanc 15:37.28 1
                                            6 Drake Carr 12 Grand Blanc 16:18.58 6
                                            Grand Blanc 8 Zachary Kughn 11 Grand Blanc 16:25.06 8
                                            13 Jalen Payne 12 Grand Blanc
                                            22 Scott Baughan 12 Grand Blanc 16:50.48 22
                                            [deleted rest]

                                            I think I now have a universal "delete all lines not containing [fill in
                                            the blank using regex terms]" clip.
                                            It has no keyboard commands and appears to be blindingly fast.
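The prompted two-liner could be approximated in Python as below. This is a sketch with assumptions spelled out: `keep_lines_with_any` is a hypothetical name, the terms arrive as a list of literal strings (escaped with `re.escape`) rather than as a raw pipe-separated regex as in the clip, and the "Elsewhere High" result line is invented so that there is something to delete. The two race lines are taken from the results quoted above.

```python
import re

def keep_lines_with_any(text: str, terms: list[str]) -> str:
    """Delete every non-empty line that contains none of the literal `terms`."""
    # Escape each term so that characters such as "." are taken literally.
    alternatives = "|".join(re.escape(term) for term in terms)
    no_term_line = re.compile(
        rf"^(?!.*(?:{alternatives})).+(?:\r?\n|\Z)", re.MULTILINE
    )
    return no_term_line.sub("", text)

results = (
    "1 Omar Kaddurah 12 Grand Blanc 15:37.28 1\n"
    "2 Sample Runner 11 Elsewhere High 15:40.00 2\n"  # hypothetical line
    "33 Ethan Crowell 10 Alpena 17:10.37 33\n"
)
print(keep_lines_with_any(results, ["Grand Blanc", "Alpena"]))
```

Dropping the `re.escape` call would let the caller pass regex fragments, as the clip's prompt does, at the cost of having to escape special characters by hand.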
                                          • flo.gehrke
                                            Message 21 of 30, Jun 30, 2011
                                              --- In ntb-clips@yahoogroups.com, Don <don@...> wrote:

                                              > Your look behind works only if it is at the end of the line.

                                              Yes, Don, that's the way it works because it was adapted to the data in your first message...

                                              > Input:
                                              > john@...
                                              > jane@...
                                              > jeff@...
                                              > fred@...

                                              where the strings in question are positioned at the end of line.

                                              Unlike that data, the substrings now in question ('Grand Blanc' or 'Alpena') are found not at the end of the line but at any position in it. So the command to delete all lines that do NOT contain 'Grand Blanc' or 'Alpena' can be a little bit shorter...

                                              ^!Replace "^(?!.*(Grand Blanc|Alpena)).*(\R|\Z)" >> "" WARS

                                              However, your latest information shows that this job is more based on *positive* criteria (find 'Grand Blanc' or 'Alpena') than on *negative* criteria (exclude 'com' or 'ru' in your first message). So, in this case, it probably is of no advantage to work with a Lookaround. Maybe it will suffice just to run something like...

                                              ^!SetClipboard ^$GetDocListAll("^.*(Grand Blanc|Alpena).*(\R|\Z)";$0)$
                                              ^!Toolbar New Document
                                              ^$GetClipboard$

                                              If you want to overwrite the original list you could try...

                                              ^!Select All
                                              ^$GetDocListAll("^.*(Grand Blanc|Alpena).*(\R|\Z)";$0)$

                                              Regards,
                                              Flo