
[Clip] Re: Sort & Eliminate Duplicate Lines

  • diodeom
    Message 1 of 23 , Jul 31, 2010
      "flo.gehrke" <flo.gehrke@...> wrote:
      > --- In ntb-clips@yahoogroups.com, "diodeom" <diomir@> wrote:
      > >
      > > I wouldn't necessarily look at speed as the criterion...
      > diodeom,
      > Thanks for the effort you take in this issue!
      > I still wonder if we couldn't make that job simpler. So if you are willing to test another version -- have a look at this...
      > ^!SetScreenUpdate Off
      > ^!Replace "^([^\r\n]+)(\X+?)?\R\K\1(\R|\Z)" >> "" AWRS
      > ^!IfError End Else Skip_-1
      > I've tested it with 100,000 lines, and it made a good job within a few seconds. I suppose you have better tools than me for testing that -- I'm using just NT and nothing else.

      NoteTab isn't big on sub-second values, so I'm letting it ask (via clip) the DOS Time function to help out. If you're interested, the timing clip I'm using is posted here, right under Sheri's:


      And I apply it like this:

      ^!Clip TimeIt
      ^!Repl... <pattern>
      ^!IfError End Else Skip_-1
      ^!Clip TimeIt

      I certainly like the conciseness of your new rendition. It helped me to this hairy realization: it's all lovely if we're testing on extremely short and extremely repetitious line samples, no matter how many, but what if any of our non-greedy patterns has to sniff through looong pages of lengthy lines before it finds (or doesn't find) the desired match? As I envision it, the poor Regex engine is told to be modest, so it timidly looks at one character at a time and checks whether the match follows immediately after. If not, it extends its capture by another single character and checks for the match again. And on and on. It's the exact reverse of the "spitting out" that greedy captures have to go through. No big deal if the sought-after "Bertha" is just twenty-some characters further, but what if it takes thousands and thousands of characters before one can be spotted or "found to be absent"? :) It ought to cost some time.
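
To make the picture above concrete, here is a toy cost model in Python (not a NoteTab clip, and not how NoteTab's regex engine is actually implemented; real engines apply optimizations this ignores). The function names `lazy_steps` and `greedy_steps` are made up for illustration. Each one only counts how many positions get tried before the target is seen, assuming a lazy capture grows forward one character at a time and a greedy capture gives back one character at a time from the end.

```python
def lazy_steps(text, target, start=0):
    """Toy model of a lazy capture: grow one character at a time from
    `start` and check whether `target` follows. Returns the number of
    positions tried, or -1 if `target` never appears."""
    steps = 0
    for pos in range(start, len(text) - len(target) + 1):
        steps += 1
        if text.startswith(target, pos):
            return steps
    return -1


def greedy_steps(text, target, start=0):
    """Toy model of a greedy capture: swallow everything first, then give
    back one character at a time until `target` follows. Returns the
    number of positions tried, or -1 if `target` never appears."""
    steps = 0
    for pos in range(len(text) - len(target), start - 1, -1):
        steps += 1
        if text.startswith(target, pos):
            return steps
    return -1


if __name__ == "__main__":
    near = "x" * 20 + "Bertha" + "y" * 5000   # match close to the start
    far = "x" * 5000 + "Bertha"               # match thousands of chars away
    print(lazy_steps(near, "Bertha"), greedy_steps(near, "Bertha"))
    print(lazy_steps(far, "Bertha"), greedy_steps(far, "Bertha"))
```

In this model the lazy scan is cheap when "Bertha" is near and expensive when she is far, and the greedy scan is the mirror image. Note that the two also differ in which occurrence they land on (first versus last) when there are several.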

      Well, I pasted just 100 unique lines (averaging about 125 chars long) twice into a test pile (to get a hundred single repetitions, each exactly a hundred lines from its original) and ran clips on it. The previous two patterns had to slave for 11 - 16 seconds to complete; the last one I cut short (after seeing in the status bar what it took per match).
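
For anyone wanting a comparable test pile, a sketch like the following (Python, not a clip) builds one under stated assumptions: random lowercase filler, a numeric prefix to guarantee uniqueness, and line lengths of roughly 100 to 150 characters. The function name `make_test_pile` and all of its parameters are hypothetical; the actual test text used above was not posted.

```python
import random
import string


def make_test_pile(unique_lines=100, avg_len=125, seed=0):
    """Return a list holding `unique_lines` distinct lines followed by the
    same lines again, so each duplicate sits exactly `unique_lines` lines
    after its original."""
    rng = random.Random(seed)            # fixed seed keeps the pile repeatable
    alphabet = string.ascii_lowercase + " "
    block = []
    for i in range(unique_lines):
        length = rng.randint(avg_len - 25, avg_len + 25)
        body = "".join(rng.choice(alphabet) for _ in range(length))
        block.append(f"{i:03d} {body}")  # numeric prefix guarantees uniqueness
    return block + block                 # "pasted twice"


if __name__ == "__main__":
    pile = make_test_pile()
    # Write with CRLF line ends, as a Windows text file would have.
    with open("test_pile.txt", "w", newline="\r\n") as handle:
        handle.write("\n".join(pile) + "\n")
```

The output file name `test_pile.txt` is likewise just a placeholder.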

      It might be junk science, but I'm thinking that having a match exactly 100 lines away (in the 200-line file) could be used to show how "forward-" and backtracking compare in draining the resources -- when both have the same average distance to span. When I run your previous clip on this text in both variants, with greedy and non-greedy \X, the speed results are pretty similar, as expected.

      After all that fun, I'd say it's only fair to try out a no-nonsense conventional loop... I've been trying to avoid:

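      ^!Set %n%=0
      :Loop
      ^!Inc %n%
      ; Stop once the counter passes the last line
      ^!If ^%n%>^$GetTextLineCount$ End
      ^!Jump ^%n%
      ^!Set %line%=^$GetParagraph(^%n%)$
      ; Delete each later copy of this line together with the line break before it
      ^!Replace "^p^%line%" >> "" AS
      ^!Goto Loop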

      ... to find out that it needs only about a third of the time compared to the nearest challenger... on either test file. :)
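
For readers without NoteTab at hand, roughly the same idea as the loop above can be sketched in Python. This is an illustration, not a translation of the clip: the function name `drop_later_duplicates` is made up, and it compares whole lines, whereas the clip's plain-text replace looks for the line's text right after a line break.

```python
def drop_later_duplicates(lines):
    """Keep the first occurrence of each line, in original order, by
    walking down the list and deleting every later copy of the current
    line (the same one-line-at-a-time approach as the loop above)."""
    result = list(lines)
    n = 0
    while n < len(result):
        current = result[n]
        # Remove every later line that equals the current one.
        result[n + 1:] = [ln for ln in result[n + 1:] if ln != current]
        n += 1
    return result


if __name__ == "__main__":
    sample = ["pear", "apple", "pear", "fig", "apple"]
    print(drop_later_duplicates(sample))
```

Unlike the sort-based approaches discussed in this thread, this keeps the lines in their original order.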
    • Ray Shapp
      Message 2 of 23 , Apr 11, 2011
        Hi Art,

        I'm just now going back through some unopened mail that was in a folder from
        an older computer. Your suggestion about using the right-click "short cut
        menu" is a good one. (I thought it was called the "context menu".) The
        original problem was solved in July, but I'm using your suggestion now for
        that problem and for other purposes too.

        Thank you.

        Ray Shapp

        On Fri, Jul 30, 2010 at 7:24 AM, Art Kocsis <artkns@...> wrote:

        > Hi Ray,
        > Although you have received lots of good tips on using a clip for your
        > problem, at the risk of putting my foot in my mouth, did you consider the
        > obvious? Setting the sort option to remove duplicates and putting the sort
        > and select all tools at the top of your shortcut menu?
        > View | Options | Tools | Sort Removes Duplicates [check box]
        > View | Options | Shortcut Menu | Select All [check box & move]
        > View | Options | Shortcut Menu | Sort Ascending [check box & move]
        > Then two right clicks anywhere in your document will accomplish your task
        > with a minimum of mouse movement and hunting for icons/menu items to click.
        > Personally, I try to use the built-in functions/tools as much as possible
        > to minimize the clutter of too many clips. I have about three dozen
        > favorite clips on my clipbar already. Using the shortcut menu takes less
        > time and effort than clicking on a clipbar icon, even if the clipbar real
        > estate wasn't so precious.
        > Just in case you weren't aware of this.
        > Art
        > At 07-29-10 15:51, you wrote:
        > >To All,
        > >
        > >Please direct me to a library of sort clips.
        > >
        > >For my current need, I am looking for a clip that will eliminate duplicate
        > >lines in a text file. It's ok if the clip doesn't do the sort. I can use a
        > >menu command to prepare the file in ascending or descending sequence before
        > >running the clip. It would be good if the clip can work with "folded" lines.
        > >I.e. lines that are a bit longer than the screen display. I normally run
        > >NoteTab with Word Wrap toggled ON (lines wrap to be visible on screen). If
        > >necessary, I could toggle Word Wrap OFF whenever I run the clip.
        > >
        > >Most lines in the file are less than 256 characters long. In fact, the
        > >average length is around 40 characters.
        > >
        > >I am running NoteTab Pro v6.2/fv.
        > >
        > >Thank you for the help.
        > >
        > >Ray Shapp
