
Re: [Clip] searching for large blocks of repeated text

  • Wayne VanWeerthuizen
    Message 1 of 8, Oct 25, 2005
      Surprisingly, the ^$GetMD5Text()$ function is probably the best way to find
      repeated blocks of text in a moderately large file. The technique is
      fast, simple (at least once you grasp the concept), and requires only
      one pass through the file being searched.

      The ^$GetMD5Text()$ function takes a string of any length and creates a
      32-digit hexadecimal value from it. For all practical purposes, this
      value can be considered unique to the string used to create it. (MD5 is
      designed so that finding two strings with the same value is very
      difficult, so we simply do not need to worry about the insignificant
      chance of encountering such a collision accidentally. Even among a
      million million distinct strings, the chance that any two of them share
      an MD5 value is well under one in a hundred million million.)
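
      (For reference, here is the same idea in ordinary Python rather than a
      NoteTab clip; hashlib.md5 produces the same kind of 32-digit hex value
      that ^$GetMD5Text()$ returns. The sample string is just an invented
      example.)

      import hashlib

      # Any string, of any length, maps to a fixed 32-character hex digest.
      text = "John Smith, b. 1842; notes duplicated during the merge."
      digest = hashlib.md5(text.encode("utf-8")).hexdigest()

      print(digest)       # a 32-character hexadecimal string
      print(len(digest))  # 32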

      The value returned by ^$GetMD5Text()$ is very convenient for use as a
      variable name. It will never contain any forbidden characters, and it is
      short enough that it does not require much memory to store. (The
      typical modern computer should be able to create millions of such variables
      before running into "out of memory" issues.)

      Anyway, you can use ^$GetMD5Text()$ as part of a method for keeping track
      of what large items a program has previously seen -- good for any
      application where we need to check an unsorted collection of items for
      duplicates.

      Below, I use ^%^%Seen%% (a NoteTab variable inside a NoteTab variable) to
      take the MD5 checksum calculated in the previous line and create a
      variable of that name, which we set to 1 for each text string we find.
      But before that, we need to check whether it was already set to 1 - in
      which case we have found a duplicate.
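
      (Again as a rough Python illustration of the same bookkeeping: a plain
      dictionary stands in for the dynamically named ^%^%Seen%% variables, one
      entry per distinct block of text.)

      import hashlib

      def is_duplicate(block, seen):
          """Report whether this block's MD5 was recorded before; record it if not."""
          key = hashlib.md5(block.encode("utf-8")).hexdigest()
          if key in seen:      # plays the role of the ^!IfEmpty test below
              return True
          seen[key] = 1        # plays the role of ^!Set %^%Seen%%=1
          return False

      seen = {}
      print(is_duplicate("some repeated notes", seen))  # False -- first sighting
      print(is_duplicate("some repeated notes", seen))  # True  -- duplicate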

      An outline of the clip is below. I've provided the four most critical
      lines of code; you will need to work out the remaining details yourself...

      Wayne VanWeerthuizen



      :StartOfLoop

      ;--- Insert code here to read the next block of text into the
      ;--- variable %text%.
      ;---
      ;--- Simply getting one line at a time may be too risky. You probably
      ;--- do not want duplicated blank lines to be removed. And short lines
      ;--- at the ends of distinct paragraphs might still be identical (such
      ;--- as if each ends with a common word on its own line).
      ;---
      ;--- It is okay (and may be necessary for your situation!) if these
      ;--- blocks of text overlap. For instance, I recommend you try getting
      ;--- every adjacent pair of nonblank lines.

      ;--- Be sure to include code to exit the loop once no more blocks of
      ;--- text can be read.

      ^!Set %Seen%="^$GetMD5Text("^%text%")$"
      ^!IfEmpty ^%^%Seen%% NotPreviouslySeen

      ; %text% was seen before, so this is a duplicate
      ; ---- Insert code here to delete duplicate ----

      :NotPreviouslySeen
      ^!Set %^%Seen%%=1

      ;Loop and repeat
      ^!Goto StartOfLoop
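
      (For anyone who would rather prototype outside NoteTab first, here is a
      minimal end-to-end sketch of the same approach in Python. It works on
      whole paragraphs -- blocks separated by blank lines -- rather than the
      overlapping line pairs suggested above, and keeps only the first copy of
      each repeated block. The file names are placeholders.)

      import hashlib

      def dedupe_paragraphs(in_path, out_path):
          with open(in_path, encoding="utf-8") as f:
              blocks = f.read().split("\n\n")

          seen = {}   # MD5 digest -> 1, like the %^%Seen%% variables
          kept = []
          for block in blocks:
              key = hashlib.md5(block.strip().encode("utf-8")).hexdigest()
              if block.strip() and key in seen:
                  continue          # duplicate block: drop it
              seen[key] = 1
              kept.append(block)

          with open(out_path, "w", encoding="utf-8") as f:
              f.write("\n\n".join(kept))

      dedupe_paragraphs("gedcom-notes.txt", "gedcom-notes-deduped.txt")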


      >-----Original Message-----
      >From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On
      >Behalf Of Jeffery Scism
      >Sent: Sunday, October 23, 2005 4:07 PM
      >To: ntb-clips@yahoogroups.com
      >Subject: [Clip] searching for large blocks of repeated text
      >
      >
      >I have a 9,000+ page long text document extracted from a genealogy
      >program. It contains literally thousands of pages of repeated text
      >strings created when that program "merged" identical individuals. It
      >repeated all of the notes from each person.
      >
      >Is there a way to check for large blocks of repeated text and delete
      >all but the first instance, without having to manually select each
      >block first? (Something like get Paragraph, and then doing a replace
      >with "nothing" on all other identical ones, then stepping to the next?)
      >
      >Since it is an extract from an original I am willing to use it to
      >"play" with, it can always be recreated in its much bloated form.
      >
      >Manually editing the original files is going to take years, unless I
      >can do it in a text editor instead of "in residence" in the program.
      >(In Gedcom format it can be edited in NoteTab.)
      >