Re: [Clip] searching for large blocks of repeated text
Surprisingly, the ^$GetMD5Text()$ function is probably the best way to find
repeated blocks of text in a moderately large file. The technique is
fast, simple (at least once you grasp the concept), and requires only
one pass through the file being searched.
The ^$GetMD5Text()$ function takes a string of any length and creates a 32-digit hexadecimal
value from it. For all practical purposes, this value can be considered
unique to the string used to create it. (Finding two strings that give the
same MD5 value is designed to be very difficult. We simply do not need to
worry about the insignificant chance of encountering such a case
accidentally. The odds that two random items out of a set of a million
million items will have the same MD5 value are less than 1 in a million.)
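As an aside, the same kind of 32-digit checksum can be computed outside NoteTab. This Python sketch (my illustration, not part of the clip) uses the standard hashlib module; the helper name md5_hex is my own:

```python
import hashlib

def md5_hex(text):
    """Return the 32-character hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# The digest is always 32 hex digits, no matter how long the input is.
digest = md5_hex("Some large block of repeated notes...")
print(len(digest))  # 32
```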
The value returned by ^$GetMD5Text()$ is very convenient for use as a
variable name. It will never contain any forbidden characters, and it is
short enough that it does not require much memory to store. (The
typical modern computer should be able to create millions of such variables
before running into "out of memory" issues.)
Anyway, you can use ^$GetMD5Text()$ as part of a method for keeping track
of which large items a program has previously seen -- good for any
application where we need to check an unsorted collection of items for
duplicates.
Below, I use ^%^%Seen%% (a NoteTab variable inside a NoteTab variable) to
take the MD5 checksum calculated in the previous line and create a variable
of that name, which we will set to 1 for each text string we find. But before
setting it, we check whether it was already set to 1 -- in which case we have
found a duplicate.
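The "variable inside a variable" trick is really just a membership test keyed by the checksum. Sketched in Python instead of clip code (the names seen and is_duplicate are my own), the same bookkeeping looks like this:

```python
import hashlib

seen = {}  # maps MD5 digest -> 1, like the nested NoteTab variables

def is_duplicate(block):
    """Return True if this block's checksum was recorded before."""
    key = hashlib.md5(block.encode("utf-8")).hexdigest()
    if key in seen:   # the variable named by the checksum is already set
        return True
    seen[key] = 1     # first sighting: set the variable to 1
    return False

print(is_duplicate("some merged notes"))  # False on first sight
print(is_duplicate("some merged notes"))  # True the second time
```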
An outline of the clip is below. I've provided the four most critical lines
of code; you will need to work out the remaining details yourself...
;--- Insert code here to read the next block of text into the variable %text%.
;--- Simply getting one line at a time may be too risky. You probably do not
;--- want duplicated blank lines to be removed. And short lines at the ends
;--- of distinct paragraphs might still be identical (such as if each ends with
;--- a common word on its own line).
;--- It is okay (and may be necessary for your situation!) if these blocks
;--- of text overlap. For instance, I might recommend you try getting every
;--- adjacent pair of nonblank lines.
;--- Be sure to include code to exit the loop once no more blocks of
;--- text can be read.
^!Set %Seen%=^$GetMD5Text("^%text%")$
^!IfEmpty ^%^%Seen%% NotPreviouslySeen
; %text% was seen before, so this is a duplicate
;--- Insert code here to delete the duplicate ----
:NotPreviouslySeen
^!Set %^%Seen%%=1
;Loop and repeat
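Putting the outline together, here is a hypothetical end-to-end sketch in Python rather than clip code, hashing every adjacent pair of nonblank lines as suggested above (the function name find_duplicate_pairs and the sample text are my own, and a real clip would also delete the flagged blocks):

```python
import hashlib

def find_duplicate_pairs(lines):
    """Yield (line_index, pair) for every adjacent pair of nonblank
    lines whose MD5 checksum has been seen before."""
    nonblank = [(i, ln) for i, ln in enumerate(lines) if ln.strip()]
    seen = set()
    # Overlapping blocks: each nonblank line starts one pair and ends another.
    for (i, a), (_, b) in zip(nonblank, nonblank[1:]):
        key = hashlib.md5((a + "\n" + b).encode("utf-8")).hexdigest()
        if key in seen:
            yield i, (a, b)   # duplicate block found
        else:
            seen.add(key)

text = ["note A", "note B", "", "note A", "note B"]
dups = list(find_duplicate_pairs(text))
print(dups)  # the second "note A"/"note B" pair is flagged, at index 3
```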
>From: email@example.com [mailto:firstname.lastname@example.org] On
>Behalf Of Jeffery Scism
>Sent: Sunday, October 23, 2005 4:07 PM
>Subject: [Clip] searching for large blocks of repeated text
>I have a 9,000+ page long text document extracted from a genealogy
>program. It contains literally thousands of pages of repeated text
>strings created when that program "merged" identical individuals. It
>repeated all of the notes from each person.
>Is there a way to check for large blocks of repeated text and delete
>all but the first instance? Without having to manually select each block
>first? (something like get Paragraph, and then doing a replace with
>"nothing" on all others identical? then stepping to the next?
>Since it is an extract from an original I am willing to use it to "play"
>with, it can always be recreated in its much bloated form.
>(Manually editing the original files is going to take years, unless I
>can do it in a text editor instead of "in residence" in the program. (In
>Gedcom format it can be edited in NoteTab)