23859 RE: [Clip] Replacing All Words at Once

  • John Shotsky
    Jun 4, 2013
      In my years of working with hyphenated words, I came to some conclusions that have helped me. Originally, I removed all hyphens that
      were not associated with known hyphenated entities. That failed, because not all of the removed hyphens should have been removed, and it
      became impossible to find the words where a hyphen was incorrectly removed, such as strawberryvanilla (strawberry-vanilla). So I
      had to reverse the logic and make every hyphen STAY unless a regex rule specifically removes it. It is much easier to spot a word
      with a hyphen that doesn't belong than to identify a word where one is missing, since there is an unending number of such words.

      The end result is that I have written one regex that looks at prefixes and removes the hyphens after those that apply, another that
      looks at suffixes and does the same, and a third set that looks at both lines to decide whether the hyphen should be removed.

      I have found no single rule applying to letter case, as product names can be capitalized and hyphenated or not - it is all
      subjective. As a result, my users regularly send me words with hyphens that aren't correct, and I add each word to one of the three
      sections. I don't get many any more, but below is a copy of just the prefix code (all one line).
      >> "" AIRSW
      ^!IfError Next Else Skip_-1
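
      The prefix idea can be sketched outside NoteTab as well. The following is a minimal Python illustration, not the clip itself: the
      prefix list is hypothetical (the real list is not reproduced here), and the function name is made up for the example. Only a hyphen
      that follows a listed prefix is removed; every other hyphen stays.

```python
import re

# Hypothetical prefix list for illustration only; the real clip's list
# is not shown in this message.
PREFIXES = ["anti", "multi", "non", "over", "pre", "re", "semi", "under"]

# A listed prefix at a word boundary, followed by a hyphen and a letter.
PREFIX_HYPHEN = re.compile(
    r"\b(" + "|".join(PREFIXES) + r")-(?=[a-z])",
    re.IGNORECASE,
)

def strip_prefix_hyphens(text: str) -> str:
    """Remove the hyphen after a listed prefix; keep all other hyphens."""
    return PREFIX_HYPHEN.sub(r"\1", text)

print(strip_prefix_hyphens("pre-heat the strawberry-vanilla sauce"))
```

      Because unknown words are left alone, a wrongly kept hyphen remains visible in the text and the word can simply be added to the list.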

      If anyone wants my other two sets, please let me know and I can send them privately. There is actually a lot more to my hyphen clip
      set, because many words that should have hyphens aren't hyphenated in the source, and vice versa. So my code then goes into specific
      situations and words to either remove or add hyphens. For example, 'one-at-a-time' cannot be handled by the above methodology; it
      can only be handled by a specific clip designed to detect that set of letters, with or without (all of) the hyphens, with or without
      the spaces, and place the hyphens IF that is the correct thing to do. Thus, the result may differ from the source, but it will be
      grammatically correct anyway.
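
      As a rough Python sketch of such a word-specific rule (the function name is hypothetical, and this is not the actual clip): the
      pattern accepts the four words joined by a hyphen, a space, or nothing, and rewrites them in the fully hyphenated form. Deciding
      whether hyphenation is correct in a given sentence is left out of the sketch.

```python
import re

# "one", "at", "a", "time" joined by a hyphen, a space, or nothing at all.
# The word boundaries keep the rule from firing inside longer words.
ONE_AT_A_TIME = re.compile(r"\bone[- ]?at[- ]?a[- ]?time\b")

def hyphenate_one_at_a_time(text: str) -> str:
    """Rewrite every variant of the phrase as 'one-at-a-time'."""
    return ONE_AT_A_TIME.sub("one-at-a-time", text)

print(hyphenate_one_at_a_time("Add the eggs one-at a-time."))
```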

      RecipeTools Web Site: http://recipetools.gotdns.com/
      John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of flo.gehrke
      Sent: Tuesday, June 04, 2013 07:27
      To: ntb-clips@yahoogroups.com
      Subject: Re: [Clip] Replacing All Words at Once

      --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
      > I would start by getting a better OCR program. I use
      > OmniPage Pro (...)
      > I can tell you for certain that you will NEVER be able to use
      > NoteTab to 'automatically' fix all the errors lower grade
      > OCR software will make. It simply isn't predictable what the
      > errors will be (...)
      > For instance, nearly every word can be hyphenated in text, and not
      > always in the same fashion...

      I'm using Omnipage Pro but I still have to remove a lot of OCR-mistakes.

      I agree with you and Axel that it's impossible to capture all mistakes as particular entries in a clip (or a long list that would be
      called by a clip). I've been cleaning scanned text with NT for many years now. In the long run, you will come across an unlimited
      number of mistakes that follow no rule.

      In my experience, the only way to resolve this problem is with RegEx patterns. Regarding those examples Roopakshi posted with his
      first message like...

      dis- tribution --> distribution

      I would use a pattern like...

      ^!Replace "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" >> "" WARS

      in order to remove any '- ' (hyphen followed by a space) between a lowercase letter and an uppercase or lowercase letter.
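
      For readers who want to try the pattern outside NoteTab, here is an approximate Python equivalent. It is a sketch under stated
      assumptions: Python's re module has neither POSIX classes nor \K, so a lookbehind and ASCII ranges stand in for them, which means
      non-ASCII letters (German umlauts, for instance) are not covered. The function name is made up for the example.

```python
import re

# Lookbehind replaces "[[:lower:]]\K"; ASCII ranges replace the POSIX
# classes (an assumption: the original classes may match more letters).
BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")

def remove_hyphen_space(text: str) -> str:
    """Delete every '- ' between a lowercase letter and a letter."""
    return BAD_BREAK.sub("", text)

print(remove_hyphen_space("dis- tribution"))
```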

      I think, however, there's still another problem with hyphenation and bad breaks. Look at words like...

      corr- elation
      attorney-at- law
      proof- reading
      dis- tribution
      Hewlett- Packard

      In lines #2, 3, and 5, we have to remove only the space, not the hyphen. AFAIK, there are no linguistic rules that could help us
      to define patterns which could match any type of bad breaks -- at least in German (maybe it's easier in the English language?).
      That's why I'm working with a kind of "controlled replacement". Test the following clip against that list and you'll see how it
      works:
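      ^!Jump Doc_Start
      ^!SetWizardWidth 60
      ; Loop: find the next '- ' between letters, or leave when none is left
      :BadBreak
      ^!Find "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" RS
      ^!IfError Out
      ^!Goto ^?{Remove bad breaks:==_Space only|Hyphen+space|Skip}
      ; Keep the hyphen, drop the space
      :Space only
      ^!InsertText "-"
      ^!Goto BadBreak
      ; Drop both the hyphen and the space
      :Hyphen+space
      ^!InsertText ""
      ^!Goto BadBreak
      ; Leave this match unchanged
      :Skip
      ^!Goto BadBreak
      :Out
      ^!Info Finished!
      ^!Jump Doc_Start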
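
      The same "controlled replacement" loop can be sketched in Python. This is an illustration only, with assumptions: the regex is an
      ASCII approximation of the clip's pattern, and the `choose` callback (a hypothetical name) stands in for the clip's wizard dialog.
      It receives the broken word and returns "space", "both", or "skip".

```python
import re
from typing import Callable

# ASCII approximation of the clip's pattern (lookbehind instead of \K).
BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")

def fix_bad_breaks(text: str, choose: Callable[[str], str]) -> str:
    """Resolve each '- ' match according to the answer from `choose`."""
    def replace(match: re.Match) -> str:
        # Rebuild the broken word from its two halves for display.
        left = text[:match.start()].split()[-1]
        right = text[match.end():].split()[0]
        decision = choose(f"{left}- {right}")
        if decision == "space":   # remove the space, keep the hyphen
            return "-"
        if decision == "both":    # remove hyphen and space
            return ""
        return match.group(0)     # skip: leave the match unchanged
    return BAD_BREAK.sub(replace, text)

# Scripted answers for the five examples; interactively, `choose`
# could wrap input() instead.
answers = {
    "corr- elation": "both",
    "attorney-at- law": "space",
    "proof- reading": "space",
    "dis- tribution": "both",
    "Hewlett- Packard": "space",
}
sample = "\n".join(answers)
print(fix_bad_breaks(sample, lambda word: answers.get(word, "skip")))
```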

      In practice, it's more complicated, of course. There are bad breaks at CRNL like...


      also with space or without space etc.

      Nevertheless, it's an interesting job to find out the patterns matching types of OCR-mistakes as far as possible! Meanwhile, my clip
      is working with about 400 patterns correcting misspellings, bad breaks, changing abbreviations, the style of writing dates,
      currencies etc, and it's working fine.
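
      A long pattern list of this kind can be organized as an ordered rule table. The Python sketch below shows the idea only: the three
      rules are hypothetical stand-ins (the actual patterns are not part of this message), and the function name is made up.

```python
import re

# Hypothetical rules for illustration; each entry is (pattern, replacement)
# and the rules are applied in order.
RULES = [
    # Bad break inside a lowercase word: "dis- tribution"
    (re.compile(r"(?<=[a-z])- (?=[a-z])"), ""),
    # Date style: month/day/year to year-month-day
    (re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"), r"\3-\1-\2"),
    # Abbreviation expansion
    (re.compile(r"\bapprox\."), "approximately"),
]

def apply_rules(text: str) -> str:
    """Run every rule over the text, one after another."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(apply_rules("dis- tribution on 6/4/2013, approx. 400"))
```

      Keeping the rules in one table makes the order explicit, which matters when an earlier replacement changes what a later pattern sees.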

