23858 Re: [Clip] Replacing All Words at Once

  • flo.gehrke
    Jun 4, 2013
      --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
      >
      > I would start by getting a better OCR program. I use
      > OmniPage Pro (...)
      > I can tell you for certain that you will NEVER be able to use
      > NoteTab to 'automatically' fix all the errors lower grade
      > OCR software will make. It simply isn't predictable what the
      > errors will be (...)
      > For instance, nearly every word can be hyphenated in text, and not
      > always in the same fashion...

      I'm using OmniPage Pro too, but I still have to remove a lot of OCR mistakes.

      I agree with you and Axel that it's impossible to capture all mistakes as individual entries in a clip (or in a long list called by a clip). I've been cleaning scanned text with NT for many years now. In the long run, you will come across an unlimited number of mistakes that follow no rule.

      In my experience, the only way to resolve this problem is with RegEx patterns. For examples like those Roopakshi posted in his first message, such as...

      dis- tribution --> distribution

      I would use a pattern like...

      ^!Replace "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" >> "" WARS

      in order to remove any '- ' (hyphen followed by a space) between a lowercase letter and an uppercase or lowercase letter.
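
For anyone who wants to try the same idea outside NoteTab, here is a minimal Python sketch. It is only an approximation: Python's `re` module supports neither `\K` nor POSIX classes like `[[:lower:]]`, so I assume a lookbehind and plain ASCII ranges instead (the name `join_bad_breaks` is made up for this example, and the ranges would need extending for umlauts).

```python
import re

# Approximation of the NoteTab pattern above (an assumption, not NoteTab syntax):
# a lookbehind stands in for \K, and ASCII ranges stand in for the POSIX classes.
BAD_BREAK = re.compile(r"(?<=[a-z])-\x20(?=[A-Za-z])")

def join_bad_breaks(text: str) -> str:
    """Delete every 'hyphen + space' that sits between a lowercase letter and a letter."""
    return BAD_BREAK.sub("", text)
```

Calling `join_bad_breaks("dis- tribution")` gives `distribution`. A spaced dash as in `a - b` is left alone, because the character before the hyphen is not a lowercase letter.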

      I think, however, there's still another problem with hyphenation and bad breaks. Look at words like...

      corr- elation
      attorney-at- law
      proof- reading
      dis- tribution
      Hewlett- Packard

      In lines 2, 3, and 5, we have to remove only the space, not the hyphen. AFAIK, there are no linguistic rules that would let us define patterns matching every type of bad break, at least in German (maybe it's easier in English?). That's why I work with a kind of "controlled replacement". Test the following clip against that list and you'll see how it works:

      ^!Jump Doc_Start
      :BadBreak
      ^!SetWizardWidth 60
      ^!Find "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" RS
      ^!IfError Out
      ^!Goto ^?{Remove bad breaks:==_Space only|Hyphen+space|Skip}
      :Space only
      ^!InsertText "-"
      ^!Goto BadBreak
      :Hyphen+space
      ^!InsertText ""
      ^!Goto BadBreak
      :Skip
      ^!Goto BadBreak
      :Out
      ^!Info Finished!
      ^!Jump Doc_Start
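
The same "controlled replacement" idea can be sketched in Python, for illustration only. Here the wizard prompt is replaced by a hypothetical `decide` callback that receives the broken word and answers `"space"`, `"both"`, or `"skip"`; the function name, the callback, and the ASCII-only pattern are all assumptions of this sketch and not part of the clip.

```python
import re
from typing import Callable

# ASCII-only stand-in for the clip's Find pattern (assumption: no umlauts).
BAD_BREAK = re.compile(r"(?<=[a-z])-\x20(?=[A-Za-z])")

def fix_breaks(text: str, decide: Callable[[str], str]) -> str:
    """Visit each candidate and let `decide` choose 'space', 'both' or 'skip'."""
    def repl(m: re.Match) -> str:
        # Rebuild the broken word from the non-space runs on both sides.
        left = re.search(r"\S*$", text[:m.start()]).group(0)
        right = re.match(r"\S*", text[m.end():]).group(0)
        choice = decide(f"{left}- {right}")
        if choice == "space":
            return "-"       # remove the space only, keep the hyphen
        if choice == "both":
            return ""        # remove hyphen and space
        return m.group(0)    # skip: leave the text unchanged
    return BAD_BREAK.sub(repl, text)
```

In an interactive run, `decide` could simply wrap `input()`. For a batch run it could look the word up in a dictionary of earlier decisions and fall back to `"skip"`, which keeps unknown cases untouched.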

      In practice, it's more complicated, of course. There are also bad breaks at line breaks (CRLF), like...

      dis-
      tribution

      with or without a space, etc.
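
As a rough illustration of how such line-break cases might be located, here is a small Python sketch. It only lists the candidates so that each one can still be decided by hand. The pattern, the function name `line_break_candidates`, and the ASCII-only letter ranges are assumptions of this example, not something taken from my clip.

```python
import re

# Word fragment ending in a lowercase letter, a hyphen, optional blanks,
# a line break (CRLF or LF), optional blanks, then the next word fragment.
BROKEN_LINE = re.compile(r"([A-Za-z-]*[a-z])-[ \t]*\r?\n[ \t]*([A-Za-z]+)")

def line_break_candidates(text: str) -> list[tuple[str, str]]:
    """Return (left, right) fragments of words hyphenated across a line break."""
    return BROKEN_LINE.findall(text)
```

For the text `dis-` followed by `tribution` on the next line, this returns the single pair `("dis", "tribution")`, whether or not a space follows the hyphen.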

      Nevertheless, it's an interesting job to work out patterns that match as many types of OCR mistakes as possible! By now, my clip uses about 400 patterns that correct misspellings and bad breaks and change abbreviations, the style of writing dates, currencies, etc., and it works fine.

      Regards,
      Flo