Re: [Clip] Replacing All Words at Once

  • flo.gehrke
    Message 1 of 10 , Jun 4, 2013
      --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
      >
      > I would start by getting a better OCR program. I use
      > OmniPage Pro (...)
      > I can tell you for certain that you will NEVER be able to use
      > NoteTab to 'automatically' fix all the errors lower grade
      > OCR software will make. It simply isn't predictable what the
      > errors will be (...)
      > For instance, nearly every word can be hyphenated in text, and not
      > always in the same fashion...

      I'm using Omnipage Pro but I still have to remove a lot of OCR-mistakes.

      I agree with you and Axel that it's impossible to capture all mistakes as particular entries in a clip (or in a long list that would be called by a clip). I've been cleaning scanned text with NT for many years now. In the long run, you will come across an unlimited number of mistakes that follow no rule.

      In my experience, the only way to resolve this problem is with RegEx patterns. Regarding those examples Roopakshi posted with his first message like...

      dis- tribution --> distribution

      I would use a pattern like...

      ^!Replace "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" >> "" WARS

      in order to remove any '- ' (hyphen followed by a space) between a lowercase letter and an uppercase or lowercase letter.
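For readers who want to try the idea outside NoteTab, here is a minimal Python sketch of the same replacement. It is an approximation, not the clip itself: Python's `re` module has neither `\K` nor POSIX classes, so the lowercase letter in front of the hyphen is expressed as a lookbehind and the classes as ASCII ranges (accented or German letters are therefore not covered). The names `BAD_BREAK` and `join_bad_breaks` are made up for this example.

```python
import re

# Rough equivalent of "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])":
# a hyphen plus one space, preceded by a lowercase letter and followed by a letter.
BAD_BREAK = re.compile(r"(?<=[a-z])-\x20(?=[A-Za-z])")

def join_bad_breaks(text: str) -> str:
    """Delete every 'hyphen + space' that sits between two letters."""
    return BAD_BREAK.sub("", text)

print(join_bad_breaks("dis- tribution"))    # distribution
print(join_bad_breaks("Hewlett- Packard"))  # HewlettPackard
```

The second call shows the limit of a blind replace: the hyphen in a genuinely hyphenated name is removed as well.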

      I think, however, there's still another problem with hyphenation and bad breaks. Look at words like...

      corr- elation
      attorney-at- law
      proof- reading
      dis- tribution
      Hewlett- Packard

      In lines #2, #3, and #5, we have to remove only the space, not the hyphen. AFAIK, there are no linguistic rules that would let us define patterns matching every type of bad break -- at least in German (maybe it's easier in English?). That's why I'm working with a kind of "controlled replacement". Test the following clip against that list and you'll see how it works:

      ^!Jump Doc_Start
      :BadBreak
      ^!SetWizardWidth 60
      ^!Find "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" RS
      ^!IfError Out
      ^!Goto ^?{Remove bad breaks:==_Space only|Hyphen+space|Skip}
      :Space only
      ^!InsertText "-"
      ^!Goto BadBreak
      :Hyphen+space
      ^!InsertText ""
      ^!Goto BadBreak
      :Skip
      ^!Goto BadBreak
      :Out
      ^!Info Finished!
      ^!Jump Doc_Start
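The same "controlled replacement" loop can be sketched in Python, assuming the wizard prompt is replaced by a callback. Everything named here (`fix_breaks`, `decide`, `KEEP_HYPHEN`, `demo_decide`) is hypothetical and only stands in for the human choice the clip asks for; the pattern is the ASCII approximation of the one used in the clip.

```python
import re
from typing import Callable

BAD_BREAK = re.compile(r"(?<=[a-z])-\x20(?=[A-Za-z])")

def fix_breaks(text: str, decide: Callable[[str], str]) -> str:
    """Ask `decide` about each match; it answers 'space', 'both' or 'skip'."""
    def repl(m: re.Match) -> str:
        # Rebuild the broken word around the match so the decision has context.
        left = re.search(r"\S*$", text[:m.start()]).group()
        right = re.match(r"\S*", text[m.end():]).group()
        choice = decide(left + m.group() + right)
        if choice == "space":   # drop the space, keep the hyphen
            return "-"
        if choice == "both":    # drop hyphen and space
            return ""
        return m.group()        # skip: leave the text untouched
    return BAD_BREAK.sub(repl, text)

# Stand-in for the person at the keyboard: words that keep their hyphen.
KEEP_HYPHEN = {"attorney-at- law", "proof- reading", "Hewlett- Packard"}

def demo_decide(candidate: str) -> str:
    return "space" if candidate in KEEP_HYPHEN else "both"

sample = "corr- elation\nattorney-at- law\nproof- reading\ndis- tribution\nHewlett- Packard"
print(fix_breaks(sample, demo_decide))
```

With the stand-in decisions above, the five test words come out as correlation, attorney-at-law, proof-reading, distribution and Hewlett-Packard.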

      In practice, it's more complicated, of course. There are bad breaks at line ends (CRLF) like...

      dis-
      tribution

      with or without a space around the line break, etc.
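A possible pattern for the line-break case, again as a Python sketch rather than a NoteTab clip. It assumes that a break should be joined only when the next line starts with a lowercase letter; that rule is my own simplification, and it would still wrongly join a genuine hyphen such as "proof-" / "reading". The names `LINE_BREAK` and `join_line_breaks` are invented for the example.

```python
import re

# Hyphen at the end of a line, optional blanks on either side of the
# line break (CRLF or LF), and a lowercase letter starting the next line.
LINE_BREAK = re.compile(r"(?<=[a-z])-[ \t]*\r?\n[ \t]*(?=[a-z])")

def join_line_breaks(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return LINE_BREAK.sub("", text)

print(join_line_breaks("dis-\ntribution"))       # distribution
print(join_line_breaks("dis- \r\n tribution"))   # distribution
```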

      Nevertheless, it's an interesting job to find patterns that match as many types of OCR mistakes as possible! Meanwhile, my clip is working with about 400 patterns that correct misspellings and bad breaks and change abbreviations, the style of writing dates, currencies, etc., and it's working fine.

      Regards,
      Flo
    • John Shotsky
      Message 2 of 10 , Jun 4, 2013
        In my years of working with hyphenated words, I came to some conclusions that have helped me. Originally, I removed all hyphens that
        were not associated with known hyphenated entities. That failed, because not all of the removed hyphens should have been removed, and it
        became impossible to find the words where a hyphen was incorrectly removed, such as strawberryvanilla (strawberry-vanilla). So I
        had to reverse the logic and make every hyphen STAY unless a regex rule could remove it. It is much easier to spot a word
        with a hyphen that doesn't belong than to identify a word where one is missing, since there is an unending number of such words.

        The end result is that I have written one regex that looks at prefixes and removes the hyphen where one applies, another that does
        the same for suffixes, and a third set that looks at both lines to decide whether the hyphen should be removed.

        I have found no single rule applying to letter case, as product names can be capitalized and hyphenated or not - it is all
        subjective. As a result, my users regularly send me words with hyphens that aren't correct, and I add their word to one of the three
        sections. I don't get many anymore, but below is a copy of just the prefix code (all one line).
        ===
        ^!Replace
        "\b(abso|ac|accom|accompa|accus?|ad|ag|alco|alu|ama|ameri?|any|ap|apri|appe|approx?i?|aro?u?|arti|assem|atmo|avo?|bal?|bam|bev?|beau
        |bis|black|bot|broc|bub|bul|bur|but|cab|cali?|canta|cara|carb?o?|carbohy|card|cas|caul?i?|cav|cele|cen?|cer|ch?i|chila|choc?|cinn?a?
        |cit|cle|cof|col|com?n?|combin?|compli|confec|consis|cori|cot|cov|cre|cri|cro|crys|cui?l?r?|cus|defi|des?|devel?|diago|dif|diffi|dig
        es|dis?|disap|discol|discrimi|dol|dou|driz|eas|effi|elec|epiph|esp?e?|evapo?|every|excel|expec|ext?e?|experi?|fa|fen|fif|fla|fra|fre
        |frit?|gen|gi?ar|gaz|geo|ger|gnoc|grad?|guac?a?|granu?|haba|Hal|han|hap|heri|holi|homog|hon|hori|how|hum|hun|hydroge|illus|im|immed?
        |incorpo|indi|indiv|inex|inter|irre|jala|ji|juli|ker|kiel|kiwi|ko|lem|Les|lib|lico|lla|lun|manufac|mara|marga|mari|mathe|mayo?n?|mea
        ?|mem|mis?|micro|mod|moz|muf|mush?|nar|nat|nec|nei|neigh|noo|occ?a?|ol|opin|oppor|orig|out|over|pap|par|pas|pat|pep|phar|pista|piz?|
        pome|pos?|portabel|possi|pow|p?re|prepa|pres|prob|proc?|provo|pud|pun|pur|quesadil|ra|rasp|rea|reci?|recog|recol|recom|recon|refrig|
        [Rr]eg|rel[ai]|repre|resi|restau|ri|ridicu|saf|sal|sand|sauer|sea|sec|sei|self|sepa?|ser|seve?r?|shal|short?|sim|siz|so|some|spat|sp
        ec?|specifi?|spi|spo|sri|stan|stu|sub|sug|supe?|sur|sym|syn|tab?|table|tan|tar|tech|tem|therm?|thor|thriv|tives?|tol|toma|tor?|tras?
        |trans|tri|trun|tur|typ|un|uncom|undis|unwel|[Vv]alen|vanil|veg?e?|versa|vinai|vir|vita|with|water|week|won|Worces|yo|zuc)\K\{-\}"
        >> "" AIRSW
        ^!IfError Next Else Skip_-1
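To illustrate the prefix idea on a small scale, here is a Python sketch using only four of the prefixes from the list above. Note one deliberate difference: the clip's `\{-\}` matches a literal {-} marker, whereas this sketch assumes the break appears as a plain hyphen followed by whitespace. `PREFIXES`, `PREFIX_BREAK` and `join_known_prefixes` are names made up for the example.

```python
import re

# Tiny subset of the prefix list; the real clip carries several hundred.
PREFIXES = ["dis", "inter", "micro", "trans"]

# Known prefix at a word boundary, then hyphen + whitespace, then a lowercase letter.
PREFIX_BREAK = re.compile(
    r"\b(" + "|".join(PREFIXES) + r")-\s+(?=[a-z])",
    re.IGNORECASE,
)

def join_known_prefixes(text: str) -> str:
    """Remove the break only after a listed prefix; every other hyphen stays."""
    return PREFIX_BREAK.sub(r"\1", text)

print(join_known_prefixes("dis- tribution"))       # distribution
print(join_known_prefixes("strawberry- vanilla"))  # strawberry- vanilla
```

The second call shows the "every hyphen stays unless a rule removes it" logic: an unlisted word is left alone, so a wrong hyphen remains visible instead of silently disappearing.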

        If anyone wants my other two sets, please let me know and I can send them privately. There is actually a lot more to my hyphen clip set,
        because many words that should have hyphens aren't hyphenated in the source, and vice versa. So my code then goes into specific
        situations and words to either remove or add hyphens. For example, 'one-at-a-time' cannot be handled by the above methodology, but
        can only be handled by a specific clip designed to detect that set of letters, with or without (all of) the hyphens, with or without
        the spaces, and place the hyphens IF that is the correct thing to do. Thus, it may differ from the source, but it will be
        grammatically correct anyway.
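A word-specific rule of that kind might look like the following Python sketch. It only shows the detection and normalization step; deciding whether the hyphenated form is grammatically right in a given sentence is not attempted here, and `ONE_AT_A_TIME` and `normalize_one_at_a_time` are hypothetical names.

```python
import re

# The four words in order, separated by any mix of hyphens and whitespace
# (including none at all).
ONE_AT_A_TIME = re.compile(r"\bone[-\s]*at[-\s]*a[-\s]*time\b", re.IGNORECASE)

def normalize_one_at_a_time(text: str) -> str:
    """Rewrite every spacing/hyphen variant as 'one-at-a-time'."""
    return ONE_AT_A_TIME.sub("one-at-a-time", text)

print(normalize_one_at_a_time("add eggs one at-a- time"))  # add eggs one-at-a-time
```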

        Regards,
        John
        RecipeTools Web Site: http://recipetools.gotdns.com/
        John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

      • Axel Berger
        Message 3 of 10 , Jun 4, 2013
          John Shotsky wrote:
          > OmniPage is the most expensive program I own,

          I use Abbyy Fine Reader Prof. version 6 and the results depend strongly
          on the quality of the scan. On good scans they're nearly perfect.

          > The point I'm trying to make here is to fix the errors while
          > in the OCR tool,

          I agree, if only because it highlights all doubtful places in the text
          view and helps you deal with them quickly and efficiently. Some errors
          go undetected and unmarked, but not many.

          Axel
        • John Shotsky
          Message 4 of 10 , Jun 4, 2013
            Try scanning and OCR on a few recipes out of a cookbook. It's like they never heard of fractions. That's the biggest problem with
            most of them - they are designed for office documents, not technical documents. It's even OmniPage's weakest point, but it is FAR
            ahead of whatever is in second. Scan a few cookbooks, and you'll be pulling your hair out. (And looking for a new OCR product.) :-)
            Regards,
            John
            RecipeTools Web Site: http://recipetools.gotdns.com/
            John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/
