
RE: [Clip] Replacing All Words at Once

  • John Shotsky
    Message 1 of 10 , Jun 4, 2013
      In my years of working with hyphenated words, I came to some conclusions that have helped me. Originally, I removed all hyphens that
      were not associated with known hyphenated entities. That failed, because not every hyphen that was removed should have been, and it
      became impossible to find the words where a hyphen was incorrectly removed, such as in strawberryvanilla (strawberry-vanilla). So I
      had to reverse the logic and make every hyphen STAY unless a regex could explicitly remove it. It is much easier to spot a word
      with a hyphen that doesn't belong than to identify a word where one is missing, since there is an unending number of such words.

      The end result is that I have written one regex that looks at prefixes and removes the hyphen after those that apply, another that
      looks at suffixes and removes the hyphen before those, and then another set that looks on both lines to see if the hyphen should be removed.

      I have found no single rule applying to letter case, as product names can be capitalized and hyphenated or not - it is all
      subjective. As a result, my users regularly send me words with hyphens that aren't correct, and I add each word to one of the three
      sections. I don't get many anymore, but below is a copy of just the prefix code (all one line).
      ===
      ^!Replace
      "\b(abso|ac|accom|accompa|accus?|ad|ag|alco|alu|ama|ameri?|any|ap|apri|appe|approx?i?|aro?u?|arti|assem|atmo|avo?|bal?|bam|bev?|beau
      |bis|black|bot|broc|bub|bul|bur|but|cab|cali?|canta|cara|carb?o?|carbohy|card|cas|caul?i?|cav|cele|cen?|cer|ch?i|chila|choc?|cinn?a?
      |cit|cle|cof|col|com?n?|combin?|compli|confec|consis|cori|cot|cov|cre|cri|cro|crys|cui?l?r?|cus|defi|des?|devel?|diago|dif|diffi|dig
      es|dis?|disap|discol|discrimi|dol|dou|driz|eas|effi|elec|epiph|esp?e?|evapo?|every|excel|expec|ext?e?|experi?|fa|fen|fif|fla|fra|fre
      |frit?|gen|gi?ar|gaz|geo|ger|gnoc|grad?|guac?a?|granu?|haba|Hal|han|hap|heri|holi|homog|hon|hori|how|hum|hun|hydroge|illus|im|immed?
      |incorpo|indi|indiv|inex|inter|irre|jala|ji|juli|ker|kiel|kiwi|ko|lem|Les|lib|lico|lla|lun|manufac|mara|marga|mari|mathe|mayo?n?|mea
      ?|mem|mis?|micro|mod|moz|muf|mush?|nar|nat|nec|nei|neigh|noo|occ?a?|ol|opin|oppor|orig|out|over|pap|par|pas|pat|pep|phar|pista|piz?|
      pome|pos?|portabel|possi|pow|p?re|prepa|pres|prob|proc?|provo|pud|pun|pur|quesadil|ra|rasp|rea|reci?|recog|recol|recom|recon|refrig|
      [Rr]eg|rel[ai]|repre|resi|restau|ri|ridicu|saf|sal|sand|sauer|sea|sec|sei|self|sepa?|ser|seve?r?|shal|short?|sim|siz|so|some|spat|sp
      ec?|specifi?|spi|spo|sri|stan|stu|sub|sug|supe?|sur|sym|syn|tab?|table|tan|tar|tech|tem|therm?|thor|thriv|tives?|tol|toma|tor?|tras?
      |trans|tri|trun|tur|typ|un|uncom|undis|unwel|[Vv]alen|vanil|veg?e?|versa|vinai|vir|vita|with|water|week|won|Worces|yo|zuc)\K\{-\}"
      >> "" AIRSW
      ^!IfError Next Else Skip_-1
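
For readers who want to try the prefix idea outside NoteTab, here is a minimal Python sketch. It is not the clip above: the prefix list is a small hand-picked subset, it assumes the hyphen is a plain `-` rather than the `\{-\}` sequence the clip matches, and it uses a capture group because Python's `re` module has no `\K`.

```python
import re

# Illustrative subset only; the clip above uses a much longer list.
PREFIXES = ["choc", "dis", "refrig", "toma", "vanil"]

# \b anchors the prefix at the start of a word; the hyphen must follow it directly.
PREFIX_HYPHEN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in PREFIXES) + r")-",
    re.IGNORECASE,
)

def remove_prefix_hyphens(text):
    """Drop the hyphen after a listed prefix and keep every other hyphen."""
    return PREFIX_HYPHEN.sub(r"\1", text)

print(remove_prefix_hyphens("choc-olate, toma-to and strawberry-vanilla"))
```

Because "strawberry" is not a listed prefix, the hyphen in strawberry-vanilla stays, which is the "every hyphen stays unless a rule removes it" logic described earlier.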

      If anyone wants my other two sets, please let me know and I can send them privately. There is actually a lot more to my hyphen clip set,
      because many words that should have hyphens aren't hyphenated in the source, and vice versa. So my code then goes into specific
      situations and words to either remove or add hyphens. For example, 'one-at-a-time' cannot be handled by the above methodology, but
      can only be handled by a specific clip designed to detect that set of letters, with or without (all of) the hyphens, with or without
      the spaces, and place the hyphens IF that is the correct thing to do. Thus, it may differ from the source, but it will be
      grammatically correct anyway.
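
A word-specific rule of the kind described for 'one-at-a-time' might look like the Python sketch below. The function name and the decision rule are assumptions made for illustration: the plain spaced phrase "one at a time" is left alone, and every other variant (run together, or with only some of the hyphens) is normalized. The real clip's test for "IF that is the correct thing to do" is not shown in the message, so this stands in for it.

```python
import re

# Any mix of hyphens and spaces (or none) between the four parts.
VARIANTS = re.compile(r"\bone[- ]*at[- ]*a[- ]*time\b", re.IGNORECASE)
# The ordinary phrase with spaces only; assumed to be correct as written.
PLAIN_PHRASE = re.compile(r"one +at +a +time", re.IGNORECASE)

def normalize_one_at_a_time(text):
    def repl(match):
        found = match.group(0)
        if PLAIN_PHRASE.fullmatch(found):
            return found  # leave the plain phrase untouched
        # Keep the original first letter so capitalization survives.
        return found[0] + "ne-at-a-time"
    return VARIANTS.sub(repl, text)

print(normalize_one_at_a_time("a one-at a-time feed"))
```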

      Regards,
      John
      RecipeTools Web Site: http://recipetools.gotdns.com/
      John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of flo.gehrke
      Sent: Tuesday, June 04, 2013 07:27
      To: ntb-clips@yahoogroups.com
      Subject: Re: [Clip] Replacing All Words at Once


      --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
      >
      > I would start by getting a better OCR program. I use
      > OmniPage Pro (...)
      > I can tell you for certain that you will NEVER be able to use
      > NoteTab to 'automatically' fix all the errors lower grade
      > OCR software will make. It simply isn't predictable what the
      > errors will be (...)
      > For instance, nearly every word can be hyphenated in text, and not
      > always in the same fashion...

      I'm using OmniPage Pro, but I still have to remove a lot of OCR mistakes.

      I agree with you and Axel that it's impossible to capture all mistakes as particular entries in a clip (or in a long list that would be
      called by a clip). I've been cleaning scanned text with NT for many years now. In the long run, you will come across an unlimited
      number of mistakes that follow no rule.

      In my experience, the only way to resolve this problem is with RegEx patterns. Regarding those examples Roopakshi posted with his
      first message like...

      dis- tribution --> distribution

      I would use a pattern like...

      ^!Replace "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" >> "" WARS

      in order to remove any '- ' (hyphen followed by a space) between a lowercase letter and an uppercase or lowercase letter.
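
The same replacement can be tried in Python with the rough equivalent below. It is only an approximation: Python's `re` supports neither `\K` nor POSIX classes such as `[[:lower:]]`, so the sketch uses a lookbehind and ASCII ranges, which means accented letters and umlauts are not covered.

```python
import re

# ASCII-only stand-in for [[:lower:]]\K-\x20(?=[[:upper:][:lower:]])
BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")

def join_bad_breaks(text):
    """Remove every 'hyphen + space' that sits between two letters."""
    return BAD_BREAK.sub("", text)

print(join_bad_breaks("dis- tribution"))
```

Note that this removes the hyphen unconditionally, so a genuine compound that was split the same way loses its hyphen as well.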

      I think, however, there's still another problem with hyphenation and bad breaks. Look at words like...

      corr- elation
      attorney-at- law
      proof- reading
      dis- tribution
      Hewlett- Packard

      In lines 2, 3, and 5, we have to remove only the space, not the hyphen. AFAIK, there are no linguistic rules that could help us
      define patterns that match every type of bad break -- at least in German (maybe it's easier in English?).
      That's why I'm working with a kind of "controlled replacement". Test the following clip against that list and you'll see how it
      works:

      ^!Jump Doc_Start
      :BadBreak
      ^!SetWizardWidth 60
      ^!Find "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" RS
      ^!IfError Out
      ^!Goto ^?{Remove bad breaks:==_Space only|Hyphen+space|Skip}
      :Space only
      ^!InsertText "-"
      ^!Goto BadBreak
      :Hyphen+space
      ^!InsertText ""
      ^!Goto BadBreak
      :Skip
      ^!Goto BadBreak
      :Out
      ^!Info Finished!
      ^!Jump Doc_Start
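
The "controlled replacement" idea can also be sketched in Python. Instead of the wizard prompt, this version takes a decision function that returns 'space', 'both', or anything else for skip. The names `fix_bad_breaks`, `decide_from_list`, and `KNOWN_COMPOUNDS` are hypothetical, and the list-based decision is just one possible stand-in for the human choice made in the clip.

```python
import re

BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")

# Hypothetical list of compounds that keep their hyphen.
KNOWN_COMPOUNDS = {"attorney-at-law", "proof-reading", "Hewlett-Packard"}

def decide_from_list(left, right):
    """Return 'space' to drop only the space, 'both' to drop hyphen and space."""
    return "space" if f"{left}-{right}" in KNOWN_COMPOUNDS else "both"

def fix_bad_breaks(text, decide=decide_from_list):
    def repl(match):
        # Word fragments on either side of the "- " that was found.
        left = re.search(r"[A-Za-z-]+$", text[:match.start()]).group(0)
        right = re.match(r"[A-Za-z]+", text[match.end():]).group(0)
        choice = decide(left, right)
        if choice == "space":
            return "-"
        if choice == "both":
            return ""
        return match.group(0)  # any other answer means skip
    return BAD_BREAK.sub(repl, text)

sample = "corr- elation\nattorney-at- law\nproof- reading\ndis- tribution\nHewlett- Packard"
print(fix_bad_breaks(sample))
```

Passing a function that asks the user (for example via `input()`) would bring it closer to the interactive clip.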

      In practice, it's more complicated, of course. There are bad breaks at line ends (CRLF) like...

      dis-
      tribution

      also with space or without space etc.
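
For the simplest line-end case, a Python sketch could look like this. It assumes ASCII letters, accepts optional spaces or tabs on either side of a CRLF or LF, and only joins when a lowercase letter follows, so that breaks such as Hewlett- / Packard are left for a controlled step. Like any unconditional rule, it will still wrongly join compounds whose second part is lowercase.

```python
import re

# Hyphen at the end of a line, optional blanks, line break, optional blanks.
LINE_END_BREAK = re.compile(r"(?<=[a-z])-[ \t]*\r?\n[ \t]*(?=[a-z])")

def join_line_end_breaks(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    return LINE_END_BREAK.sub("", text)

print(join_line_end_breaks("dis-\r\ntribution"))
```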

      Nevertheless, it's an interesting job to work out patterns that match as many types of OCR mistakes as possible! Meanwhile, my clip
      is working with about 400 patterns that correct misspellings and bad breaks and change abbreviations, the style of writing dates,
      currencies, etc., and it's working fine.

      Regards,
      Flo



    • Axel Berger
      Message 2 of 10 , Jun 4, 2013
        John Shotsky wrote:
        > OmniPage is the most expensive program I own,

        I use Abbyy Fine Reader Prof. version 6 and the results depend strongly
        on the quality of the scan. On good scans they're nearly perfect.

        > The point I'm trying to make here is to fix the errors while
        > in the OCR tool,

        I agree, if only because it highlights all doubtful places in the text
        view and helps you deal with them quickly and efficiently. Some errors
        go undetected and unmarked, but not many.

        Axel
      • John Shotsky
        Message 3 of 10 , Jun 4, 2013
          Try scanning and running OCR on a few recipes out of a cookbook. It's like they never heard of fractions. That's the biggest problem with
          most of them - they are designed for office documents, not technical documents. It's even OmniPage's weakest point, but it is FAR
          ahead of whatever is in second place. Scan a few cookbooks, and you'll be pulling your hair out. (And looking for a new OCR product.) :-)
          Regards,
          John
          RecipeTools Web Site: http://recipetools.gotdns.com/
          John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

          From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of Axel Berger
          Sent: Tuesday, June 04, 2013 09:17
          To: ntb-clips@yahoogroups.com
          Subject: Re: [Clip] Replacing All Words at Once


          John Shotsky wrote:
          > OmniPage is the most expensive program I own,

          I use Abbyy Fine Reader Prof. version 6 and the results depend strongly
          on the quality of the scan. On good scans they're nearly perfect.

          > The point I'm trying to make here is to fix the errors while
          > in the OCR tool,

          I agree, if only because it highlights all doubtful places in the text
          view and helps you deal with them quickly and efficiently. Some errors
          go undetected and unmarked, but not many.

          Axel


