
Re: Re: [Clip] Replacing All Words at Once

  • Adrian Worsfold
    Message 1 of 10 , Jun 4, 2013
      Hello Roopakshi Pathania

      I think the warning about variation in OCR-scanned text is correct, because each time you are going to have to add to the list of replacements. And what if "modem" really is "modem" and not "modern"?

      My own similar list of replacements, which I keep adding to, works because the text I deal with is regular. I'm sure others will see this clip as cumbersome, and I don't use regex, with its greater flexibility (unless it is offered and explained).

      H="Jobcentre excess remove"
      ^!Jump Doc_Start
      :LOOP
      ^!Replace "=3&setype=2&pg=4&AVSDM=" >> "" WAS
      ^!Replace "&pp=25&" >> "" WAS
      ^!Replace "pg=1" >> "" WAS
      ^!Replace "pg=2" >> "" WAS
      ^!Replace "pg=3" >> "" WAS
      ^!Replace "pg=4" >> "" WAS
      ^!Replace "pg=5" >> "" WAS
      ^!Replace "pg=6" >> "" WAS
      ^!Replace "pg=7" >> "" WAS
      ^!Replace "&where=HU7+4UD&sort=rv.dt.di&rad=20&rad_units=miles" >> "" WAS
      ^!Replace "&re=134" >> "" WAS
      ^!Replace "&re=3" >> "" WAS
      ^!Replace "&AVSDM=" >> "" WAS
      ^!IfError END
      ^!GoTo LOOP
      :END
      ^!Jump Doc_End

      (The loop is probably unnecessary, the pg= lines could be handled with a counter, and there looks to be some repetition, but each addition to the clip responds to what was left over in pages searched later.)

      The result is that a page URL is reduced to its essentials (by which the page can still be found again).
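
For readers outside NoteTab, the remark about collapsing the pg= lines can be sketched in Python. This is only an illustration, not the clip itself: the function name `strip_search_params` is hypothetical, the literal fragments are copied from the clip above, and the `pg=\d+` pattern is an assumption that any page number may appear (the clip itself lists only pg=1 to pg=7).

```python
import re

# Literal fragments copied from the clip above, kept in the clip's order.
LITERALS_BEFORE = ["=3&setype=2&pg=4&AVSDM=", "&pp=25&"]
LITERALS_AFTER = [
    "&where=HU7+4UD&sort=rv.dt.di&rad=20&rad_units=miles",
    "&re=134",
    "&re=3",
    "&AVSDM=",
]

# Assumption: one pattern stands in for the seven pg=1 ... pg=7 lines.
PAGE_PARAM = re.compile(r"pg=\d+")


def strip_search_params(text: str) -> str:
    """Remove the query-string fragments that the clip deletes one by one."""
    for fragment in LITERALS_BEFORE:
        text = text.replace(fragment, "")
    text = PAGE_PARAM.sub("", text)
    for fragment in LITERALS_AFTER:
        text = text.replace(fragment, "")
    return text
```

A single pass is enough here because each fragment is removed wherever it occurs, which is consistent with the suspicion that the loop in the clip is unnecessary.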




      Adrian Worsfold

      http://www.pluralist.co.uk
      http://pluralistspeaks.blogspot.com
      pluralist@...
      04-06-2013
      ----- Received the following content -----
      From: Roopakshi Pathania
      Receiver: ntb-clips
      Time: 2013-06-04, 08:12:36
      Subject: Re: [Clip] Replacing All Words at Once


    • flo.gehrke
      Message 2 of 10 , Jun 4, 2013
        --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
        >
        > I would start by getting a better OCR program. I use
        > OmniPage Pro (...)
        > I can tell you for certain that you will NEVER be able to use
        > NoteTab to 'automatically' fix all the errors lower grade
        > OCR software will make. It simply isn't predictable what the
        > errors will be (...)
        > For instance, nearly every word can be hyphenated in text, and not
        > always in the same fashion...

        I'm using Omnipage Pro but I still have to remove a lot of OCR-mistakes.

        I agree with you and Axel that it's impossible to capture all mistakes as individual entries in a clip (or in a long list called by a clip). I've been cleaning scanned text with NT for many years now. In the long run, you will come across an unlimited number of mistakes that follow no rule.

        In my experience, the only way to resolve this problem is with RegEx patterns. Regarding the examples Roopakshi posted in the first message, such as...

        dis- tribution --> distribution

        I would use a pattern like...

        ^!Replace "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" >> "" WARS

        in order to remove any '- ' (hyphen followed by a space) between a lowercase letter and an uppercase or lowercase letter.
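
As a rough cross-check outside NoteTab, the same idea can be sketched in Python. Python's `re` module supports neither `\K` nor POSIX classes such as `[[:lower:]]`, so this sketch swaps in a lookbehind and explicit ASCII ranges; the function name `join_bad_breaks` is hypothetical, and restricting the match to ASCII letters is an assumption.

```python
import re

# Python's re has no \K and no POSIX classes, so a lookbehind and explicit
# ASCII ranges stand in for them (assumption: ASCII letters only).
BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")


def join_bad_breaks(text: str) -> str:
    """Delete every '- ' that sits between a lowercase letter and a letter."""
    return BAD_BREAK.sub("", text)
```

Called on "dis- tribution" this yields "distribution", but "Hewlett- Packard" becomes "HewlettPackard", which is the over-removal that a blanket replacement cannot avoid.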

        I think, however, there's still another problem with hyphenation and bad breaks. Look at words like...

        corr- elation
        attorney-at- law
        proof- reading
        dis- tribution
        Hewlett- Packard

        In lines 2, 3, and 5, we have to remove only the space, not the hyphen. AFAIK, there are no linguistic rules that would let us define patterns matching every type of bad break -- at least in German (maybe it's easier in English?). That's why I work with a kind of "controlled replacement". Test the following clip against that list and you'll see how it works:

        ^!Jump Doc_Start
        :BadBreak
        ^!SetWizardWidth 60
        ^!Find "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" RS
        ^!IfError Out
        ^!Goto ^?{Remove bad breaks:==_Space only|Hyphen+space|Skip}
        :Space only
        ^!InsertText "-"
        ^!Goto BadBreak
        :Hyphen+space
        ^!InsertText ""
        ^!Goto BadBreak
        :Skip
        ^!Goto BadBreak
        :Out
        ^!Info Finished!
        ^!Jump Doc_Start
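
The "controlled replacement" idea in the clip above can also be sketched in Python, with a callback standing in for the wizard dialog. Everything named here is hypothetical (`fix_bad_breaks`, the `decide` callback, and the return values "space", "both", and "skip"), and the pattern is an ASCII-only approximation of the one in the clip.

```python
import re
from typing import Callable

# ASCII-only approximation of the clip's pattern (no \K or POSIX classes in re).
BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")


def fix_bad_breaks(text: str, decide: Callable[[str], str]) -> str:
    """Ask `decide` about each bad break and apply its answer.

    `decide` receives the broken word (e.g. "proof- reading") and returns
    "space" (drop the space, keep the hyphen), "both" (drop hyphen and
    space), or anything else to leave the break untouched.
    """

    def replace(match: re.Match) -> str:
        # Rebuild the broken word from the text on both sides of the match.
        left = re.search(r"\S*$", text[: match.start()]).group()
        right = re.match(r"\S*", text[match.end():]).group()
        choice = decide(f"{left}- {right}")
        if choice == "space":
            return "-"
        if choice == "both":
            return ""
        return match.group()

    return BAD_BREAK.sub(replace, text)
```

In interactive use, `decide` could prompt the user just as the wizard does; for a batch run it could look the word up in a list of known hyphenated terms.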

        In practice, it's more complicated, of course. There are also bad breaks at line ends (CR/LF), like...

        dis-
        tribution

        with or without a space before the line break, and so on.
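
A minimal Python sketch of the line-end case, under the same assumptions as before (ASCII letters only, hypothetical function name `join_line_end_breaks`): it removes a hyphen at the end of a line, optional spaces or tabs around the line break, and the break itself, but only when a lowercase letter follows.

```python
import re

# Hyphen after a lowercase letter, optional blanks, a CR/LF or LF line break,
# optional blanks, then a lowercase letter (assumption: ASCII letters only).
LINE_END_BREAK = re.compile(r"(?<=[a-z])-[ \t]*\r?\n[ \t]*(?=[a-z])")


def join_line_end_breaks(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line."""
    return LINE_END_BREAK.sub("", text)
```

Like the single-line pattern, this cannot tell a genuine hyphenated compound from a bad break, so the results still need checking.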

        Nevertheless, it's an interesting job to work out patterns that match as many types of OCR mistakes as possible! Meanwhile, my clip works with about 400 patterns that correct misspellings and bad breaks and change abbreviations, date formats, currencies, etc., and it's working fine.

        Regards,
        Flo
      • John Shotsky
        Message 3 of 10 , Jun 4, 2013
          In my years of working with hyphenated words, I came to some conclusions that have helped me. Originally, I removed all hyphens that
          were not associated with known hyphenated entities. That failed, because not every hyphen that was removed should have been, and it
          became impossible to find the words where a hyphen was incorrectly removed, such as in strawberryvanilla (strawberry-vanilla). So I
          had to reverse the logic and make every hyphen STAY unless it could be removed by a regex rule. It is much easier to spot a word
          with a hyphen that doesn't belong than to identify a word where one is missing, since there is an unending number of such words.

          The end result is that I have written one regex that looks at prefixes and removes the hyphen after those that apply, another that
          looks at suffixes and does the same, and then another set that looks at both lines to see whether the hyphen should be removed.

          I have found no single rule applying to letter case, as product names can be capitalized and hyphenated or not - it is all
          subjective. As a result, my users regularly send me words with hyphens that aren't correct, and I add each word to one of the three
          sections. I don't get many any more, but below is a copy of just the prefix code (all one line).
          ===
          ^!Replace "\b(abso|ac|accom|accompa|accus?|ad|ag|alco|alu|ama|ameri?|any|ap|apri|appe|approx?i?|aro?u?|arti|assem|atmo|avo?|bal?|bam|bev?|beau|bis|black|bot|broc|bub|bul|bur|but|cab|cali?|canta|cara|carb?o?|carbohy|card|cas|caul?i?|cav|cele|cen?|cer|ch?i|chila|choc?|cinn?a?|cit|cle|cof|col|com?n?|combin?|compli|confec|consis|cori|cot|cov|cre|cri|cro|crys|cui?l?r?|cus|defi|des?|devel?|diago|dif|diffi|diges|dis?|disap|discol|discrimi|dol|dou|driz|eas|effi|elec|epiph|esp?e?|evapo?|every|excel|expec|ext?e?|experi?|fa|fen|fif|fla|fra|fre|frit?|gen|gi?ar|gaz|geo|ger|gnoc|grad?|guac?a?|granu?|haba|Hal|han|hap|heri|holi|homog|hon|hori|how|hum|hun|hydroge|illus|im|immed?|incorpo|indi|indiv|inex|inter|irre|jala|ji|juli|ker|kiel|kiwi|ko|lem|Les|lib|lico|lla|lun|manufac|mara|marga|mari|mathe|mayo?n?|mea?|mem|mis?|micro|mod|moz|muf|mush?|nar|nat|nec|nei|neigh|noo|occ?a?|ol|opin|oppor|orig|out|over|pap|par|pas|pat|pep|phar|pista|piz?|pome|pos?|portabel|possi|pow|p?re|prepa|pres|prob|proc?|provo|pud|pun|pur|quesadil|ra|rasp|rea|reci?|recog|recol|recom|recon|refrig|[Rr]eg|rel[ai]|repre|resi|restau|ri|ridicu|saf|sal|sand|sauer|sea|sec|sei|self|sepa?|ser|seve?r?|shal|short?|sim|siz|so|some|spat|spec?|specifi?|spi|spo|sri|stan|stu|sub|sug|supe?|sur|sym|syn|tab?|table|tan|tar|tech|tem|therm?|thor|thriv|tives?|tol|toma|tor?|tras?|trans|tri|trun|tur|typ|un|uncom|undis|unwel|[Vv]alen|vanil|veg?e?|versa|vinai|vir|vita|with|water|week|won|Worces|yo|zuc)\K\{-\}" >> "" AIRSW
          ^!IfError Next Else Skip_-1
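
The prefix approach can be illustrated with a much smaller Python sketch. It is not the clip above: it uses only five prefixes taken from that list, the function name `remove_prefix_hyphens` is hypothetical, and it assumes the thing to remove is a plain hyphen, whereas the clip's pattern targets `\{-\}`.

```python
import re

# A handful of prefixes borrowed from the long list above, for illustration.
PREFIXES = r"abso|choc?|dis?|refrig|vanil"

# \b anchors the prefix at the start of a word; the hyphen is dropped and the
# prefix is put back via the \1 backreference (Python's re has no \K).
PREFIX_HYPHEN = re.compile(rf"\b({PREFIXES})-", re.IGNORECASE)


def remove_prefix_hyphens(text: str) -> str:
    """Remove a hyphen only when it directly follows a listed prefix."""
    return PREFIX_HYPHEN.sub(r"\1", text)
```

Because a hyphen survives unless its prefix is listed, "strawberry-vanilla" is left alone while "dis-tribution" is joined, which matches the keep-by-default logic described above.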

          If anyone wants my other two sets, please let me know and I can send them privately. There is actually a lot more to my hyphen clip
          set, because many words that should have hyphens aren't hyphenated in the source, and vice versa. So my code then goes into specific
          situations and words to either remove or add hyphens. For example, 'one-at-a-time' cannot be handled by the above method; it
          can only be handled by a specific clip designed to detect that sequence of letters, with or without (all of) the hyphens, with or
          without the spaces, and to place the hyphens IF that is the correct thing to do. Thus, the result may differ from the source, but it
          will be grammatically correct anyway.

          Regards,
          John
          RecipeTools Web Site: http://recipetools.gotdns.com/
          John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

        • Axel Berger
          Message 4 of 10 , Jun 4, 2013
            John Shotsky wrote:
            > OmniPage is the most expensive program I own,

            I use Abbyy Fine Reader Prof. version 6 and the results depend strongly
            on the quality of the scan. On good scans they're nearly perfect.

            > The point I'm trying to make here is to fix the errors while
            > in the OCR tool,

            I agree, if only because it highlights all doubtful places in the text
            view and helps you deal with them quickly and efficiently. Some errors
            go undetected and unmarked, but not many.

            Axel
          • John Shotsky
            Message 5 of 10 , Jun 4, 2013
              Try scanning and running OCR on a few recipes out of a cookbook. It's like the OCR programs have never heard of fractions. That's the
              biggest problem with most of them - they are designed for office documents, not technical documents. It's even OmniPage's weakest
              point, but it is FAR ahead of whatever is in second place. Scan a few cookbooks, and you'll be pulling your hair out. (And looking
              for a new OCR product.) :-)
              Regards,
              John
              RecipeTools Web Site: http://recipetools.gotdns.com/
              John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/
