
RE: [Clip] Replacing All Words at Once

  • John Shotsky
    Message 1 of 10 , Jun 4, 2013
      I would start by getting a better OCR program. I use OmniPage Pro, but the Pro version is not necessary. You can find older versions on eBay for not a lot of money. OmniPage has a training mode, so you can teach it to remember the corrections you make and it doesn't repeat the same errors. After a while, it runs close to 100% accuracy. I have scanned thousands of pages, and I can tell you for certain that you will NEVER be able to use NoteTab to 'automatically' fix all the errors that lower-grade OCR software will make. It simply isn't predictable what the errors will be. This stems from different fonts in the source, different background colors on the source material, an incorrect scanning angle, etc. OCR works by trying to interpret a set of dots of a certain height and width. If there is even a tiny smear on the source material, it will produce a different pattern of dots, and that may be enough to cause errors. To that you can add special characters, fractions and a whole host of other issues that can affect scanning and recognition accuracy.

      I have used pretty much every OCR program available, and find that OmniPage is FAR better than any of the others. One nice thing is that it can recognize PDF files too, so if you have PDF files that are not 'text' but 'graphic', you can still get the text out. It also works with smartphone cameras, so you can shoot photos of pages and have them recognized. OmniPage is the most expensive program I own, and I always get the updates, which come about every two years.

      The point I'm trying to make here is to fix the errors while you are in the OCR tool, because after that it is nearly impossible. After the fact, you must have your source material in view while reading the text in NoteTab to identify problems. It is probably 10 times harder and slower to do in NoteTab than in the OCR tool. For instance, nearly every word can be hyphenated in text, and not always in the same fashion. It may be hyphenated after a prefix, before a suffix, at any double letter or at any other syllable. You will only discover these one at a time; you can't write a whole dictionary of hyphenated words, because it would take gigabytes and the rest of your life. And some hyphens should not be removed, as in reduced-fat, while in other cases a word such as non-fat should not be hyphenated. So you can't always break at non-, nor at -fat. The point of all that is that you have to be looking at your paper and the result at the same time to know what needs to be changed.

      Regards,
      John
      RecipeTools Web Site: http://recipetools.gotdns.com/
      John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

      From: ntb-clips@yahoogroups.com [mailto:ntb-clips@yahoogroups.com] On Behalf Of Roopakshi Pathania
      Sent: Tuesday, June 04, 2013 00:13
      To: ntb-clips@yahoogroups.com
      Subject: Re: [Clip] Replacing All Words at Once



      Hi Adrian and others:

      I went through the clip-writing documentation. The script below works for me. Adrian, I hope this is what you meant. The idea is to execute the script whenever I open a new scanned document.
      Please suggest any improvements that can be made. Also, when using the "I" option, just as in the usual Replace dialog box, I cannot get a lowercase entry to replace both lowercase and capitalized occurrences while keeping the capitalization of the word that was found. Is there a way to do this other than entering the uppercase and lowercase words separately?

      Thanks
      Roopakshi

      ^!Replace "eaming" >> "earning" [TIWAS]

      ^!Replace "modem" >> "modern" [TIWAS]

      ^!Replace "tra- ditional" >> "traditional" [TIWAS]

      ^!Replace "cor- porate" >> "corporate" [TIWAS]
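The case question above can be illustrated outside NoteTab. What follows is a minimal Python sketch, not a NoteTab clip: it matches each entry case-insensitively as a whole word and then copies the capitalization of the text it found onto the replacement. The table holds the four entries from the clip; the function names `match_case` and `fix_ocr` are made up for this illustration.

```python
import re

# Correction table taken from the four ^!Replace lines in the clip above.
CORRECTIONS = {
    "eaming": "earning",
    "modem": "modern",
    "tra- ditional": "traditional",
    "cor- porate": "corporate",
}

def match_case(found: str, replacement: str) -> str:
    """Give the replacement the same capitalization style as the text found."""
    if found.isupper():
        return replacement.upper()
    if found[:1].isupper():
        return replacement[:1].upper() + replacement[1:]
    return replacement

def fix_ocr(text: str) -> str:
    """Apply every table entry as a whole-word, case-insensitive replacement."""
    for wrong, right in CORRECTIONS.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda m, r=right: match_case(m.group(0), r), text)
    return text
```

With this approach a single lowercase entry covers "modem", "Modem" and "MODEM". It still replaces every occurrence blindly, so a genuine "modem" in the text would be changed as well.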


      --- On Sat, 6/1/13, Adrian Worsfold <pluralist@...> wrote:

      From: Adrian Worsfold <pluralist@...>
      Subject: Re: [Clip] Replacing All Words at Once
      To: "ntb-clips" <ntb-clips@yahoogroups.com>
      Date: Saturday, June 1, 2013, 2:58 AM



      Hello Roopakshi Pathania

      All I'd do is build up lots of WAS condition finds and replaces as the scanning errors repeat themselves. It is a clip that grows as you find more repeatable errors.

      Adrian Worsfold

      http://www.pluralist.co.uk
      http://pluralistspeaks.blogspot.com
      pluralist@...
      31-05-2013
      ----- Received the following content -----
      From: Roopakshi Pathania
      Receiver: ntb-clips
      Time: 2013-05-31, 19:32:05
      Subject: [Clip] Replacing All Words at Once

    • Adrian Worsfold
      Message 2 of 10 , Jun 4, 2013
        Hello Roopakshi Pathania

        I think the warning about variation in OCR-scanned texts is correct, because each time you are going to have to add to the list of replacements. And what if "modem" really is "modem" and not a misread "modern"?

        I can keep and extend a similar list of replacements only because the text I deal with is regular. I'm sure others will see this clip as cumbersome, and I don't use regex, with its greater flexibility (unless it is offered and explained).

        H="Jobcentre excess remove"
        ^!Jump Doc_Start
        :LOOP
        ^!Replace "=3&setype=2&pg=4&AVSDM=" >> "" WAS
        ^!Replace "&pp=25&" >> "" WAS
        ^!Replace "pg=1" >> "" WAS
        ^!Replace "pg=2" >> "" WAS
        ^!Replace "pg=3" >> "" WAS
        ^!Replace "pg=4" >> "" WAS
        ^!Replace "pg=5" >> "" WAS
        ^!Replace "pg=6" >> "" WAS
        ^!Replace "pg=7" >> "" WAS
        ^!Replace "&where=HU7+4UD&sort=rv.dt.di&rad=20&rad_units=miles" >> "" WAS
        ^!Replace "&re=134" >> "" WAS
        ^!Replace "&re=3" >> "" WAS
        ^!Replace "&AVSDM=" >> "" WAS
        ^!IfError END
        ^!GoTo LOOP
        :END
        ^!Jump Doc_End

        (The loop is probably unnecessary, the pg= lines could be generated with a counter, and there looks to be some repetition, but additions to the clip respond to whatever was left over in pages searched later.)

        The result is that a page URL is reduced to its essentials (by which the page can still be found again).
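As an illustration of how the repeated pg=1 to pg=7 lines could collapse into one rule, here is a minimal Python sketch (not a NoteTab clip). It assumes the page, per-page and "re" values are always plain numbers, which goes beyond the literal strings in the clip, and it covers only some of the clip's entries. The names `JUNK` and `strip_junk` are hypothetical.

```python
import re

# Fragments to delete, modelled on the literal strings in the clip above.
# Using \d+ for "any number" is an assumption about what later pages contain.
JUNK = [
    r"&pp=\d+&",
    r"pg=\d+",
    r"&re=\d+",
    r"&AVSDM=",
]

def strip_junk(url: str) -> str:
    """Delete each junk fragment in turn, in the same order as the clip."""
    for pattern in JUNK:
        url = re.sub(pattern, "", url)
    return url
```

One numeric pattern per parameter means a new page number does not need a new line in the list.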




        Adrian Worsfold

        http://www.pluralist.co.uk
        http://pluralistspeaks.blogspot.com
        pluralist@...
        04-06-2013
        ----- Received the following content -----
        From: Roopakshi Pathania
        Receiver: ntb-clips
        Time: 2013-06-04, 08:12:36
        Subject: Re: [Clip] Replacing All Words at Once


      • flo.gehrke
        Message 3 of 10 , Jun 4, 2013
          --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
          >
          > I would start by getting a better OCR program. I use
          > OmniPage Pro (...)
          > I can tell you for certain that you will NEVER be able to use
          > NoteTab to 'automatically' fix all the errors lower grade
          > OCR software will make. It simply isn't predictable what the
          > errors will be (...)
          > For instance, nearly every word can be hyphenated in text, and not
          > always in the same fashion...

          I'm using OmniPage Pro but I still have to remove a lot of OCR mistakes.

          I agree with you and Axel that it's impossible to capture all mistakes as particular entries in a clip (or in a long list that would be called by a clip). I've been cleaning scanned text with NT for many years now. In the long run, you will come across an unlimited number of mistakes that follow no rule.

          In my experience, the only way to resolve this problem is with RegEx patterns. Regarding those examples Roopakshi posted in the first message, like...

          dis- tribution --> distribution

          I would use a pattern like...

          ^!Replace "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" >> "" WARS

          in order to remove any '- ' (hyphen followed by a space) between a lowercase letter and an uppercase or lowercase letter.
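For readers who want to try the pattern outside NoteTab, here is a rough Python equivalent. Python's `re` module has no `\K`, so a lookbehind stands in for it, and the POSIX classes are written as ASCII ranges, which is an assumption that ignores accented letters. The name `join_bad_breaks` is made up for this sketch.

```python
import re

# "(?<=[a-z])" plays the role of "[[:lower:]]\K": the letter must be there,
# but it is not part of what gets replaced.
BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")

def join_bad_breaks(text: str) -> str:
    """Remove every hyphen+space that sits between two letters."""
    return BAD_BREAK.sub("", text)
```

Note that the unconditional version also turns "Hewlett- Packard" into "HewlettPackard", which is exactly the kind of case that needs a human decision.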

          I think, however, there's still another problem with hyphenation and bad breaks. Look at words like...

          corr- elation
          attorney-at- law
          proof- reading
          dis- tribution
          Hewlett- Packard

          In lines 2, 3, and 5, we have to remove only the space but not the hyphen. AFAIK, there are no linguistic rules that could help us to define patterns which would match every type of bad break -- at least in German (maybe it's easier in the English language?). That's why I'm working with a kind of "controlled replacement". Test the following clip against that list and you'll see how it works:

          ^!Jump Doc_Start
          :BadBreak
          ^!SetWizardWidth 60
          ^!Find "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" RS
          ^!IfError Out
          ^!Goto ^?{Remove bad breaks:==_Space only|Hyphen+space|Skip}
          :Space only
          ^!InsertText "-"
          ^!Goto BadBreak
          :Hyphen+space
          ^!InsertText ""
          ^!Goto BadBreak
          :Skip
          ^!Goto BadBreak
          :Out
          ^!Info Finished!
          ^!Jump Doc_Start
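The same "controlled replacement" idea can be sketched in Python, with the wizard prompt replaced by a callback that returns one of three answers. This is only an illustration under assumptions: the function name `fix_breaks`, the answer strings "space", "both" and "skip", and the way the surrounding words are cut out as context are all invented here, and the ASCII ranges ignore accented letters.

```python
import re
from typing import Callable

BAD_BREAK = re.compile(r"(?<=[a-z])- (?=[A-Za-z])")

def fix_breaks(text: str, decide: Callable[[str], str]) -> str:
    """Ask `decide` about each bad break; it returns 'space', 'both' or 'skip'."""
    def replace(match: re.Match) -> str:
        # Show the decider the words on either side of the break.
        start = text.rfind(" ", 0, match.start()) + 1
        end = text.find(" ", match.end())
        context = text[start:end if end != -1 else len(text)]
        choice = decide(context)
        if choice == "space":
            return "-"           # drop the space, keep the hyphen
        if choice == "both":
            return ""            # drop hyphen and space
        return match.group(0)    # skip: leave the text untouched
    return BAD_BREAK.sub(replace, text)
```

For interactive use, `decide` could be something like `lambda ctx: input(ctx + " [space/both/skip]? ")`; for repeat runs it could look the context up in a saved table of earlier answers.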

          In practice, it's more complicated, of course. There are bad breaks at line ends (CR/LF) like...

          dis-
          tribution

          also with space or without space etc.
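A pattern for the line-end case might look like the following Python sketch. It assumes that a hyphen at the end of a line followed by a lowercase letter on the next line is always a bad break, which will be wrong for genuine compounds such as "proof-reading", so it is a starting point rather than a rule. The name `join_line_breaks` is hypothetical.

```python
import re

# Hyphen at end of line, optional trailing blanks, CR/LF or LF, optional
# leading blanks on the next line, then a lowercase letter.
LINE_BREAK_HYPHEN = re.compile(r"(?<=[a-z])-[ \t]*\r?\n[ \t]*(?=[a-z])")

def join_line_breaks(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line end."""
    return LINE_BREAK_HYPHEN.sub("", text)
```

Breaks followed by an uppercase letter, as in "Hewlett-" / "Packard", are deliberately left alone for manual review.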

          Nevertheless, it's an interesting job to find out the patterns matching types of OCR-mistakes as far as possible! Meanwhile, my clip is working with about 400 patterns correcting misspellings, bad breaks, changing abbreviations, the style of writing dates, currencies etc, and it's working fine.

          Regards,
          Flo
        • John Shotsky
          Message 4 of 10 , Jun 4, 2013
            In my years of working with hyphenated words, I came to some conclusions that have helped me. Originally, I removed all hyphens that
            were not associated with known hyphenated entities. That failed, because not all of the removals were correct, and it
            became impossible to find the words where a hyphen was incorrectly removed, such as strawberryvanilla (strawberry-vanilla). So I
            had to reverse the logic and make every hyphen STAY unless it could be removed by a regex. It is much easier to spot a word
            with a hyphen that doesn't belong than to identify a word where one is missing, since there is an unending number of such words.

            The end result is that I have written one regex that looks at prefixes and removes the hyphen where they apply, another that does
            the same for suffixes, and a further set that looks at both lines to decide whether the hyphen should be removed.

            I have found no single rule applying to letter case, as product names can be capitalized and hyphenated or not - it is all
            subjective. As a result, my users regularly send me words with hyphens that aren't correct, and I add each word to one of the three
            sections. I don't get many any more, but below is a copy of just the prefix code (all one line).
            ===
            ^!Replace
            "\b(abso|ac|accom|accompa|accus?|ad|ag|alco|alu|ama|ameri?|any|ap|apri|appe|approx?i?|aro?u?|arti|assem|atmo|avo?|bal?|bam|bev?|beau
            |bis|black|bot|broc|bub|bul|bur|but|cab|cali?|canta|cara|carb?o?|carbohy|card|cas|caul?i?|cav|cele|cen?|cer|ch?i|chila|choc?|cinn?a?
            |cit|cle|cof|col|com?n?|combin?|compli|confec|consis|cori|cot|cov|cre|cri|cro|crys|cui?l?r?|cus|defi|des?|devel?|diago|dif|diffi|dig
            es|dis?|disap|discol|discrimi|dol|dou|driz|eas|effi|elec|epiph|esp?e?|evapo?|every|excel|expec|ext?e?|experi?|fa|fen|fif|fla|fra|fre
            |frit?|gen|gi?ar|gaz|geo|ger|gnoc|grad?|guac?a?|granu?|haba|Hal|han|hap|heri|holi|homog|hon|hori|how|hum|hun|hydroge|illus|im|immed?
            |incorpo|indi|indiv|inex|inter|irre|jala|ji|juli|ker|kiel|kiwi|ko|lem|Les|lib|lico|lla|lun|manufac|mara|marga|mari|mathe|mayo?n?|mea
            ?|mem|mis?|micro|mod|moz|muf|mush?|nar|nat|nec|nei|neigh|noo|occ?a?|ol|opin|oppor|orig|out|over|pap|par|pas|pat|pep|phar|pista|piz?|
            pome|pos?|portabel|possi|pow|p?re|prepa|pres|prob|proc?|provo|pud|pun|pur|quesadil|ra|rasp|rea|reci?|recog|recol|recom|recon|refrig|
            [Rr]eg|rel[ai]|repre|resi|restau|ri|ridicu|saf|sal|sand|sauer|sea|sec|sei|self|sepa?|ser|seve?r?|shal|short?|sim|siz|so|some|spat|sp
            ec?|specifi?|spi|spo|sri|stan|stu|sub|sug|supe?|sur|sym|syn|tab?|table|tan|tar|tech|tem|therm?|thor|thriv|tives?|tol|toma|tor?|tras?
            |trans|tri|trun|tur|typ|un|uncom|undis|unwel|[Vv]alen|vanil|veg?e?|versa|vinai|vir|vita|with|water|week|won|Worces|yo|zuc)\K\{-\}"
            >> "" AIRSW
            ^!IfError Next Else Skip_-1
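The keep-by-default logic described above can be sketched in Python with a much shorter list. This is an illustration only: the five fragments are picked from the clip's alternation, the sketch matches a plain hyphen rather than the `\{-\}` token used in the clip, and the name `join_known_prefixes` is made up.

```python
import re

# A tiny subset of the prefix fragments; the real list in the clip is far longer.
PREFIX_FRAGMENTS = ["choc", "cinna", "refrig", "vanil", "zuc"]

PREFIX_HYPHEN = re.compile(
    r"\b(" + "|".join(PREFIX_FRAGMENTS) + r")-(?=[a-z])",
    re.IGNORECASE,
)

def join_known_prefixes(text: str) -> str:
    """Drop the hyphen only after a known fragment; every other hyphen stays."""
    # \1 keeps the fragment exactly as found, so capitalization is preserved.
    return PREFIX_HYPHEN.sub(r"\1", text)
```

Because nothing outside the list is touched, "strawberry-vanilla" keeps its hyphen, and a wrongly kept hyphen stays visible for later correction.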

            If anyone wants my other two sets, please let me know and I can send them privately. There is actually a lot more to my hyphen clip set,
            because many words that should have hyphens aren't hyphenated in the source, and vice versa. So my code then goes into specific
            situations and words to either remove or add hyphens. For example, 'one-at-a-time' cannot be handled by the above methodology; it
            can only be handled by a specific clip designed to detect that set of letters, with or without (all of) the hyphens, with or without
            the spaces, and to place the hyphens IF that is the correct thing to do. Thus, the result may differ from the source, but it will be
            grammatically correct anyway.
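A word-specific rule of that kind might look like the Python sketch below. The decision rule is an assumption made for illustration: variants written with plain spaces throughout are treated as ordinary prose and left alone, and anything else (partial hyphens, missing spaces) is normalized. The real clip presumably applies its own checks. The names `SEP`, `ONE_AT_A_TIME` and `normalise` are hypothetical.

```python
import re

# Each separator may be a hyphen, a space, a hyphen plus space, or missing.
SEP = r"([- ]{0,2})"
ONE_AT_A_TIME = re.compile(
    r"\bone" + SEP + "at" + SEP + "a" + SEP + r"time\b",
    re.IGNORECASE,
)

def normalise(text: str) -> str:
    """Rewrite damaged variants of 'one-at-a-time' in fully hyphenated form."""
    def replace(match: re.Match) -> str:
        # Plain spaces throughout: assumed to be ordinary prose, left alone.
        if all(sep == " " for sep in match.groups()):
            return match.group(0)
        # Keep the original first letter so a capital "One" survives.
        return match.group(0)[0] + "ne-at-a-time"
    return ONE_AT_A_TIME.sub(replace, text)
```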

            Regards,
            John
          • Axel Berger
            Message 5 of 10 , Jun 4, 2013
              John Shotsky wrote:
              > OmniPage is the most expensive program I own,

              I use ABBYY FineReader Professional version 6, and the results depend strongly
              on the quality of the scan. On good scans they're nearly perfect.

              > The point I'm trying to make here is to fix the errors while
              > in the OCR tool,

              I agree, if only because it highlights all doubtful places in the text
              view and helps you deal with them quickly and efficiently. Some errors
              go undetected and unmarked, but not many.

              Axel
            • John Shotsky
              Message 6 of 10 , Jun 4, 2013
                Try scanning and running OCR on a few recipes out of a cookbook. It's like the OCR programs never heard of fractions. That's the biggest problem with
                most of them - they are designed for office documents, not technical documents. It's even OmniPage's weakest point, but it is FAR
                ahead of whatever is in second place. Scan a few cookbooks, and you'll be pulling your hair out. (And looking for a new OCR product.) :-)
                Regards,
                John