
Re: [Clip] Replacing All Words at Once

  • Roopakshi Pathania
    Message 1 of 10, Jun 4, 2013
      Hi Adrian and others:
       
      I went through the clip writing documentation. The script below works for me. Adrian, I hope this is what you meant. The idea is to execute the script whenever I open a new scanned document.
      Please suggest any improvements that can be made. Also, while using "I", just as in the usual Replace dialog boxes, I cannot seem to get a lowercase entry to replace both lowercase and uppercase entries without changing the uppercase words. Is there a way to do this except for entering both uppercase and lowercase words separately?
       
      Thanks
      Roopakshi
       
      ^!Replace "eaming" >> "earning" [TIWAS]
       
      ^!Replace "modem" >> "modern" [TIWAS]
       
      ^!Replace "tra- ditional" >> "traditional" [TIWAS]
       
      ^!Replace "cor- porate" >> "corporate" [TIWAS]


      Sent from my Lenovo ThinkPad

      --- On Sat, 6/1/13, Adrian Worsfold <pluralist@...> wrote:


      From: Adrian Worsfold <pluralist@...>
      Subject: Re: [Clip] Replacing All Words at Once
      To: "ntb-clips" <ntb-clips@yahoogroups.com>
      Date: Saturday, June 1, 2013, 2:58 AM

      Hello Roopakshi Pathania


      All I'd do is build up lots of WAS condition finds and replaces as the scanning errors repeat themselves. It is a clip that grows as you find more repeatable errors.




      Adrian Worsfold

      http://www.pluralist.co.uk
      http://pluralistspeaks.blogspot.com
      pluralist@...
      31-05-2013
      ----- Received the following content -----
      From: Roopakshi Pathania
      Receiver: ntb-clips
      Time: 2013-05-31, 19:32:05
      Subject: [Clip] Replacing All Words at Once

    • Axel Berger
      Message 2 of 10, Jun 4, 2013
        Roopakshi Pathania wrote:
        > Is there a way to do this except for entering both uppercase
        > and lowercase words separately?

        No, but you can combine them using (X|x) and $n with R and without I.
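
        For example, applied to one of the words from Roopakshi's list, such a replacement might look like this (an untested sketch; note the R option in place of I, so whichever case was matched is put back by $1):

        ^!Replace "(M|m)odem" >> "$1odern" WARS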

        Also, I fear the OCR's capacity for mistakes will soon exceed yours for maintaining the list, but that's just an opinion; it may work for you.

        Axel
      • John Shotsky
        Message 3 of 10, Jun 4, 2013
          I would start by getting a better OCR program. I use OmniPage Pro, but the pro version is not necessary. You can find older versions on eBay for not a lot of money. OmniPage has a training mode, so you can teach it to remember the corrections you make and it won't repeat the same mistakes. After a while, it runs at close to 100% accuracy. I have scanned thousands of pages and I can tell you for certain that you will NEVER be able to use NoteTab to 'automatically' fix all the errors that lower-grade OCR software will make. It simply isn't predictable what the errors will be. This stems from different fonts used in the source, different background colors on the source material, an incorrect scanning angle, etc. OCR works by trying to interpret a set of dots of a certain height and width. If there is even a tiny smear on the source material, it will produce a different pattern of dots, and that may be enough to cause errors. To that you can add special characters, fractions and a whole host of other issues that can affect scanning/recognition accuracy.

          I have used pretty much every OCR program available, and find that OmniPage is FAR better than any of the others. One nice thing is that it can recognize PDF files too, so if you have PDF files that are not 'text' but 'graphic', you can still get the text out. It also works with smartphone cameras, to shoot photos of pages and recognize them. OmniPage is the most expensive program I own, and I always get the updates, which come about every two years.

          The point I'm trying to make here is to fix the errors while in the OCR tool, because after that it is nearly impossible. After the fact, you must have your source material in view while reading the text in NoteTab to identify problems. It is probably ten times harder and slower to do in NoteTab than in the OCR tool. For instance, nearly every word can be hyphenated in text, and not always in the same fashion. It may be hyphenated after a prefix, before a suffix, or at any double letter or any other syllable. You will only discover them one at a time; you can't write a whole dictionary of hyphenated words, it would take gigabytes and the rest of your life. And some hyphens should not be removed, such as in reduced-fat, while in other cases a word such as non-fat should not be hyphenated at all. So you can't always break at non-, nor at -fat. The point of all that is that you have to be looking at your paper and the result at the same time to know what needs to be changed.

          Regards,
          John
          RecipeTools Web Site: http://recipetools.gotdns.com/
          John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

        • Adrian Worsfold
          Message 4 of 10, Jun 4, 2013
            Hello Roopakshi Pathania

            I think the warning about variation in OCR-scanned text is correct, because each time you are going to have to add to the list of replacements. And what if 'modem' really is 'modem' and not 'modern'?

            The similar list of replacements I have built up, and keep adding to, works because the text I deal with is regular. I'm sure others will see this clip as cumbersome, and I don't do regex, with its greater flexibility (unless it is offered and explained).

            H="Jobcentre excess remove"
            ^!Jump Doc_Start
            :LOOP
            ^!Replace "=3&setype=2&pg=4&AVSDM=" >> "" WAS
            ^!Replace "&pp=25&" >> "" WAS
            ^!Replace "pg=1" >> "" WAS
            ^!Replace "pg=2" >> "" WAS
            ^!Replace "pg=3" >> "" WAS
            ^!Replace "pg=4" >> "" WAS
            ^!Replace "pg=5" >> "" WAS
            ^!Replace "pg=6" >> "" WAS
            ^!Replace "pg=7" >> "" WAS
            ^!Replace "&where=HU7+4UD&sort=rv.dt.di&rad=20&rad_units=miles" >> "" WAS
            ^!Replace "&re=134" >> "" WAS
            ^!Replace "&re=3" >> "" WAS
            ^!Replace "&AVSDM=" >> "" WAS
            ^!IfError END
            ^!GoTo LOOP
            :END
            ^!Jump Doc_End

            (The loop is probably unnecessary, the pg= lines could be handled with a counter, and there is clearly some repetition, but the additions to the clip simply respond to whatever was left over in later searched pages.)

            The result is that a page URL is reduced to its essentials (by which it can still be found again).
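
            For what it's worth, the repeated pg= lines could be collapsed into a counter loop along these lines (an untested sketch; the variable name Pg and the upper limit of 7 are just placeholders):

            ^!Set %Pg%=1
            :PgLoop
            ^!Replace "pg=^%Pg%" >> "" WAS
            ^!Inc %Pg%
            ^!If ^%Pg% > 7 PgDone
            ^!Goto PgLoop
            :PgDone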




            Adrian Worsfold

            http://www.pluralist.co.uk
            http://pluralistspeaks.blogspot.com
            pluralist@...
            04-06-2013
            ----- Received the following content -----
            From: Roopakshi Pathania
            Receiver: ntb-clips
            Time: 2013-06-04, 08:12:36
            Subject: Re: [Clip] Replacing All Words at Once


            [Non-text portions of this message have been removed]
          • flo.gehrke
            Message 5 of 10, Jun 4, 2013
              --- In ntb-clips@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
              >
              > I would start by getting a better OCR program. I use
              > OmniPage Pro (...)
              > I can tell you for certain that you will NEVER be able to use
              > NoteTab to 'automatically' fix all the errors lower grade
              > OCR software will make. It simply isn't predictable what the
              > errors will be (...)
              > For instance, nearly every word can be hyphenated in text, and not
              > always in the same fashion...

              I'm using OmniPage Pro, but I still have to remove a lot of OCR mistakes.

              I agree with you and Axel that it's impossible to capture all mistakes as particular entries in a clip (or in a long list that would be called by a clip). I've been cleaning scanned text with NT for many years now. In the long run, you will come across an unlimited number of mistakes that follow no rule.

              In my experience, the only way to resolve this problem is with RegEx patterns. Regarding the examples Roopakshi posted in the first message, like...

              dis- tribution --> distribution

              I would use a pattern like...

              ^!Replace "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" >> "" WARS

              in order to remove any '- ' (hyphen followed by a space) between a lowercase character and an uppercase or lowercase character.

              I think, however, there's still another problem with hyphenation and bad breaks. Look at words like...

              corr- elation
              attorney-at- law
              proof- reading
              dis- tribution
              Hewlett- Packard

              In lines 2, 3, and 5, we have to remove only the space, not the hyphen. AFAIK, there are no linguistic rules that could help us define patterns matching every type of bad break -- at least in German (maybe it's easier in English?). That's why I'm working with a kind of "controlled replacement". Test the following clip against that list and you'll see how it works:

              ^!Jump Doc_Start
              :BadBreak
              ^!SetWizardWidth 60
              ^!Find "[[:lower:]]\K-\x20(?=[[:upper:][:lower:]])" RS
              ^!IfError Out
              ^!Goto ^?{Remove bad breaks:==_Space only|Hyphen+space|Skip}
              :Space only
              ^!InsertText "-"
              ^!Goto BadBreak
              :Hyphen+space
              ^!InsertText ""
              ^!Goto BadBreak
              :Skip
              ^!Goto BadBreak
              :Out
              ^!Info Finished!
              ^!Jump Doc_Start

              In practice, it's more complicated, of course. There are also bad breaks at line ends (CR/LF) like...

              dis-
              tribution

              with or without a space, and so on.
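
              A variant of the pattern extended across the line break might look like this (untested, and it needs the same "controlled replacement" treatment, since hyphens that belong to the word must stay):

              ^!Find "[[:lower:]]\K-\x20?[\r\n]+\x20?(?=[[:upper:][:lower:]])" RS

              Dropped into the :BadBreak clip above in place of the original ^!Find line, it would stop at breaks over line ends as well.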

              Nevertheless, it's an interesting job to work out patterns that match as many types of OCR mistakes as possible! Meanwhile, my clip runs about 400 patterns correcting misspellings and bad breaks, and normalizing abbreviations, date formats, currencies, etc., and it's working fine.
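
              Just to give an idea of the non-OCR patterns in such a collection, a date rewrite can be as simple as this (a hypothetical example, turning 04.06.2013 into 2013-06-04):

              ^!Replace "\b(\d{2})\.(\d{2})\.(\d{4})\b" >> "$3-$2-$1" WARS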

              Regards,
              Flo
            • John Shotsky
              Message 6 of 10, Jun 4, 2013
                In my years of working with hyphenated words, I came to some conclusions that have helped me. Originally, I removed all hyphens that were not associated with known hyphenated entities. That failed, because not all of the hyphens that were removed should have been, and it became impossible to find the words where a hyphen had been incorrectly removed, such as in strawberryvanilla (strawberry-vanilla). So I had to reverse the logic and make every hyphen STAY unless it could be removed by a regex rule. It is much easier to spot a word with a hyphen that doesn't belong than to identify a word where one is missing, since there is an unending number of such words.

                The end result is that I have written one regex that looks at prefixes and eliminates the hyphens after those that apply, another that looks at suffixes and eliminates those, and then another set that looks across both lines to see whether a hyphen should be removed.

                I have found no single rule applying to letter case, as product names can be capitalized and hyphenated or not - it is all subjective. As a result, my users regularly send me words with hyphens that aren't correct, and I add each word to one of the three sections. I don't get many any more, but below is a copy of just the prefix code (all one line).
                ===
                ^!Replace "\b(abso|ac|accom|accompa|accus?|ad|ag|alco|alu|ama|ameri?|any|ap|apri|appe|approx?i?|aro?u?|arti|assem|atmo|avo?|bal?|bam|bev?|beau|bis|black|bot|broc|bub|bul|bur|but|cab|cali?|canta|cara|carb?o?|carbohy|card|cas|caul?i?|cav|cele|cen?|cer|ch?i|chila|choc?|cinn?a?|cit|cle|cof|col|com?n?|combin?|compli|confec|consis|cori|cot|cov|cre|cri|cro|crys|cui?l?r?|cus|defi|des?|devel?|diago|dif|diffi|diges|dis?|disap|discol|discrimi|dol|dou|driz|eas|effi|elec|epiph|esp?e?|evapo?|every|excel|expec|ext?e?|experi?|fa|fen|fif|fla|fra|fre|frit?|gen|gi?ar|gaz|geo|ger|gnoc|grad?|guac?a?|granu?|haba|Hal|han|hap|heri|holi|homog|hon|hori|how|hum|hun|hydroge|illus|im|immed?|incorpo|indi|indiv|inex|inter|irre|jala|ji|juli|ker|kiel|kiwi|ko|lem|Les|lib|lico|lla|lun|manufac|mara|marga|mari|mathe|mayo?n?|mea?|mem|mis?|micro|mod|moz|muf|mush?|nar|nat|nec|nei|neigh|noo|occ?a?|ol|opin|oppor|orig|out|over|pap|par|pas|pat|pep|phar|pista|piz?|pome|pos?|portabel|possi|pow|p?re|prepa|pres|prob|proc?|provo|pud|pun|pur|quesadil|ra|rasp|rea|reci?|recog|recol|recom|recon|refrig|[Rr]eg|rel[ai]|repre|resi|restau|ri|ridicu|saf|sal|sand|sauer|sea|sec|sei|self|sepa?|ser|seve?r?|shal|short?|sim|siz|so|some|spat|spec?|specifi?|spi|spo|sri|stan|stu|sub|sug|supe?|sur|sym|syn|tab?|table|tan|tar|tech|tem|therm?|thor|thriv|tives?|tol|toma|tor?|tras?|trans|tri|trun|tur|typ|un|uncom|undis|unwel|[Vv]alen|vanil|veg?e?|versa|vinai|vir|vita|with|water|week|won|Worces|yo|zuc)\K\{-\}" >> "" AIRSW
                ^!IfError Next Else Skip_-1
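
                A suffix-side rule would presumably take a similar shape. Purely as an illustration (this is not John's actual suffix set, just a guess at the form such a rule might take, written in the style of Flo's pattern above):

                ^!Replace "[[:lower:]]\K-\x20?(?=(tion|sion|ment|ness|able|ible|ally|ized)s?\b)" >> "" WARS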

                If anyone wants my other two sets, please let me know, I can send it privately. There is actually a lot more to my hyphen clip set,
                because many words that should have hyphens aren't hyphenated in the source, and vice versa. So my code then goes into specific
                situations and words to either remove or add hyphens. For example, 'one-at-a-time' cannot be handled by the above methodology, but
                can only be handled by a specific clip designed to detect that set of letters, with or without (all of) the hyphens, with or without
                the spaces, and place the hyphens IF that is the correct thing to do. Thus, it may differ from the source, but it will be
                grammatically correct anyway.

                Regards,
                John
                RecipeTools Web Site: http://recipetools.gotdns.com/
                John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/

              • Axel Berger
                Message 7 of 10, Jun 4, 2013
                  John Shotsky wrote:
                  > OmniPage is the most expensive program I own,

                  I use Abbyy Fine Reader Prof. version 6 and the results depend strongly
                  on the quality of the scan. On good scans they're nearly perfect.

                  > The point I'm trying to make here is to fix the errors while
                  > in the OCR tool,

                  I agree, if only because it highlights all doubtful places in the text
                  view and helps you deal with them quickly and efficiently. Some errors
                  go undetected and unmarked, but not many.

                  Axel
                • John Shotsky
                  Message 8 of 10, Jun 4, 2013
                    Try scanning and OCR on a few recipes out of a cookbook. It's like they never heard of fractions. That's the biggest problem with most of them - they are designed for office documents, not technical documents. It's even OmniPage's weakest point, but it is FAR ahead of whatever is in second place. Scan a few cookbooks and you'll be pulling your hair out. (And looking for a new OCR product.)
                    Regards,
                    John
                    RecipeTools Web Site: http://recipetools.gotdns.com/
                    John's Mags Yahoo Group: http://groups.yahoo.com/group/johnsmags/
