
changing line length of OCR scanned text

  • Mike Breiding
    Message 1 of 12 , Apr 23 4:27 AM
      Greetings,

      I have OCR-scanned docs where line lengths vary.
      I would like each paragraph to be one unbroken line.

      Is there a clip solution for this?

      Thanks,
      -Mike
      ===============
      SAMPLE 1
      West Virginia's allotment from the Land and Water Conservation Fund
      (LWCF) the Bureau of Outdoor Recreation (BOR) is disbursed at a ratio of
      approximately 60% for state-operated projects. W for community recreational
      facilities.

      SAMPLE 2
      During 50 years of bird watching I have seen the gradual diminishing and
      at times
      cataclysmic destruction of the natural environment[consistent
      destruction of the habitat]. Aside from all the poisons spewed into the
      atmosphere and waters plus the havoc that has been wreaked upon the
      landscape, [and consistent destruction of habitat]there has been a
      pervasive shrinking of living space for wild creatures.
    • Jeff Scism
      Message 2 of 12 , Apr 23 5:44 AM
        How about selecting the section and using Ctrl+J (Join Lines)?





        --


        Jeffery G. Scism, IBSSG
        ~~

        "Proponents of each side are vying with determination to prove their ignorance is greater than the other."

        President Andrew Jackson, discussing a bill going through the US Congress.



        Visit http://ibssg.org/
        For The Blacksheep website, MORE...

        Putnam County Indiana Biographies and Obituaries
        http://ingenweb.org/inputnam/bios/

        Montgomery County Indiana Biographies and Obituaries
        http://ingenweb.org/inmontgomery/bios/

        Fountain County Indiana Biographies and Obituaries
        http://ingenweb.org/infountain/vitals/bios/
      • Mike Breiding
        Message 3 of 12 , Apr 23 7:11 AM
          Jeff Scism wrote:
          > How about Selecting the section and using Ctrl+J (Join lines)?
          Hi Jeff,
          I did not know about Ctrl+J (Join Lines).
          This works! But is there a way to automate the process so it joins each
          paragraph separately?
          When there is a block of text like the one below, it is easy for me to
          distinguish paragraphs, but I wonder how NT would find them for processing.
          Thanks,
          -Mike

          "I have come to rely upon, or more aptly put, resorted to, are: (1)
          cemeteries and: (2)
          railroad right-of-ways.
          For instance, when I was in military service during World War II, I
          learned a good
          place to look for birds was in the bigger and older cemeteries of the
          larger towns and
          cities. Many of the larger cemeteries are like and oasis surrounded by
          all types of
          urbanization. The older ones usually attract birds because of the
          variety and stages of
          plant life.
          The older cemeteries are usually in or near the better residential
          sections which
          generally are landscaped with some types of trees and shrubs that
          provide food and cover
          for birds and other wildlife."
        • Don - HtmlFixIt.com
          Message 4 of 12 , Apr 23 7:15 AM
            Mike Breiding wrote:
            > Jeff Scism wrote:
            >> How about Selecting the section and using Ctrl+J (Join lines)?
            >>

            select the paragraph
            join
            jump to the next paragraph (may need a jump plus a select-to-end)
            may need to check for a blank line, as that isn't an end of paragraph

            With a little fiddling this should be easy to do, I think.

            I use Control + J often.

            It works well on emailed content that gets line wrapped with hard
            returns inserted.
          • loro
            Message 5 of 12 , Apr 23 7:46 AM
              Mike Breiding wrote:
              >Jeff Scism wrote:
              > > How about Selecting the section and using Ctrl+J (Join lines)?

              >This works! But, is there a way to automate the process so it joins each
              >paragraph separately?

              Ctrl+A Ctrl+J

              As long as there is at least one blank line between the blocks, that is.

              Lotta
            • hsavage
              Message 6 of 12 , Apr 23 7:46 AM

                Mike,

                Here's a short clip that should solve your problem if it's formatted as
                in your example.

                -------------------
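                H="FormatLines"
                ; Remember the current word-wrap state, then turn word wrap off
                ^!Set %ww%=^$IsWordWrap$
                ^!SetWordWrap 0
                ; Protect paragraph breaks (two line breaks in a row) with a placeholder
                ^!Replace "^p^p" >> "zxzx" TIWSA
                ; Join all remaining lines, then put the paragraph breaks back
                ^!Select ALL
                ^!Menu Modify/Lines/Join Lines
                ^!Replace "zxzx" >> "^p^p" TIWSA
                ; Restore the original word-wrap state
                ^!SetWordWrap ^%ww%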
                -------------------
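
For readers who want to try the same idea outside NoteTab, the placeholder trick in the clip above can be sketched in Python. This is only an illustration, not part of the original clip: the function name `join_blank_line_paragraphs` and the `PLACEHOLDER` marker are made up for the example, and it assumes that joining lines means replacing each line break with a single space.

```python
import re

# Hypothetical marker that plays the role of the clip's "zxzx" placeholder.
PLACEHOLDER = "\x00PARA\x00"


def join_blank_line_paragraphs(text):
    """Join the lines of each paragraph; paragraphs are separated by blank lines."""
    text = text.replace("\r\n", "\n").strip("\n")
    # Step 1: protect paragraph breaks (one or more blank lines) with the placeholder.
    text = re.sub(r"\n(?:[ \t]*\n)+", PLACEHOLDER, text)
    # Step 2: turn every remaining line break into a single space.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # Step 3: restore each paragraph break as one blank line.
    return text.replace(PLACEHOLDER, "\n\n")


if __name__ == "__main__":
    sample = (
        "During 50 years of bird watching I have seen the gradual diminishing and\n"
        "at times\n"
        "cataclysmic destruction of the natural environment.\n"
        "\n"
        "West Virginia's allotment from the Land and Water Conservation Fund\n"
        "is disbursed to community recreational\n"
        "facilities.\n"
    )
    print(join_blank_line_paragraphs(sample))
```

Like the clip, this only helps when at least one blank line separates the paragraphs.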


                ·············································
                ºvº SL_114 created_2008.04.23_02.14.25

                Measure of SUCCESS:
                • At age 50 is...
                Having money.
                € hrs € hsavage € pobox € com
              • Mike Breiding
                Message 7 of 12 , Apr 23 8:00 AM
                  This works, but only on docs with a blank line between paragraphs. I was
                  afraid this might be a problem.

                  Thanks for sending the clip!
                  -Mike
                • Mike Breiding
                  Message 8 of 12 , Apr 23 8:01 AM
                    loro wrote:
                    > Mike Breiding wrote:
                    >
                    >> Jeff Scism wrote:
                    >>
                    >>> How about Selecting the section and using Ctrl+J (Join lines)?
                    >> This works! But, is there a way to automate the process so it joins each
                    >> paragraph separately?
                    >>
                    >
                    > Ctrl+A Ctrl+J
                    > As long as there is at least one blank line between the blocks, that is.
                    > Lotta
                    Unfortunately, no blank lines between paragraphs.
                    Thanks,
                    -Mike
                  • Don - HtmlFixIt.com
                    Message 9 of 12 , Apr 23 8:02 AM
                      Mike Breiding wrote:
                      > Unfortunately, no blank lines between paragraphs.
                      > Thanks,
                      > -Mike

                      Mike, can you send me a text file directly with a sample in it?

                      What distinguishes a paragraph? A return followed by a capital in
                      most or all cases?

                      Having a look at the sample may do it.
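
The question above suggests one possible rule: a return followed by a capital letter marks the start of a new paragraph. A minimal Python sketch of that rule, assuming it holds for the text at hand (the function name `join_by_capital_start` is hypothetical, and this is an illustration rather than a NoteTab clip):

```python
def join_by_capital_start(text):
    """Start a new paragraph at every line that begins with a capital letter."""
    paragraphs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines carry no text
        if not paragraphs or line[0].isupper():
            paragraphs.append([line])    # capital after a return: new paragraph
        else:
            paragraphs[-1].append(line)  # otherwise continue the current paragraph
    # Each paragraph ends up on a single line.
    return "\n".join(" ".join(p) for p in paragraphs)


if __name__ == "__main__":
    sample = (
        "railroad right-of-ways.\n"
        "For instance, when I was in military service during World War II, I\n"
        "learned a good\n"
        "place to look for birds.\n"
        "The older cemeteries are usually in or near the better residential\n"
        "sections which\n"
        "provide food and cover"
    )
    print(join_by_capital_start(sample))
```

The rule is only a heuristic: any wrapped line that happens to begin with a capitalized word, such as a proper noun, would be split off as a new paragraph.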
                    • Jeff Scism
                      Message 10 of 12 , Apr 23 8:15 AM
                        If all your paragraphs end in a period followed by the line break (.^P),
                        you can have the Replace command replace every .^P with .^P^P, which puts
                        two "returns" after each paragraph. Then run Ctrl+A and Ctrl+J to join
                        them all.


                        ^!REPLACE ".^P" >> ".^P^P" BW
                        ^!KEYBOARD CTRL+A CTRL+J

                        The BW code at the end of the first line indicates that the search
                        starts from the BOTTOM of the doc and goes UP, and the W tells it to do
                        all it finds.
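
The same period-at-end-of-line rule can be sketched in Python for comparison. This is an illustration under the assumption stated above (every paragraph ends in a period at the end of a line); the function name `join_by_period_end` is hypothetical and the code is not a NoteTab clip.

```python
def join_by_period_end(text):
    """Treat a line that ends in a period as the last line of a paragraph."""
    paragraphs = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore empty lines
        current.append(line)
        if line.endswith("."):
            # A period right before the line break closes the paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing text that never reached a final period
        paragraphs.append(" ".join(current))
    # Separate paragraphs with a blank line, as the two "returns" would.
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        "urbanization. The older ones usually attract birds because of the\n"
        "variety and stages of\n"
        "plant life.\n"
        "The older cemeteries are usually in or near the better residential\n"
        "sections which\n"
        "provide food and cover\n"
        "for birds and other wildlife."
    )
    print(join_by_period_end(sample))
```

A sentence that happens to end exactly where the OCR wrapped a line would be treated as the end of a paragraph, so some manual cleanup should be expected.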


                        Jeff



                      • Mike Breiding
                        Message 11 of 12 , Apr 23 8:38 AM
                          Ah-ha!! I missed an obvious S&R opportunity there.
                          I did the S&R (replace ".^P" with ".^P^P") and then ran the
                          FormatLines clip from "hsavage" (what is your first name, "hsavage"?)
                          and it got 90% of them.
                          There are some chopped paragraphs from the sloppy OCR, but I can fix
                          those manually.

                          The OCRs I have are from all kinds of documents of all ages, fonts,
                          paper qualities, etc., so the formatting of the resulting docs is a
                          mixed bag. Some are going to be easy, some a pain in the a**.
                          With the solutions I have now, and maybe more from Don on the way, this
                          will hopefully get most of it cleaned up.

                          As always, thanks for the help!!
                          -Mike
                        • hsavage
                          Message 12 of 12 , Apr 23 8:41 AM
                            Mike Breiding wrote:
                            > "hsavage" ( what is your first name "hsavage"?)
                            >
                            > -Mike
                            >
                            Harvey
