Re: letter wrap in the middle of words

  • HOGG Maynard
    Message 1 of 8 , Jul 1 7:13 PM
      On Fri, Jun 27, 2014 at 4:02 PM, HOGG Maynard <maynard.hogg@...> wrote:
      >these spaces also creep in when PDF conversion tools attempt
      > to remove carriage returns without knowing that spaces and hyphens are
      > not the only characters that appear at the ends of lines.

      > If you like the spaces first mentioned above, you can protect them
      > with something like the following. (untested)

      > s/([^ !-~]+) ([^ !-~]+)/$1$2/g

      OmegaT's PDF filter has this problem in spades because Japanese and
      Chinese have "letter wrap." None of that sissy "break at word
      boundaries" nonsense.

      The good news is that, unlike OOo, it doesn't have to worry about
      spaces in the middle of lines—i.e., legitimate ones in ASCII text and
      those pesky kerning artifacts. It can just concentrate on how to
      replace carriage returns and environs: no change (blank lines
      indicating paragraph breaks), nothing after a hyphen, and space
      everywhere else.
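
A minimal Java sketch of the three-way rule above, for readers who want to see it spelled out. This is not OmegaT's actual filter code: the class name `LineJoiner` and the method `join` are hypothetical, and the input is assumed to be a list of lines already extracted from the PDF.

```java
import java.util.List;

// Hypothetical helper, not part of OmegaT: joins extracted PDF lines
// using the three cases described in the message above.
final class LineJoiner {

    static String join(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        boolean paragraphBreak = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                // A blank line marks a paragraph break; remember it,
                // but ignore blank lines before any text has been seen.
                paragraphBreak = sb.length() > 0;
                continue;
            }
            if (sb.length() > 0) {
                if (paragraphBreak) {
                    sb.append("\n\n");            // keep the paragraph break
                } else if (sb.charAt(sb.length() - 1) != '-') {
                    sb.append(' ');               // space everywhere else
                }                                 // nothing after a hyphen
            }
            sb.append(line);
            paragraphBreak = false;
        }
        return sb.toString();
    }
}
```

Note that the hyphen itself is kept, so "conver-" followed by "sion" comes out as "conver-sion". Deciding whether a line-final hyphen is hard or soft is a separate problem.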

      The quick and dirty fix would be to conflate the last two into nothing
      everywhere else. Inserting the occasional missing space by hand is a
      small price to pay for eliminating match failures across line breaks.
    • HOGG Maynard
      Message 2 of 8 , Jul 2 4:27 PM
        On Wed, Jul 2, 2014 at 11:13 AM, HOGG Maynard <maynard.hogg@...> wrote:
        > On Fri, Jun 27, 2014 at 4:02 PM, HOGG Maynard <maynard.hogg@...> wrote:
        >>these spaces also creep in when PDF conversion tools attempt
        >> to remove carriage returns without knowing that spaces and hyphens are
        >> not the only characters that appear at the ends of lines.

        >> If you like the spaces first mentioned above, you can protect them
        >> with something like the following. (untested)

        >> s/([^ !-~]+) ([^ !-~]+)/$1$2/g

        > OmegaT's PDF filter has this problem in spades because Japanese and
        > Chinese have "letter wrap." None of that sissy "break at word
        > boundaries" nonsense.

        > The quick and dirty fix would be to conflate the last two into nothing
        > everywhere else. Inserting the occasional missing spaces is a small
        > price to pay for the match failures across line breaks.

        I checked Options > File Filters.

        The PDF filter has no options. Time for an RFE?
      • Didier Briel
        Message 3 of 8 , Jul 3 12:53 AM
          > -----Original Message-----
          > From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]
          > Sent: Thursday, July 03, 2014 1:27 AM
          > To: omegat
          > Subject: [OmT] Re: letter wrap in the middle of words
          >
          > > OmegaT's PDF filter has this problem in spades because Japanese and
          > > Chinese have "letter wrap." None of that sissy "break at word
          > > boundaries" nonsense.
          >
          > > The quick and dirty fix would be to conflate the last two into nothing
          > > everywhere else. Inserting the occasional missing spaces is a small
          > > price to pay for the match failures across line breaks.
          >
          > I checked Options > File Filters.
          >
          > The PDF filter has no options. Time for an RFE?

          Yes, if you think the PDF input filter can be improved.
          The code is *very* simple:
          https://sourceforge.net/p/omegat/svn/HEAD/tree/trunk/src/org/omegat/filters2/pdf/PdfFilter.java

          Of course, it depends a lot on what the PDF library is doing (notably PDFTextStripper).

          Didier


          >
          >
          > ------------------------------------
          > Posted by: HOGG Maynard <maynard.hogg@...>
          > ------------------------------------
          >
          > The OmegaT Project Philosophy:
          > http://www.omegat.org/en/philosophy.html
          > The OmegaT Project and You:
          > http://www.omegat.org/en/involved.html
          >
          > OmegaT contributors should join the "omegat-development" group OmegaT
          > localizers should join the "omegat-l10n" group
          > http://sourceforge.net/mail/?group_id=68187
          >
          > IRC channel: http://java.freenode.net//index.php?channel=omegat
          > or: irc://irc.freenode.net/omegat
          > Bug reports, feature requests, OmegaT test versions etc...:
          > http://sourceforge.net/projects/omegat/
          >
          > ------------------------------------
          >
          >
          > ------------------------------------
          >
          > Yahoo Groups Links
          >
          >
          >
        • HOGG Maynard
          Message 4 of 8 , Jul 3 2:06 AM
            On Thu, Jul 3, 2014 at 4:53 PM, 'Didier Briel' d.briel@...
            [OmegaT] <OmegaT@yahoogroups.com> wrote:
            >> I checked Options > File Filters.
            >> The PDF filter has no options. Time for an RFE?

            > Yes, if you think the PDF input filter can be improved.
            > The code is *very* simple:
            > https://sourceforge.net/p/omegat/svn/HEAD/tree/trunk/src/org/omegat/filters2/pdf/PdfFilter.java

            I posted my RFE this morning.

            All that should be needed is a few tweaks to the two sb.append() calls
            to switch the “glue” between a space and an empty string as
            appropriate for the characters before and/or after the unwanted
            carriage return.

            > Of course, it depends a lot on what the PDF library is doing (notably PDFTextStripper).

            One baby step at a time. I'm not sure what can be done about tables of
            numbers, for example.
          • Didier Briel
            Message 5 of 8 , Jul 3 2:53 AM
              > -----Original Message-----
              > From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]
              > Sent: Thursday, July 03, 2014 11:06 AM
              > To: omegat
              > Subject: Re: [OmT] Re: letter wrap in the middle of words
              >
              > On Thu, Jul 3, 2014 at 4:53 PM, 'Didier Briel' d.briel@... [OmegaT]
              > <OmegaT@yahoogroups.com> wrote:
              > >> I checked Options > File Filters.
              > >> The PDF filter has no options. Time for an RFE?
              >
              > > Yes, if you think the PDF input filter can be improved.
              > > The code is *very* simple:
              > > https://sourceforge.net/p/omegat/svn/HEAD/tree/trunk/src/org/omegat/filters2/pdf/PdfFilter.java
              >
              > I posted my RFE this morning.

              I don't see it
              https://sourceforge.net/p/omegat/feature-requests/

              Do you have the URL?

              > All that should be needed is a few tweaks to the two sb.append() calls to
              > switch the “glue” between a space and an empty string as appropriate for
              > the characters before and/or after the unwanted carriage return.

              If you can provide the code, do not hesitate to do so.

              Didier
            • HOGG Maynard
              Message 6 of 8 , Jul 3 11:30 PM
                On Thu, Jul 3, 2014 at 6:53 PM, 'Didier Briel' d.briel@...
                [OmegaT] <OmegaT@yahoogroups.com> wrote:
                >> I posted my RFE this morning.

                > I don't see it
                > https://sourceforge.net/p/omegat/feature-requests/
                > Do you have the URL?

                It's gone. "We are not here. We are not having this conversation."

                >> All that should be needed is a few tweaks to the two sb.append() calls to
                >> switch the “glue” between a space and an empty string as appropriate for
                >> the characters before and/or after the unwanted carriage return.

                > If you can provide the code, do not hesitate to do so.

                I've changed my mind.

                ASCII before and after: space
                UHan before and after: no space
                Mixes: User's choice

                Commenting out the second sb.append() call would do it for Unified Han users.
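
The three cases above could be sketched in Java roughly as follows. Everything here is illustrative rather than taken from PdfFilter.java: `GlueChooser`, `MixedPolicy`, and `glue` are hypothetical names, "ASCII" is taken to mean printable ASCII (0x21 to 0x7E), and "UHan" is approximated as "anything that is not printable ASCII", which also sweeps in kana and full-width punctuation.

```java
// Hypothetical helper, not part of OmegaT: picks the "glue" that replaces
// an unwanted line break, based on the characters on either side of it.
final class GlueChooser {

    // The user's choice for mixed ASCII/non-ASCII boundaries.
    enum MixedPolicy { SPACE, NONE }

    // Assumption: "ASCII" means printable ASCII, 0x21 through 0x7E.
    static boolean isAscii(int codePoint) {
        return codePoint >= 0x21 && codePoint <= 0x7E;
    }

    static String glue(String before, String after, MixedPolicy mixed) {
        if (before.isEmpty() || after.isEmpty()) {
            return "";                            // nothing to join
        }
        int last = before.codePointBefore(before.length());
        int first = after.codePointAt(0);
        boolean asciiBefore = isAscii(last);
        boolean asciiAfter = isAscii(first);
        if (asciiBefore && asciiAfter) {
            return " ";                           // ASCII before and after: space
        }
        if (!asciiBefore && !asciiAfter) {
            return "";                            // non-ASCII on both sides: no space
        }
        return mixed == MixedPolicy.SPACE ? " " : "";  // mixes: user's choice
    }
}
```

A caller would append `glue(previousLine, nextLine, policy)` where the filter currently appends a fixed space.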
              • Didier Briel
                Message 7 of 8 , Jul 4 12:16 AM
                  > -----Original Message-----
                  > From: OmegaT@yahoogroups.com [mailto:OmegaT@yahoogroups.com]
                  > Sent: Friday, July 04, 2014 8:31 AM
                  > To: omegat
                  > Subject: Re: [OmT] Re: letter wrap in the middle of words
                  >
                  > On Thu, Jul 3, 2014 at 6:53 PM, 'Didier Briel' d.briel@... [OmegaT]
                  > <OmegaT@yahoogroups.com> wrote:


                  > >> All that should be needed is a few tweaks to the two sb.append()
                  > >> calls to switch the “glue” between a space and an empty string as
                  > >> appropriate for the characters before and/or after the unwanted
                  > >> carriage return.
                  >
                  > > If you can provide the code, do not hesitate to do so.
                  >
                  > I've changed my mind.
                  >
                  > ASCII before and after: space
                  > UHan before and after: no space
                  > Mixes: User's choice
                  >
                  > Commenting out the second sb.append() call would do it for Unified Han
                  > users.

                  It would still require an RFE explaining the name (e.g., Do not add a space to replace empty lines) and the rationale behind the option.

                  Didier


                • HOGG Maynard
                  Message 8 of 8 , Jul 4 9:42 PM
                    On Fri, Jul 4, 2014 at 3:30 PM, HOGG Maynard <maynard.hogg@...> wrote:
                    > Commenting out the second sb.append() call would do it for Unified Han users.

                    Make that “would be a step forward.” Going beyond an “80%” solution
                    gets complicated.

                    > I've changed my mind.

                    > ASCII before and after: space

                    Future issue: Lines ending with a hyphen. Hard? soft? minus sign?
                    dash? other? And that's just for English.

                    > Mixed: User's choice

                    Leaving spaces doesn't seem to affect glossary matching, i.e., a
                    UHan/ASCII boundary has a much higher probability of being a word
                    boundary.

                    In my case. Perhaps readers can provide examples like 抗histamine
                    (抗ヒスタミン) for antihistamine.

                    > UHan before and after: no space

                    Little problem here. “Japanese doesn't have words. Like Chinese, it's
                    written as a continuous stream of characters...”

                    BUT....

                    While I was on my way to the supermarket, I remembered the pesky
                    Japanese rule (sarcasm) against double delimiters. Here, ます or です or
                    ません at the end of the line indicates the end of a sentence, i.e., the 。
                    or whatever the author uses between sentences is optional. Figuring
                    out whether this break is one between paragraphs (\n) or between
                    sentences (。or .) is far beyond the scope of this filter, so I vote
                    for \n.

                    Alternatively, there could be another user-specified option for "line
                    break appears to be sentence break": \n, 。, ., 。\n, .\n, whatever.

                    Note: The above examples are just the three most frequently occurring
                    sentence endings and are *unambiguously* so. Other candidates are
                    either too literary for ordinary work or flexible enough to appear in
                    the middle of sentences. People who use ものとする know that the 。(never .)
                    is not optional.
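
A rough Java sketch of the sentence-ending check described above, offered only as an illustration. The names `SentenceEndGuesser`, `looksLikeSentenceEnd`, and `replacementFor` are hypothetical, the list holds just the three endings named in the message, and the replacement string stands in for the proposed user-specified option.

```java
// Hypothetical helper, not part of OmegaT: guesses whether a line break
// after a Japanese line is really a sentence break.
final class SentenceEndGuesser {

    // Only the three unambiguous endings mentioned in the message.
    private static final String[] ENDINGS = {"ます", "です", "ません"};

    static boolean looksLikeSentenceEnd(String line) {
        String trimmed = line.trim();
        for (String ending : ENDINGS) {
            if (trimmed.endsWith(ending)) {
                return true;
            }
        }
        return false;
    }

    // Returns the user-chosen sentence break (for example "\n" or "。")
    // when the line looks like a sentence end, and an empty string otherwise.
    static String replacementFor(String line, String sentenceBreak) {
        return looksLikeSentenceEnd(line) ? sentenceBreak : "";
    }
}
```

Lines that already end in 。 fall through to the empty-string case here, so a real filter would need to handle explicit punctuation separately.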