
Re: mbyte.c patch

  • Bram Moolenaar
    Message 1 of 14 , Jul 14, 2002
      Glenn Maynard wrote:

      > Attached is a mostly cosmetic patch to mbyte.c. None of this affects
      > the clipboard patch, so I've separated it out.

      Looks like a good bunch of changes. I'll include the patch.

      Question: You changed UTF-16 to UCS-2 in several places. Are you sure
      this is correct? I thought that MS-Windows does use UTF-16. For the
      code it doesn't really matter, I suppose, since it's still using
      two-byte words.

      --
      DENNIS: Look, strange women lying on their backs in ponds handing out
      swords ... that's no basis for a system of government. Supreme
      executive power derives from a mandate from the masses, not from some
      farcical aquatic ceremony.
      "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

      /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
      /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
      \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
      \\\ Help me helping AIDS orphans in Uganda - http://iccf-holland.org ///
    • Glenn Maynard
      Message 2 of 14 , Jul 14, 2002
        On Sun, Jul 14, 2002 at 12:51:38PM +0200, Bram Moolenaar wrote:
        > Question: You changed UTF-16 to UCS-2 in several places. Are you sure
        > this is correct? I thought that MS-Windows does use UTF-16.

        Hmm. You're probably right; thinking about it, I'm really not sure.

        > For the
        > code it doesn't really matter, I suppose, since it's still using
        > two-byte words.

        For this code, it shouldn't, since it's only used for the IME which
        probably won't generate anything that'll translate outside of the BMP.
        (Actually, is there anything at all within the regular Windows codepages
        that's outside the BMP?) Still, it should be commented correctly to
        avoid confusion later on.

        Also, the code setting up the IME converters:

        convert_setup(&ime_conv, "ucs-2", p_enc);
        ime_conv_cp.vc_type = CONV_DBCS_TO_UCS2;
        ime_conv_cp.vc_dbcs = GetACP();
        ime_conv_cp.vc_factor = 2; /* we don't really know anything about the codepage */

        This works if p_enc is Unicode or latin1 (due to the special cases), but
        I don't think it'll work if it's anything else (e.g. cp932), since it'll
        fall back on iconv and that'll force "UCS-2" to "UTF-8".

        Now, setting encoding to anything but UTF-8 or latin1 currently doesn't
        work anyway: it doesn't render correctly. Is that intended to work?
        If not, encoding should probably reject other settings. It'd simplify
        a lot of things if the internal encoding was always UTF-8 in Windows.
        I think you mentioned this idea before.

        --
        Glenn Maynard
      • Bram Moolenaar
        Message 3 of 14 , Jul 14, 2002
          Glenn Maynard wrote:

          > On Sun, Jul 14, 2002 at 12:51:38PM +0200, Bram Moolenaar wrote:
          > > Question: You changed UTF-16 to UCS-2 in several places. Are you sure
          > > this is correct? I thought that MS-Windows does use UTF-16.
          >
          > Hmm. You're probably right; thinking about it, I'm really not sure.
          >
          > > For the
          > > code it doesn't really matter, I suppose, since it's still using
          > > two-byte words.
          >
          > For this code, it shouldn't, since it's only used for the IME which
          > probably won't generate anything that'll translate outside of the BMP.
          > (Actually, is there anything at all within the regular Windows codepages
          > that's outside the BMP?) Still, it should be commented correctly to
          > avoid confusion later on.

          As far as I know, MS-Windows only supports UCS-2 so far, but since they
          finally discovered that 16 bits is not enough (just like 640 Kbyte wasn't
          enough! :-), they found the UTF-16 hack to work around it. A really
          ugly solution compared to UTF-8. I don't know how much of MS-Windows
          currently actually supports UTF-16.

          > Also, the code setting up the IME converters:
          >
          > convert_setup(&ime_conv, "ucs-2", p_enc);
          > ime_conv_cp.vc_type = CONV_DBCS_TO_UCS2;
          > ime_conv_cp.vc_dbcs = GetACP();
          > ime_conv_cp.vc_factor = 2; /* we don't really know anything about the codepage */
          >
          > This works if p_enc is Unicode or latin1 (due to the special cases), but
          > I don't think it'll work if it's anything else (e.g. cp932), since it'll
          > fall back on iconv and that'll force "UCS-2" to "UTF-8".

          Hmm, the call to iconv probably has to be fixed for this. Sounds like
          we need an extra flag in vimconv_T that indicates whether any Unicode
          is handled as UTF-8 or not. Perhaps this could also be handled when
          filling vimconv_T; then we wouldn't need the flag.

          > Now, setting encoding to anything but UTF-8 or latin1 currently doesn't
          > work anyway: it doesn't render correctly. Is that intended to work?
          > If not, encoding should probably reject other settings. It'd simplify
          > a lot of things if the internal encoding was always UTF-8 in Windows.
          > I think you mentioned this idea before.

          Setting 'encoding' to some Asian codepage should certainly work. Korean
          and Japanese users couldn't work without this.

          Unfortunately we can't drop all kinds of encodings and use UTF-8,
          conversion from/to Unicode will not always be possible. There is the
          famous yen vs backslash problem, for example.

          --
          BLACK KNIGHT: I'm invincible!
          ARTHUR: You're a looney.
          "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

          /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
          /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
          \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
          \\\ Help me helping AIDS orphans in Uganda - http://iccf-holland.org ///
        • Glenn Maynard
          Message 4 of 14 , Jul 14, 2002
            gui_w32 uses ETO_IGNORELANGUAGE in NT; the only doc I can find for this
            flag says:

            "Reserved for system use. If an application sets this flag, it loses
            international scripting support and in some cases it may display no text
            at all."

            There's nothing to say why this flag was used, and it's not used
            consistently. Does anyone know?

            On Sun, Jul 14, 2002 at 09:46:56PM +0200, Bram Moolenaar wrote:
            > Setting 'encoding' to some Asian codepage should certainly work. Korean
            > and Japanese users couldn't work without this.

            Aha. The HanExtTextOut code breaks this. If I comment that out, it
            ends up coming to the "is_funky_dbcs" renderer, which renders correctly
            via Unicode.

            I don't see why they can't work without it--there should be no
            functional difference if encoding=utf-8. Can you (or any CJK user
            present) give an example? The "is_funky_dbcs" code should work fine in
            all Win32. (The only thing that wouldn't work right now is the
            alternate HanExtTextOut stuff, but that can be done in Unicode too, and
            needs fixing anyway.)

            > Unfortunately we can't drop all kinds of encodings and use UTF-8,
            > conversion from/to Unicode will not always be possible.

            If there was no way to convert the input to Unicode, then there would be no
            way to render it.

            The only two formats you can render in Windows are Unicode text and
            ANSI-codepage text (ExtTextOutW and ExtTextOut, respectively)--and you
            can always convert from the ANSI codepage to Unicode via
            MultiByteToWideChar. MS has been helpful (for once) in making both of
            these functions available in Win9x.

            The only other part is file I/O--if iconv isn't available, MBToWC and
            WCToMB need to be used when reading and writing files if the file
            encoding is a codepage. I don't know if this is currently done.

            > There is the
            > famous yen vs backslash problem, for example.

            Actually, that's not related.

            The yen/backslash problem is mostly related to external data; how to
            migrate from CP932 to Unicode, and interoperability with CP932 files
            on other systems.

            There's no difference, however, between rendering 0x5C with ExtTextOut
            and U+005C with ExtTextOutW; if the yen/backslash problem exists, it'll
            be the same with both.

            I notice there's a small hack in HanExtTextOut to render 0x5C as a
            backslash in all DBCS locales, which only works in a specific font size.
            This would need to be rewritten--but it needs to be rewritten anyway,
            since it 1: breaks "is_funky_dbcs", 2: kicks in when it shouldn't
            (it doesn't apply to all DBCS), 3: can be done better, by flipping a
            forward slash vertically, and 4: should be optional, as some Japanese
            users will *want* 0x5C to be a yen symbol. This is the same thing I
            mentioned a while back, and something I still intend to implement
            (port, really, from Putty). I'd also need to find out exactly when the
            partially-composed-hangul code kicks in, since it's not happening with
            the Win2K IME.

            --
            Glenn Maynard
          • Glenn Maynard
            Message 5 of 14 , Jul 14, 2002
              To amend a bit:

              The funky_dbcs code works, but it's incomplete: padding isn't specified,
              so bold and italic characters desync the output a bit, and len isn't
              set, so underlining is wrong. I'm fixing this.

              --
              Glenn Maynard
            • Ron Aaron
              Message 6 of 14 , Jul 14, 2002
                Glenn Maynard <glenn@...> writes:
                >gui_w32 uses ETO_IGNORELANGUAGE in NT; the only doc I can find for this
                >flag says:
                >
                >"Reserved for system use. If an application sets this flag, it loses
                >international scripting support and in some cases it may display no text
                >at all."
                >
                >There's nothing to say why this flag was used, and it's not used
                >consistently. Does anyone know?

                Yes. If the ETO_IGNORELANGUAGE flag is set on NT (2K, XP, etc) then we tell
                the OS not to be "helpful" when outputting characters which are in Hebrew or
                Arabic fonts. Without this flag, vim displays Hebrew and Arabic incorrectly
                on the new NT systems. This is because of the helpful, helpful "Uniscribe"
                APIs, which "do what you mean" according to MS.

                Setting this flag for the '9x series of Windows (95, 98, Me) doesn't work and
                indeed inhibits displaying multilanguage documents correctly.

                So there is a different, much more painful, code path for the 9x series vs.
                the NT series.

                Hope this helps,
                Ron
              • Glenn Maynard
                Message 7 of 14 , Jul 14, 2002
                  On Sun, Jul 14, 2002 at 03:58:47PM -0700, Ron Aaron wrote:
                  > Yes. If the ETO_IGNORELANGUAGE flag is set on NT (2K, XP, etc) then we tell
                  > the OS not to be "helpful" when outputting characters which are in Hebrew or
                  > Arabic fonts. Without this flag, vim displays Hebrew and Arabic incorrectly
                  > on the new NT systems. This is because of the helpful, helpful "Uniscribe"
                  > APIs, which "do what you mean" according to MS.
                  >
                  > Setting this flag for the '9x series of Windows (95, 98, Me) doesn't work and
                  > indeed inhibits displaying multilanguage documents correctly.
                  >
                  > So there is a different, much more painful, code path for the 9x series vs.
                  > the NT series.

                  Yeah, I just figured that out. I'm trying to figure out exactly what
                  works, and why.

                  For example, this isn't done for UTF-8. The ETO_IGNORELANGUAGE flag is
                  still set, but the 9x special case needs to be done for Unicode, too.
                  (Don't forget, internal Unicode support works fine in Win9X.) It could
                  be done without relying on the codepage by scanning for characters in
                  the Hebrew range. U+0590-U+05FF? (Looks like Windows also uses
                  U+FB20-FB4F. I wonder if there's any way to query this reliably ...)

                  encoding=cp1255 doesn't set up as DBCS. (It's not a DBCS, but it needs
                  the same code path.) It should enter the "is_funky_dbcs" code, but instead
                  it's passed literally to ExtTextOut. As a result, "encoding=cp1255" only
                  works on systems whose actual ACP is 1255. (I think all the "DBCS" code
                  should really be called "Windows codepage" code, since it's applicable to
                  all codepages, not just DBCS.)

                  It renders fine if I convince it to treat cp1255 as a DBCS encoding, but
                  I don't know if that might have other side-effects.

                  --
                  Glenn Maynard
                • Bram Moolenaar
                  Message 8 of 14 , Jul 15, 2002
                    Glenn Maynard wrote:

                    > On Sun, Jul 14, 2002 at 09:46:56PM +0200, Bram Moolenaar wrote:
                    > > Setting 'encoding' to some Asian codepage should certainly work. Korean
                    > > and Japanese users couldn't work without this.
                    >
                    > Aha. The HanExtTextOut code breaks this. If I comment that out, it
                    > ends up coming to the "is_funky_dbcs" renderer, which renders correctly
                    > via Unicode.

                    It seems defining FEAT_MBYTE_IME changes the method used for outputting
                    DBCS text. That is not a good thing, deciding to use HanExtTextOut()
                    should also be based on runtime conditions. Perhaps checking
                    "is_funky_dbcs" will help? Perhaps curwin->w_p_rl also needs to be
                    checked? Hmm, HanExtTextOut() itself just calls ExtTextOut() when the
                    character size is different from the "sysfixed" size. That suggests
                    this check can be moved to before calling HanExtTextOut().

                    > I don't see why they can't work without it--there should be no
                    > functional difference if encoding=utf-8. Can you (or any CJK user
                    > present) give an example? The "is_funky_dbcs" code should work fine in
                    > all Win32. (The only thing that wouldn't work right now is the
                    > alternate HanExtTextOut stuff, but that can be done in Unicode too, and
                    > needs fixing anyway.)

                    Don't trust what "should work" on MS-Windows! :-) Real user experience
                    overrules everything else. Therefore every change must be tested in
                    several conditions. Unfortunately, the code doesn't always say why it
                    was needed...

                    > > Unfortunately we can't drop all kinds of encodings and use UTF-8,
                    > > conversion from/to Unicode will not always be possible.
                    >
                    > If there was no way to convert the input to Unicode, then there would be no
                    > way to render it.
                    >
                    > The only two formats you can render in Windows are Unicode text and
                    > ANSI-codepage text (ExtTextOutW and ExtTextOut, respectively)--and you
                    > can always convert from the ANSI codepage to Unicode via
                    > MultiByteToWideChar. MS has been helpful (for once) in making both of
                    > these functions available in Win9x.

                    But do they fully work? I don't know the details, but there are some
                    problems with conversion from/to Unicode to/from MS-Windows codepages.
                    You would lose information when using the clipboard, for example.

                    > The only other part is file I/O--if iconv isn't available, MBToWC and
                    > WCToMB need to be used when reading and writing files if the file
                    > encoding is a codepage. I don't know if this is currently done.

                    No, but it is one of the things to implement to avoid having to install
                    iconv.

                    > > There is the famous yen vs backslash problem, for example.
                    >
                    > Actually, that's not related.

                    I'm not sure. I know quite a few people don't like iconv, because it
                    does the conversion by the book instead of how the user expects it to
                    be done.

                    Anyway, the problem with conversions always is the risk of losing
                    information. Unicode was designed to avoid that, but implementation
                    problems may still cause trouble. Avoiding conversion avoids potential
                    trouble. Perhaps I'm paranoid...

                    > I notice there's a small hack in HanExtTextOut to render 0x5C as a
                    > backslash in all DBCS locales, which only works in a specific font size.
                    > This would need to be rewritten--but it needs to be rewritten anyway,
                    > since it 1: breaks "is_funky_dbcs", 2: kicks in when it shouldn't
                    > (it doesn't apply to all DBCS), 3: can be done better, by flipping a
                    > forward slash vertically, and 4: should be optional, as some Japanese
                    > users will *want* 0x5C to be a yen symbol. This is the same thing I
                    > mentioned a while back, and something I still intend to implement
                    > (port, really, from Putty). I'd also need to find out exactly when the
                    > partially-composed-hangul code kicks in, since it's not happening with
                    > the Win2K IME.

                    We might as well do nothing. Most Japanese users are used to seeing the
                    yen symbol in the place of a backslash anyway.

                    --
                    Team-building exercises come in many forms but they all trace their roots back
                    to the prison system. In your typical team-building exercise the employees
                    are subjected to a variety of unpleasant situations until they become either a
                    cohesive team or a ring of car jackers.
                    (Scott Adams - The Dilbert principle)

                    /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
                    /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
                    \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
                    \\\ Help me helping AIDS orphans in Uganda - http://iccf-holland.org ///
                  • Glenn Maynard
                    Message 9 of 14 , Jul 15, 2002
                      On Mon, Jul 15, 2002 at 10:32:44PM +0200, Bram Moolenaar wrote:
                      > It seems defining FEAT_MBYTE_IME changes the method used for outputting
                      > DBCS text. That is not a good thing, deciding to use HanExtTextOut()
                      > should also be based on runtime conditions. Perhaps checking
                      > "is_funky_dbcs" will help? Perhaps curwin->w_p_rl also needs to be
                      > checked? Hmm, HanExtTextOut() itself just calls ExtTextOut() when the
                      > character size is different from the "sysfixed" size. That suggests
                      > this check can be moved to before calling HanExtTextOut().

                      That's what I ended up doing.

                      However, notice that one of the HanExtTextOut calls only does anything when
                      ImeGetTempComposition returns non-NULL, and that only happens when
                      bInComposition is TRUE.

                      01:35am glenn@.../2 [~/vim/src] grep bInComposition *.c
                      gui_w32.c:static BOOL bInComposition=FALSE;
                      gui_w32.c: if (bInComposition == TRUE)

                      It looks like this was broken at some point. It's apparently written to
                      try to display a partially-composed Korean character with the Korean IME
                      in Windows.

                      I don't remember if this code predates the newer IME code; it might no
                      longer be necessary, now that the IME is being set up correctly.
                      (Perhaps the fact that it's not working and nobody's complained is
                      evidence?)

                      > We might as well do nothing. Most Japanese users are used to seeing the
                      > yen symbol in the place of a backslash anyway.

                      I view Japanese text in Vim daily, and I don't like seeing a backslash
                      there.

                      The backslash-fix code is a trivial addition to my font-repair code, which
                      has other uses; in particular, fixing Unicode quotes in Japanese fonts.
                      So I, at least, don't mind if the current backslash fix code goes away, since
                      I have a replacement for it; and the current code is extremely special case
                      (system-font-size only?), to the point of not really being useful.
                      (And, like you said, if it *does* kick in, some users probably won't
                      want it.)

                      > But do they fully work? I don't know the details, but there are some
                      > problems with conversion from/to Unicode to/from MS-Windows codepages.
                      > You would lose information when using the clipboard, for example.

                      If you convert with iconv you'll have problems. For example, if you edit
                      a CP932 file by converting it with iconv on load, and then send something
                      to the clipboard, and another application pastes that back in CP932 using
                      the system conversions (MBtoWC), you'll lose data since the conversions
                      aren't identical. Round-trip fails.

                      The solution to that is to use the system's conversions instead of iconv
                      for codepages. You're always using iconv if it's available (in fileio),
                      and never using MultiByteToWideChar in Windows. I'd blame Unicode problems
                      in Windows on that.

                      The only way I can see system conversions causing problems is if round-trip
                      conversion fails. I don't know of any cases of this, but if anyone does I'd
                      definitely like to know.

                      > Anyway, the problem with conversions always is the risk of losing
                      > information. Unicode was designed to avoid that, but implementation
                      > problems may still cause trouble. Avoiding conversion avoids potential
                      > trouble. Perhaps I'm paranoid...

                      > Don't trust what "should work" on MS-Windows! :-) Real user experience
                      > overrules everything else. Therefore every change must be tested in
                      > several conditions. Unfortunately, the code doesn't always say why it
                      > was needed...

                      Well, it's right to be paranoid in Windows. That's why we have this mailing
                      list. :)

                      I'll work on making fileio use MBtoWC for codepage<->Unicode conversion
                      when possible. Even if you leave encodings as is, this will help.

                      --
                      Glenn Maynard
                    • Bram Moolenaar
                      Message 10 of 14 , Jul 17, 2002
                        Glenn Maynard wrote:

                        > However, notice that one of the HanExtTextOut calls only does anything when
                        > ImeGetTempComposition returns non-NULL, and that only happens when
                        > bInComposition is TRUE.
                        >
                        > 01:35am glenn@.../2 [~/vim/src] grep bInComposition *.c
                        > gui_w32.c:static BOOL bInComposition=FALSE;
                        > gui_w32.c: if (bInComposition == TRUE)
                        >
                        > It looks like this was broken at some point. It's apparently written to
                        > try to display a partially-composed Korean character with the Korean IME
                        > in Windows.
                        >
                        > I don't remember if this code predates the newer IME code; it might no
                        > longer be necessary, now that the IME is being set up correctly.
                        > (Perhaps the fact that it's not working and nobody's complained is
                        > evidence?)

                        This indeed looks very broken. ImeGetTempComposition() will always
                        return NULL. It already was like this in 6.0. I didn't hear specific
                        complaints about this. I suppose this means we might as well remove
                        this code.

                        > > We might as well do nothing. Most Japanese users are used to seeing the
                        > > yen symbol in the place of a backslash anyway.
                        >
                        > I view Japanese text in Vim daily, and I don't like seeing a backslash
                        > there.

                        Oh yes, the other way around it's not so nice.

                        > The backslash-fix code is a trivial addition to my font-repair code, which
                        > has other uses; in particular, fixing Unicode quotes in Japanese fonts.
                        > So I, at least, don't mind if the current backslash fix code goes
                        > away, since I have a replacement for it; and the current code is
                        > extremely special case (system-font-size only?), to the point of not
                        > really being useful. (And, like you said, if it *does* kick in, some
                        > users probably won't want it.)

                        It even uses floating point numbers. Let's get rid of this.

                        > The solution to that is to use the system's conversions instead of iconv
                        > for codepages. You're always using iconv if it's available (in fileio),
                        > and never using MultiByteToWideChar in Windows. I'd blame Unicode problems
                        > in Windows on that.
                        >
                        > The only way I can see system conversions causing problems is if round-trip
                        > conversion fails. I don't know of any cases of this, but if anyone does I'd
                        > definitely like to know.

                        Perhaps it goes wrong when the codepage is set wrong? This is
                        especially likely for 8-bit codepages, where you probably don't notice
                        the mistake if you do use the right font. Conversion to Unicode will
                        reveal the problem.

                        > I'll work on making fileio use MBtoWC for codepage<->Unicode conversion
                        > when possible. Even if you leave encodings as is, this will help.

                        That is a good thing. More people will be able to use different
                        encodings "out of the box".

                        --
                        Engineers will go without food and hygiene for days to solve a problem.
                        (Other times just because they forgot.)
                        (Scott Adams - The Dilbert principle)

                        /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
                        /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
                        \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
                        \\\ Help me helping AIDS orphans in Uganda - http://iccf-holland.org ///
                      • Glenn Maynard
                        Message 11 of 14 , Jul 17, 2002
                          On Wed, Jul 17, 2002 at 12:54:07PM +0200, Bram Moolenaar wrote:
                          > This indeed looks very broken. ImeGetTempComposition() will always
                          > return NULL. It already was like this in 6.0. I didn't hear specific
                          > complaints about this. I suppose this means we might as well remove
                          > this code.

                          Alright. Patch attached:

                          gui_w32.c:
                          Remove HanExtTextOut, bInComposition, ImeGetTempComposition,
                          DisplayCompStringOpaque.

                          Removed a now-unneeded pair of braces from gui_mch_draw_string. I
                          didn't unindent the block of code between them, since that'd bloat the
                          patch with stuff that looks like changes but isn't. (I wish patch
                          could handle this better.) I'll let you do that.

                          Remaining problems in the renderer:

                          SBCS encodings should render with "is_funky_dbcs"; right now we'll get
                          garbage unless the encoding matches the system.

                          The Arabic/Hebrew hack isn't applied consistently. If we pass that range
                          of text to the renderer as a single block, we'll get weird results. Even
                          if RL support isn't enabled (or even compiled in) and we're in Unicode,
                          so the Hebrew text wouldn't look right to a Hebrew user anyway, we might
                          still be a regular user editing mixed-language text; we don't want weird
                          things happening.

                          len isn't set right in is_funky_dbcs; wide characters cause underlining
                          glitches.

                          Padding isn't set right in Unicode or is_funky_dbcs, so bold and italic
                          characters cause spacing glitches.

                          Also in this patch:

                          gui_w32.c:
                          Move padding computation to a function (set_padding).

                          Move ETO_IGNORELANGUAGE setting out of each individual call, to
                          make sure it's always used.

                          Set padding in Unicode.

                          (I'd have put these in a separate patch, but they're working on the same
                          block of code.)

                          This fixes bold/italics in Unicode, and simplifies the renderer a
                          little.

                          I think these problems are ultimately because there are too many code
                          paths. I'd suggest always converting the string to Unicode at the start
                          of the function. (At least in NT, it shouldn't be a speed hit, since I
                          believe the font API will convert to Unicode anyway.) The RL (RevOut)
                          special case needs to be done in Unicode before this will work. (Which
                          I have code for, but am holding off on in the interests of keeping the
                          patch down.)

                          If this was done, the rest of the above problems would just go away.

                          > > The only way I can see system conversions causing problems is if round-trip
                          > > conversion fails. I don't know of any cases of this, but if anyone does I'd
                          > > definitely like to know.
                          >
                          > Perhaps it goes wrong when the codepage is set wrong? This is
                          > especially for 8-bit codepages where you probably don't notice the
                          > mistake if you do use the right font. Conversion to Unicode will reveal
                          > the problem.

                          It'd cause problems with the clipboard, though.
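Whether a given codepage round-trips cleanly can be checked mechanically: convert codepage → Unicode → codepage and compare with the input. A minimal sketch, with Python's codecs standing in for MBtoWC/WCtoMB (the function is illustrative, not Vim code):

```python
def round_trips(data, codepage):
    """True if the bytes survive codepage -> Unicode -> codepage
    unchanged; False if the conversion fails or is lossy."""
    try:
        return data.decode(codepage).encode(codepage) == data
    except UnicodeError:
        return False
```

For example, byte 0x81 is unassigned in cp1252, so a file containing it would not survive the round trip.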

                          Hmm. I think the default encodings could be improved a bit. For
                          example, if a user tries to load a file, and it fails, they're likely
                          to first change "encoding". Of course, changing that alone isn't
                          correct, but it may happen to work. Or, they may change just
                          "fileencodings", which is closer, but that probably won't work, since
                          you can't convert most other encodings to latin1. You have to change two
                          values, and it takes a bit of reading to figure out exactly what to do.
                          (Enough that a lot of users are likely to get it wrong.)

                          First, I'd change the default "encoding" to UTF-8 in Windows. "latin1"
                          is only a reasonable default if the system encoding happens to be one
                          that's like latin1; UTF-8 is almost always a better default (especially
                          since most users should be able to leave it alone.)

                          Second, I'd change the default Unicode fencs from "ucs-bom,utf-8,latin1"
                          to "ucs-bom,utf-8,CP####" in Windows. Again, latin1 is only a
                          reasonable default for latin1 users; setting it intelligently should
                          work for a lot more people without changes. (I think this should let
                          most people edit locally-encoded files without having to touch encoding-
                          related settings at all.)

                          > > I'll work on making fileio use MBtoWC for codepage<->Unicode conversion
                          > > when possible. Even if you leave encodings as is, this will help.

                          Attached.

                          This is straightforward, except for one thing: MBtoWC and WCtoMB are bad
                          at handling streamed data. They have no way to cleanly handle a final
                          incomplete sequence.

                          I've dealt with this by testing both; if converting the whole string
                          doesn't work, bump the last character into the save buffer and try
                          again. (It doesn't actually convert, so it shouldn't be too slow, but
                          if it is I can optimize by doing this test manually.)
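The retry strategy described above can be sketched as follows — an illustration in Python with UTF-8 standing in for the Win32 conversion, since MultiByteToWideChar likewise fails outright on a buffer that ends in a truncated sequence (function name and shape are illustrative, not the actual fileio.c code):

```python
def convert_chunk(chunk, save=b""):
    """Convert one chunk of a byte stream, holding back a trailing
    incomplete sequence in `save` for the next chunk."""
    data = save + chunk
    # Try the whole buffer first; on failure bump trailing bytes into
    # the save buffer and retry (a UTF-8 sequence is at most 4 bytes).
    for cut in range(len(data), max(len(data) - 4, 0) - 1, -1):
        try:
            return data[:cut].decode("utf-8"), data[cut:]
        except UnicodeDecodeError:
            pass
    raise ValueError("invalid byte sequence")
```

The actual patch lets MBtoWC validate the full string and retries with the last character held back; the loop above plays the same trick with Python's decoder.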

                          Once this is tested, along with the clipboard, we might try asking any
                          resident CJK people to try working in encoding=UTF-8 and tell us if they
                          notice any problems (or if any problems are fixed.)

                          That's probably the best way to see if using UTF-8 will cause problems:
                          fix the known ones (which I'm doing) and then ask people to try it.
                          Drop the theoreticals. :)

                          --
                          Glenn Maynard
                        • Bram Moolenaar
                            ... Ehm, this removes HanExtTextOut() completely. I'm not sure if that is a good idea, it wasn't added for fun. But it should certainly be called in a
                          Message 12 of 14 , Jul 18, 2002
                            Glenn Maynard wrote:

                            > On Wed, Jul 17, 2002 at 12:54:07PM +0200, Bram Moolenaar wrote:
                            > > This indeed looks very broken. ImeGetTempComposition() will always
                            > > return NULL. It already was like this in 6.0. I didn't hear specific
                            > > complaints about this. I suppose this means we might as well remove
                            > > this code.
                            >
                            > Alright. Patch attached:
                            >
                            > gui_w32.c:
                            > Remove HanExtTextOut, bInComposition, ImeGetTempComposition,
                            > DisplayCompStringOpaque.
                            >
                            > Removed a now-unneeded pair of braces from gui_mch_draw_string. I
                            > didn't unindent the block of code between them, since that'd bloat the
                            > patch with stuff that looks like changes but isn't. (I wish patch
                            > could handle this better.) I'll let you do that.

                            Ehm, this removes HanExtTextOut() completely. I'm not sure if that is a
                            good idea, it wasn't added for fun. But it should certainly be called in
                            a different way, and ImeGetTempComposition() can be deleted. The check
                            for the sysfixed size should be done before calling HanExtTextOut(), and
                            the call to ExtTextOut() inside HanExtTextOut() removed. This is a lot
                            safer than deleting that code without knowing why it was there.

                            > Remaining problems in the renderer:
                            >
                            > SBCS encodings should render with "is_funky_dbcs"; right now we'll get
                            > garbage unless the encoding matches the system.

                            I suppose the flag should be "non_system_codepage", which is set when
                            using an 'encoding' different from the active codepage. I suppose it's
                            not useful for Unicode values of 'encoding' though.

                            I can't evaluate all the other remarks. I hope the people who worked on
                            this code can respond. There is a lot of trial-and-error stuff in here,
                            unfortunately.

                            > I think these problems are ultimately because there are too many code
                            > paths. I'd suggest always converting the string to Unicode at the start
                            > of the function. (At least in NT, it shouldn't be a speed hit, since I
                            > believe the font API will convert to Unicode anyway.) The RL (RevOut)
                            > special case needs to be done in Unicode before this will work. (Which
                            > I have code for, but am holding off on in the interests of keeping the
                            > patch down.)

                            That's quite a drastic change. I'm not really sure it's worth taking
                            the risk.

                            > > Perhaps it goes wrong when the codepage is set wrong? This is
                            > > especially for 8-bit codepages where you probably don't notice the
                            > > mistake if you do use the right font. Conversion to Unicode will reveal
                            > > the problem.
                            >
                            > It'd cause problems with the clipboard, though.

                            Unless the program on the other side has the same (reverse) problem.

                            > Hmm. I think the default encodings could be improved a bit. For
                            > example, if a user tries to load a file, and it fails, they're likely
                            > to first change "encoding". Of course, changing that alone isn't
                            > correct, but it may happen to work. Or, they may change just
                            > "fileencodings", which is closer, but that probably won't work, since
                            > you can't convert most other encodings to latin1. You have to change two
                            > values, and it takes a bit of reading to figure out exactly what to do.
                            > (Enough that a lot of users are likely to get it wrong.)

                            Changing 'encoding' should be discouraged, because it invalidates text
                            in other buffers, registers, etc. Best is to use something like:

                            :edit ++enc=cp123 file

                            Setting 'fileencodings' to a good value would also work. This mostly
                            requires a Unicode value for 'encoding'.

                            > First, I'd change the default "encoding" to UTF-8 in Windows. "latin1"
                            > is only a reasonable default if the system encoding happens to be one
                            > that's like latin1; UTF-8 is almost always a better default (especially
                            > since most users should be able to leave it alone.)

                            The problem with using "utf-8" as a default is that a new file will get
                            this encoding, which is probably not what people expect. I think a new
                            file should probably use the active codepage as a default, unless the
                            user has selected something else.

                            Also don't forget that what we call "latin1" is actually any 8-bit
                            encoding. It doesn't have to be the right name, it only matters when
                            doing conversions. Quite often we don't actually know what encoding is
                            being used and fall back to using latin1. This applies more to Unix
                            than MS-Windows though, since MS-Windows supplies a function to obtain
                            the active codepage.

                            > Second, I'd change the default Unicode fencs from "ucs-bom,utf-8,latin1"
                            > to "ucs-bom,utf-8,CP####" in Windows. Again, latin1 is only a
                            > reasonable default for latin1 users; setting it intelligently should
                            > work for a lot more people without changes. (I think this should let
                            > most people edit locally-encoded files without having to touch encoding-
                            > related settings at all.)

                            Same issue. Instead of "latin1" the active codepage could be used, if
                            it is an 8-bit encoding. Otherwise it must be "latin1", because we
                            always need to fall back to an 8-bit encoding.

                            > > > I'll work on making fileio use MBtoWC for codepage<->Unicode conversion
                            > > > when possible. Even if you leave encodings as is, this will help.
                            >
                            > Attached.

                            I'll look into this later.

                            --
                            You have heard the saying that if you put a thousand monkeys in a room with a
                            thousand typewriters and waited long enough, eventually you would have a room
                            full of dead monkeys.
                            (Scott Adams - The Dilbert principle)

                            /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
                            /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
                            \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
                            \\\ Help me helping AIDS orphans in Uganda - http://iccf-holland.org ///
                          • Glenn Maynard
                              ... Well, there are two parts to it: One's the fake-backslash code. I don't think that's worth keeping as is, since it's very special-case, and like you said,
                            Message 13 of 14 , Jul 18, 2002
                              On Thu, Jul 18, 2002 at 10:02:00PM +0200, Bram Moolenaar wrote:
                              > Ehm, this removes HanExtTextOut() completely. I'm not sure if that is a
                              > good idea, it wasn't added for fun. But it should certainly be called in
                              > a different way, and ImeGetTempComposition() can be deleted. The check
                              > for the sysfixed size should be done before calling HanExtTextOut(), and
                              > the call to ExtTextOut() inside HanExtTextOut() removed. This is a lot
                              > safer than deleting that code without knowing why it was there.

                              Well, there are two parts to it:

                              One's the fake-backslash code. I don't think that's worth keeping as
                              is, since it's very special-case, and like you said, most Japanese users
                              don't mind (in fact, expect) the Yen sign. (I do have a replacement,
                              since *I* mind the Yen sign, but it's not urgent.)

                              The important one is the Korean IME code. I'm not sure if that's needed at
                              all. I'm *guessing* that originally, the partially-composed characters
                              weren't being displayed, and this was hacked in. However, it's wrong; the
                              right way is to tell the IME where the cursor is so it renders it. This is
                              being done now, due to the IME patch, so I believe this is completely obsolete.

                              Of course, that's a guess. Could we contact whoever wrote this patch
                              and ask them if the current Vim release is working for him in the IME?
                              (Searching the archive for Sung-Hoon Baek didn't help.)

                              > I suppose the flag should be "non_system_codepage", which is set when
                              > using an 'encoding' different from the active codepage. I suppose it's
                              > not useful for Unicode values of 'encoding' though.

                              I'd say "not_system_codepage". (Unicode is not a system codepage; it's not
                              a non-system-codepage.)

                              > > I think these problems are ultimately because there are too many code
                              > > paths. I'd suggest always converting the string to Unicode at the start
                              > > of the function. (At least in NT, it shouldn't be a speed hit, since I
                              >
                              > That's quite a drastic change. I'm not really sure it's worth taking
                              > the risk.

                              It could be done piecemeal over a number of releases. However, if
                              encoding can default to UTF-8, then all that's really important is
                              getting UTF-8 rendering right in all cases, and that's something that
                              should be done anyway.

                              > The problem with using "utf-8" as a default is that a new file will get
                              > this encoding, which is probably not what people expect. I think a new
                              > file should probably use the active codepage as a default, unless the
                              > user has selected something else.

                              New files get the "fileencoding" encoding, right? Make fileencoding do
                              this, too, then.

                              > Also don't forget that what we call "latin1" is actually any 8-bit
                              > encoding. It doesn't have to be the right name, it only matters when
                              > doing conversions. Quite often we don't actually know what encoding is
                              > being used and fall back to using latin1. This applies more to Unix
                              > than MS-Windows though, since MS-Windows supplies a function to obtain
                              > the active codepage.

                              However, conversions are important. It's pretty confusing if Vim is
                              appearing to load and display a file correctly, but sends junk to the
                              clipboard because the conversion was done incorrectly.

                              > Same issue. Instead of "latin1" the active codepage could be used, if
                              > it is an 8-bit encoding. Otherwise it must be "latin1", because we
                              > always need to fall back to an 8-bit encoding.

                              What about fencs="ucs-bom,utf-8,CP####,latin1"?
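The proposed value could be assembled at startup from the active codepage; a hypothetical Python sketch (Vim would build the string in C from GetACP()):

```python
def default_fileencodings(acp):
    """Build the proposed 'fileencodings' default from the active
    codepage number (what GetACP() returns on Windows)."""
    # latin1 goes last: Vim always needs an 8-bit encoding that cannot
    # fail, and codepages like cp1252 reject some byte values.
    return ",".join(["ucs-bom", "utf-8", "cp%d" % acp, "latin1"])
```

On a Japanese system (ACP 932) this would yield "ucs-bom,utf-8,cp932,latin1".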

                              --
                              Glenn Maynard