Loading ...
Sorry, an error occurred while loading the content.
 

Re: mbyte.c patch

Expand Messages
  • Glenn Maynard
    On Wed, Jul 17, 2002 at 12:54:07PM +0200, Bram Moolenaar wrote: This indeed looks very broken. ImeGetTempComposition() will always return NULL. It
    Message 1 of 14 , Jul 17, 2002
      On Wed, Jul 17, 2002 at 12:54:07PM +0200, Bram Moolenaar wrote:
      > This indeed looks very broken. ImeGetTempComposition() will always
      > return NULL. It already was like this in 6.0. I didn't hear specific
      > complaints about this. I suppose this means we might as well remove
      > this code.

      Alright. Patch attached:

      gui_w32.c:
      Remove HanExtTextOut, bInComposition, ImeGetTempComposition,
      DisplayCompStringOpaque.

      Removed a now-unneeded pair of braces from gui_mch_draw_string. I
      didn't unindent the block of code between it, since that'd bloat the
      patch with stuff that looks like changes but isn't. (I wish patch
      could handle this better.) I'll let you do that.

      Remaining problems in the renderer:

      SBCS encodings should render with "is_funky_dbcs"; right now we'll get
      garbage unless the encoding matches the system.

      The Arabic/Hebrew hack isn't always done. If we pass that range of text
      to the renderer as a block, we'll get weird results. Even if we don't
      have RL support enabled (or even compiled), and we're in Unicode, and the
      Hebrew text wouldn't look right to a Hebrew user, we still might be a
      regular user editing mixed-language text; we don't want weird things
      happening.

      len isn't set right in is_funky_dbcs; wide characters cause underlining
      glitches.

      Padding isn't set right in Unicode or is_funky_dbcs, so bold and italic
      characters cause spacing glitches.

      Also in this patch:

      gui_w32.c:
      Move padding computation to a function (set_padding).

      Move ETO_IGNORELANGUAGE setting out of each individual call, to
      make sure it's always used.

      Set padding in Unicode.

      (I'd have put these in a separate patch, but they're working on the same
      block of code.)

      This fixes bold/italics in Unicode, and simplifies the renderer a
      little.

      I think these problems are ultimately because there are too many code
      paths. I'd suggest always converting the string to Unicode at the start
      of the function. (At least in NT, it shouldn't be a speed hit, since I
      believe the font API will convert to Unicode anyway.) The RL (RevOut)
      special case needs to be done in Unicode before this will work. (Which
      I have code for, but am holding off on in the interests of keeping the
      patch down.)

      If this was done, the rest of the above problems would just go away.

      > > The only way I can see system conversions causing problems is if round-trip
      > > conversion fails. I don't know of any cases of this, but if anyone does I'd
      > > definitely like to know.
      >
      > Perhaps it goes wrong when the codepage is set wrong? This is
      > especially for 8-bit codepages where you probably don't notice the
      > mistake if you do use the right font. Conversion to Unicode will reveal
      > the problem.

      It'd cause problems with the clipboard, though.

      Hmm. I think the default encodings could be improved a bit. For
      example, if a user tries to load a file, and it fails, they're likely
      to first change "encoding". Of course, changing that alone isn't
      correct, but it may happen to work. Or, they may change just
      "fileencodings", which is closer, but that probably won't work, since
      you can't convert most other encodings to latin1. You have to change two
      values, and it takes a bit of reading to figure out exactly what to do.
      (Enough that a lot of users are likely to get it wrong.)

      First, I'd change the default "encodings" to UTF-8 in Windows. "latin1"
      is only a reasonable default if the system encoding happens to be one
      that's like latin1; UTF-8 is almost always a better default (especially
      since most users should be able to leave it alone.)

      Second, I'd change the default Unicode fencs from "ucs-bom,utf-8,latin1"
      to "ucs-bom,utf-8,CP####" in Windows. Again, latin1 is only a
      reasonable default for latin1 users; setting it intelligently should
      work for a lot more people without changes. (I think this should let
      most people edit locally-encoded files without having to touch encoding-
      related settings at all.)

      > > I'll work on making fileio use MBtoWC for codepage<->Unicode conversion
      > > when possible. Even if you leave encodings as is, this will help.

      Attached.

      This is straightforward, except for one thing: MBtoWC and WCtoMB are bad
      at handling streamed data. It has no way to nicely handle a final
      incomplete sequence.

      I've dealt with this by testing both; if converting the whole string
      doesn't work, bump the last character into the save buffer and try
      again. (It doesn't actually convert, so it shouldn't be too slow, but
      if it is I can optimize by doing this test manually.)

      Once this is tested, along with the clipboard, we might try asking any
      resident CJK people to try working in encoding=UTF-8 and tell us if they
      notice any problems (or if any problems are fixed.)

      That's probably the best way to see if using UTF-8 will cause problems:
      fix the known ones (which I'm doing) and then ask people to try it.
      Drop the theoreticals. :)

      --
      Glenn Maynard
    • Bram Moolenaar
      ... Ehm, this removes HanExtTextOut() completely. I m not sure if that is a good idea, it wan t added for fun. But it should certainly be called in a
      Message 2 of 14 , Jul 18, 2002
        Glenn Maynard wrote:

        > On Wed, Jul 17, 2002 at 12:54:07PM +0200, Bram Moolenaar wrote:
        > > This indeed looks very broken. ImeGetTempComposition() will always
        > > return NULL. It already was like this in 6.0. I didn't hear specific
        > > complaints about this. I suppose this means we might as well remove
        > > this code.
        >
        > Alright. Patch attached:
        >
        > gui_w32.c:
        > Remove HanExtTextOut, bInComposition, ImeGetTempComposition,
        > DisplayCompStringOpaque.
        >
        > Removed a now-unneeded pair of braces from gui_mch_draw_string. I
        > didn't unindent the block of code between it, since that'd bloat the
        > patch with stuff that looks like changes but isn't. (I wish patch
        > could handle this better.) I'll let you do that.

        Ehm, this removes HanExtTextOut() completely. I'm not sure if that is a
        good idea, it wan't added for fun. But it should certainly be called in
        a different way, and ImeGetTempComposition() can be deleted. The check
        for the sysfixed size should be done before calling HanExtTextOut(), and
        the call to ExtTextOut() inside HanExtTextOut() removed. This is a lot
        safer than deleting that code without knowing why it was there.

        > Remaining problems in the renderer:
        >
        > SBCS encodings should render with "is_funky_dbcs"; right now we'll get
        > garbage unless the encoding matches the system.

        I suppose the flag should be "non_system_codepage", which is set when
        using an 'encoding' different from the active codepage. I suppose it's
        not useful for Unicode values of 'encoding' though.

        I can't overview all the other remarks. I hope the people who worked on
        this code can respond. There is a lot of trial-and-error stuff in here,
        unfortunately.

        > I think these problems are ultimately because there are too many code
        > paths. I'd suggest always converting the string to Unicode at the start
        > of the function. (At least in NT, it shouldn't be a speed hit, since I
        > believe the font API will convert to Unicode anyway.) The RL (RevOut)
        > special case needs to be done in Unicode before this will work. (Which
        > I have code for, but am holding off on in the interests of keeping the
        > patch down.)

        That's quite a drastic change. I'm not really sure it's worth taking
        the risc.

        > > Perhaps it goes wrong when the codepage is set wrong? This is
        > > especially for 8-bit codepages where you probably don't notice the
        > > mistake if you do use the right font. Conversion to Unicode will reveal
        > > the problem.
        >
        > It'd cause problems with the clipboard, though.

        Unless the program on the other side has the same (reverse) problem.

        > Hmm. I think the default encodings could be improved a bit. For
        > example, if a user tries to load a file, and it fails, they're likely
        > to first change "encoding". Of course, changing that alone isn't
        > correct, but it may happen to work. Or, they may change just
        > "fileencodings", which is closer, but that probably won't work, since
        > you can't convert most other encodings to latin1. You have to change two
        > values, and it takes a bit of reading to figure out exactly what to do.
        > (Enough that a lot of users are likely to get it wrong.)

        Changing 'encoding' should be discouraged, because it invalidates text
        in other buffers, registers, etc. Best is to use something like:

        :edit ++enc=cp123 file

        Setting 'fileencodings' to a good value would also work. This mostly
        requires a Unicode value for 'encoding'.

        > First, I'd change the default "encodings" to UTF-8 in Windows. "latin1"
        > is only a reasonable default if the system encoding happens to be one
        > that's like latin1; UTF-8 is almost always a better default (especially
        > since most users should be able to leave it alone.)

        The problem with using "utf-8" as a default is that a new file will get
        this encoding, which is probably not what people expect. I think a new
        file should probably use the active codepage as a default, unless the
        user has selected something else.

        Also don't forget that what we call "latin1" is actually any 8-bit
        encoding. It doesn't have to be the right name, it only matters when
        doing conversions. Quite often we don't actually know what encoding is
        being used and fall back to using latin1. This applies more to Unix
        than MS-Windows though, since MS-Windows supplies a function to obtain
        the active codepage.

        > Second, I'd change the default Unicode fencs from "ucs-bom,utf-8,latin1"
        > to "ucs-bom,utf-8,CP####" in Windows. Again, latin1 is only a
        > reasonable default for latin1 users; setting it intelligently should
        > work for a lot more people without changes. (I think this should let
        > most people edit locally-encoded files without having to touch encoding-
        > related settings at all.)

        Same issue. Instead of "latin1" the active codepage could be used, if
        it is an 8-bit encoding. Otherwise it must be "latin1", because we
        always need to fall back to an 8-bit encoding.

        > > > I'll work on making fileio use MBtoWC for codepage<->Unicode conversion
        > > > when possible. Even if you leave encodings as is, this will help.
        >
        > Attached.

        I'll look into this later.

        --
        You have heard the saying that if you put a thousand monkeys in a room with a
        thousand typewriters and waited long enough, eventually you would have a room
        full of dead monkeys.
        (Scott Adams - The Dilbert principle)

        /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
        /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
        \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
        \\\ Help me helping AIDS orphans in Uganda - http://iccf-holland.org ///
      • Glenn Maynard
        ... Well, there are two parts to it: One s the fake-backslash code. I don t think that s worth keeping as is, since it s very special-case, and like you said,
        Message 3 of 14 , Jul 18, 2002
          On Thu, Jul 18, 2002 at 10:02:00PM +0200, Bram Moolenaar wrote:
          > Ehm, this removes HanExtTextOut() completely. I'm not sure if that is a
          > good idea, it wan't added for fun. But it should certainly be called in
          > a different way, and ImeGetTempComposition() can be deleted. The check
          > for the sysfixed size should be done before calling HanExtTextOut(), and
          > the call to ExtTextOut() inside HanExtTextOut() removed. This is a lot
          > safer than deleting that code without knowing why it was there.

          Well, there are two parts to it:

          One's the fake-backslash code. I don't think that's worth keeping as
          is, since it's very special-case, and like you said, most Japanese users
          don't mind (expect, actually) the Yen sign. (I do have a replacement,
          since *I* mind the Yen sign, but it's not urgent.)

          The important one is the Korean IME code. I'm not sure if that's needed at
          all. I'm *guessing* that originally, the partially-composed characters
          weren't being displayed, and this was hacked in. However, it's wrong; the
          right way is to tell the IME where the cursor is so it renders it. This is
          being done now, due to the IME patch, so I believe this is completely obsolete.

          Of course, that's a guess. Could we contact whoever wrote this patch
          and ask them if the current Vim release is working for him in the IME?
          (Searching the archive for Sung-Hoon Baek didn't help.)

          > I suppose the flag should be "non_system_codepage", which is set when
          > using an 'encoding' different from the active codepage. I suppose it's
          > not useful for Unicode values of 'encoding' though.

          I'd say "not_system_codepage". (Unicode is not a system codepage; it's not
          a non-system-codepage.)

          > > I think these problems are ultimately because there are too many code
          > > paths. I'd suggest always converting the string to Unicode at the start
          > > of the function. (At least in NT, it shouldn't be a speed hit, since I
          >
          > That's quite a drastic change. I'm not really sure it's worth taking
          > the risc.

          It could be done piecemeal over a number of releases. However, if
          encoding can default to UTF-8, then all that's really important is
          getting UTF-8 rendering right in all cases, and that's something that
          should be done anyway.

          > The problem with using "utf-8" as a default is that a new file will get
          > this encoding, which is probably not what people expect. I think a new
          > file should probably use the active codepage as a default, unless the
          > user has selected something else.

          New files get the "fileencoding" encoding, right? Make fileencoding do
          this, too, then.

          > Also don't forget that what we call "latin1" is actually any 8-bit
          > encoding. It doesn't have to be the right name, it only matters when
          > doing conversions. Quite often we don't actually know what encoding is
          > being used and fall back to using latin1. This applies more to Unix
          > than MS-Windows though, since MS-Windows supplies a function to obtain
          > the active codepage.

          However, conversions are important. It's pretty confusing if Vim is
          appearing to load and display a file correctly, but sends junk to the
          clipboard because the conversion was done incorrectly.

          > Same issue. Instead of "latin1" the active codepage could be used, if
          > it is an 8-bit encoding. Otherwise it must be "latin1", because we
          > always need to fall back to an 8-bit encoding.

          What about fencs="ucs-bom,utf-8,CP####,latin1"?

          --
          Glenn Maynard
        Your message has been successfully submitted and would be delivered to recipients shortly.