6.3a on Win32: better utf-8 support, but...

  • Camillo Särs
    Message 1 of 7, May 6 8:17 AM
      Hi,

      Just installed the 6.3a BETA to check out the changes in utf-8 support, or to
      be precise, how utf-8 encoding works on WinXP. It's been "good" in 6.2
      already, but has some problems - mostly cosmetic.

      Apparently some, but not all, code paths that should use "Wide" versions of
      the Win32 API have now been converted from ANSI to Wide. There is an
      interesting discrepancy left, however. In the following I'm using "åäö.txt"
      as a filename.

      If I open a file using "File - Open", I get the ordinary Windows file open
      dialog. If I select a file name with a character > 127 in this dialog, even
      the titlebar will display the name "åäö.txt" correctly. Presumably the I/O is
      done using Wide functions.

      If I list a directory using ":edit [path]", the listing displays
      "<e5><e4><f6>.txt" and if I open the file from the browser, the titlebar also
      shows "<e5><e4><f6>.txt". If I use file completion, it shows
      "<e5><e4><f6>.txt". Editing the file is still possible, of course.

      Even more interesting: If I say :w åäö.txt, I get "åäö.txt" ... written, and
      the file name is now displayed in the title bar as "åäö.txt".

      Directories that contain characters >127 are even more troublesome - the
      built-in browser and file completion do not "see" them at all! They are all
      accessible using the "File - Open" dialog, though.

      Camillo
      --
      Camillo Särs <+ged+@...> ** Aim for the impossible and you
      <http://www.iki.fi/+ged> ** will achieve the improbable.
      PGP public key available **
    • Bram Moolenaar
      Message 2 of 7, May 6 9:26 AM
        Camillo wrote:

        > Just installed the 6.3a BETA to check out the changes in utf-8
        > support, or to be precise, how utf-8 encoding works on WinXP. It's
        > been "good" in 6.2 already, but has some problems - mostly cosmetic.
        >
        > Apparently some, but not all, code paths that should use "Wide"
        > versions of the Win32 API have now been converted from ANSI to Wide.
        > There is an interesting discrepancy left, however. In the following
        > I'm using "åäö.txt" as a filename.

        Those are latin1 characters, in case someone was wondering. Non-latin1
        characters still have the problem that they appear as question marks in
        the title for me. Don't know how to solve that.

        > If I open a file using "File - Open", I get the ordinary Windows file
        > open dialog. If I select a file name with a character > 127 in this
        > dialog, even the titlebar will display the name "åäö.txt"
        > correctly. Presumably the I/O is done using Wide functions.

        Yes.

        > If I list a directory using ":edit [path]", the listing displays
        > "<e5><e4><f6>.txt"

        With listing, do you mean using CTRL-D?

        > and if I open the file from the browser, the titlebar also
        > shows "<e5><e4><f6>.txt". If I use file completion, it shows
        > "<e5><e4><f6>.txt". Editing the file is still possible, of course.

        The completion apparently is not multi-byte aware. Editing still
        works because illegal bytes are accepted.
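
For illustration only (this is not Vim's code): the bytes e5 e4 f6 are the Latin-1 encoding of "åäö", and they do not form well-formed UTF-8, which is consistent with them being shown as <e5><e4><f6> when 'encoding' is utf-8. A minimal sketch of such a check in portable C follows; the helper name `is_valid_utf8` is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;           /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded value and smallest legal value */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;           /* stray continuation byte or invalid lead */
        }

        if (len - i <= extra)
            return 0;           /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += extra + 1;
    }
    return 1;
}
```

With this check, the ANSI bytes of the file name are rejected (0xE5 looks like a three-byte lead but is followed by 0xE4, which is not a continuation byte), while the UTF-8 form of the same name is accepted.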

        > Even more interesting: If I say :w åäö.txt, I get "åäö.txt" ...
        > written, and the file name is now displayed in the title bar as
        > "åäö.txt".

        It's still wrong for me.

        > Directories that contain characters >127 are even more troublesome -
        > the built-in browser and file completion do not "see" them at all!
        > They are all accessible using the "File - Open" dialog, though.

        That all appears to be the same problem, that completion doesn't use
        wide functions. I'll see if that can be fixed (without changing too
        much).

        --
        hundred-and-one symptoms of being an internet addict:
        134. You consider bandwidth to be more important than carats.

        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
        /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
        \\\ Buy at Amazon and help AIDS victims -- http://ICCF.nl/click1.html ///
      • Camillo Särs
        Message 3 of 7, May 6 12:22 PM
          Bram Moolenaar wrote:
          > Those are latin1 characters, in case someone was wondering.

          Well, yes, part of the latin-1 repertoire, although I used utf-8 while
          composing. Your mailer apparently couldn't cope with that correctly. At
          least it sent the utf-8 characters as latin-1 bytes. Just to make sure,
          here they are again, in iso-8859-1 this time: "åäö.txt".

          > non-latin1 characters still have the problem that they appear as
          > question marks in the title for me. Don't know how to solve that.

          There are a few possible reasons:
          - You are using some code page dependent function, i.e. characters within
          the code page display OK, but the rest are invalid and get '?'.
          - You are using the correct function, feeding it Unicode characters, but the
          font used to display the title bar does not contain glyphs for those
          characters. In this case, however, I think the Unicode glyph for "not
          available" should be shown.

          Displaying Unicode characters is tricky, because most Windows fonts only
          contain a certain subset.
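
The first case above can be sketched in portable C. This is only an illustration: ISO-8859-1 is assumed as a simplified stand-in for the ANSI code page (real Windows code pages such as 1252 differ in the 0x80-0x9F range), and `narrow_to_latin1` is a hypothetical helper. On Windows the comparable lossy step would typically happen inside an ANSI API call or in WideCharToMultiByte, which substitutes a default character (normally '?') for anything the target code page cannot represent.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Narrow len UTF-16 code units to Latin-1 bytes in dst, which must have
 * room for len + 1 bytes. Anything outside Latin-1 becomes '?'.
 * Returns the number of code units that were replaced.
 */
size_t narrow_to_latin1(const uint16_t *src, size_t len, char *dst)
{
    size_t lost = 0;

    for (size_t i = 0; i < len; i++) {
        if (src[i] <= 0xFF) {
            dst[i] = (char)src[i];  /* representable: copied unchanged */
        } else {
            dst[i] = '?';           /* not in the code page: information lost */
            lost++;
        }
    }
    dst[len] = '\0';
    return lost;
}
```

Feeding it U+00E5 followed by U+65E5 and U+672C leaves the first character intact and turns the other two into question marks, which matches the symptom described for the title bar.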

          >>If I list a directory using ":edit [path]", the listing displays
          >>"<e5><e4><f6>.txt"
          >
          > With listing, do you mean using CTRL-D?

          I meant listing as in "the directory listing displayed by vim in a window
          when I say :edit [directory name]". The file names apparently are in the
          local code page, which means that the directory listing is retrieved using
          ANSI functions. To display correctly when the encoding is utf-8, you should
          get the listing using Wide functions. This would give you utf-16, which is
          easy to convert to utf-8. :)
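
To show why the conversion is considered easy, here is a minimal sketch in portable C. It is not taken from Vim; `utf16_to_utf8` is a hypothetical helper, and replacing unpaired surrogates with U+FFFD is an assumption made for the example. On Windows itself, WideCharToMultiByte with CP_UTF8 is the usual way to get the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert src_len UTF-16 code units to NUL-terminated UTF-8 in dst.
 * dst_cap is the size of dst in bytes, including room for the NUL.
 * Returns the number of bytes written (without the NUL), or (size_t)-1
 * if dst is too small. Unpaired surrogates become U+FFFD.
 */
size_t utf16_to_utf8(const uint16_t *src, size_t src_len,
                     unsigned char *dst, size_t dst_cap)
{
    size_t i = 0, n = 0;

    assert(src != NULL && dst != NULL);

    while (i < src_len) {
        uint32_t cp = src[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: combine with the following low surrogate. */
            if (i < src_len && src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i] - 0xDC00);
                i++;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;        /* low surrogate without a high one */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need + 1 > dst_cap)
            return (size_t)-1;

        if (need == 1) {
            dst[n++] = (unsigned char)cp;
        } else if (need == 2) {
            dst[n++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[n++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            dst[n++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }

    if (n + 1 > dst_cap)
        return (size_t)-1;      /* no room for the terminating NUL */
    dst[n] = '\0';
    return n;
}
```

For the file name used in this thread, the three code units U+00E5, U+00E4, U+00F6 come out as the six bytes C3 A5 C3 A4 C3 B6.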

          > The completion apparently is not multi-byte aware. That editing still
          > works is because illegal bytes are accepted.

          Which from my perspective is a very good design - the failure mode still
          makes editing possible.

          >>Even more interesting: If I say :w åäö.txt, I get "åäö.txt" ...
          >>written, and the file name is now displayed in the title bar as
          >>"åäö.txt".
          >
          > It's still wrong for me.

          Sorry, a typo. Meant to write ":save åäö.txt". :w does not switch to the
          new file name, so of course the title bar does not change. Mea culpa. With
          :save you get the behavior I described.

          > That all appears to be the same problem, that completion doesn't use
          > wide functions. I'll see if that can be fixed (without changing too
          > much).

          That would be great.

          Camillo
          --
          Camillo Särs <ged@...> Aim for the impossible and you
          http://www.iki.fi/ged will achieve the improbable
        • Bram Moolenaar
          Message 4 of 7, May 7 4:22 AM
            Camillo Särs wrote:

            > > Those are latin1 characters, in case someone was wondering.
            >
            > Well, yes, part of the latin-1 repertoire, although I used utf-8 while
            > composing. Your mailer apparently couldn't cope with that correctly.
            > At least it sent the utf-8 characters as latin-1 bytes. Just to make
            > sure, here they are again, in iso-8859-1 this time: "åäö.txt".

            I know they were in utf-8, I just meant to say the characters are
            included in the latin1 charset, thus you don't get the problems related
            to non-latin1 characters.

            > > non-latin1 characters still have the problem that they appear as
            > > question marks in the title for me. Don't know how to solve that.
            >
            > There are a few possible reasons:
            > - You are using some code page dependent function, i.e. characters
            > within the code page display OK, but the rest are invalid and get '?'.

            The SetWindowTextW() function is used; that should work OK.

            > - You are using the correct function, feeding it Unicode characters, but the
            > font used to display the title bar does not contain glyphs for those
            > characters. In this case, however, I think the Unicode glyph for "not
            > available" should be shown.
            > Displaying Unicode characters is tricky, because most Windows fonts only
            > contain a certain subset.

            I tried changing the font, but that didn't solve the problem. I used
            the same font that displays the characters OK inside Vim.

            > >>If I list a directory using ":edit [path]", the listing displays
            > >>"<e5><e4><f6>.txt"
            > >
            > > With listing, do you mean using CTRL-D?
            >
            > I meant listing as in "the directory listing displayed by vim in a window
            > when I say :edit [directory name]". The file names apparently are in the
            > local code page, which means that the directory listing is retrieved using
            > ANSI functions. To display correctly when the encoding is utf-8, you should
            > get the listing using Wide functions. This would give you utf-16, which is
            > easy to convert to utf-8. :)

            The explorer plugin uses the glob() function, which in turn uses the
            same functions used for completion. Thus it's still the same problem.

            > > The completion apparently is not multi-byte aware. That editing still
            > > works is because illegal bytes are accepted.
            >
            > Which from my perspective is a very good design - the failure mode still
            > makes editing possible.

            Some people argue this is a security risk, but I have never understood
            why.

            --
            hundred-and-one symptoms of being an internet addict:
            161. You get up before the sun rises to check your e-mail, and you
            find yourself in the very same chair long after the sun has set.

          • Glenn Maynard
            Message 5 of 7, May 7 2:30 PM
              On Thu, May 06, 2004 at 06:26:01PM +0200, Bram Moolenaar wrote:
              > > Apparently some, but not all, code paths that should use "Wide"
              > > versions of the Win32 API have now been converted from ANSI to Wide.
              > > There is an interesting discrepancy left, however. In the following
              > > I'm using "åäö.txt" as a filename.

              If you want to test wide characters, you need to use characters that
              aren't included in your ANSI codepage, e.g. "日本語".

              > Those are latin1 characters, in case someone was wondering. non-latin1
              > characters still have the problem that they appear as question marks in
              > the title for me. Don't know how to solve that.

              You have to create a wide window class.

              Instead of setting up WNDCLASS and calling RegisterClass, first set up
              a WNDCLASSW and call RegisterClassW.

              If it fails with ERROR_CALL_NOT_IMPLEMENTED, set up the regular WNDCLASS
              as usual (for Win9x).

              Finally, at the bottom of your WndProc, call DefWindowProcW instead of
              DefWindowProcA if RegisterClassW was used.

              This makes the window capable of displaying Unicode text in the titlebar;
              otherwise, even if you pass Unicode data in, the low-level internal stuff
              that actually draws the text will just print "?".

              That should be enough to make SetWindowTextW work. (Of course, a
              fallback on SetWindowTextA on ERROR_CALL_NOT_IMPLEMENTED is also needed.)

              It's not a lot of work, but as there are plenty of other places that don't
              use wide system calls, I didn't bother fixing it.
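
The steps described in this message can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the code from Vim or from the PuTTY patch: the names `MainWndProc`, `register_main_class`, `set_title`, `g_wide_class` and the class name "SketchMainWindow" are hypothetical, and the error handling simply follows the ERROR_CALL_NOT_IMPLEMENTED fallback described above. The block is guarded so that it only compiles on Windows.

```c
#ifdef _WIN32
#include <windows.h>

/* TRUE once RegisterClassW has succeeded (hypothetical global). */
static BOOL g_wide_class = FALSE;

static LRESULT CALLBACK MainWndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
{
    switch (msg) {
    case WM_DESTROY:
        PostQuitMessage(0);
        return 0;
    }
    /* The default procedure has to match the way the class was registered. */
    return g_wide_class ? DefWindowProcW(hwnd, msg, wp, lp)
                        : DefWindowProcA(hwnd, msg, wp, lp);
}

/* Try the wide class first; fall back to the ANSI class on Win9x. */
ATOM register_main_class(HINSTANCE inst)
{
    WNDCLASSW wcw;
    WNDCLASSA wca;
    ATOM atom;

    ZeroMemory(&wcw, sizeof wcw);
    wcw.lpfnWndProc = MainWndProc;
    wcw.hInstance = inst;
    wcw.hCursor = LoadCursor(NULL, IDC_ARROW);
    wcw.hbrBackground = (HBRUSH)(COLOR_WINDOW + 1);
    wcw.lpszClassName = L"SketchMainWindow";

    atom = RegisterClassW(&wcw);
    if (atom != 0) {
        g_wide_class = TRUE;
        return atom;
    }
    if (GetLastError() != ERROR_CALL_NOT_IMPLEMENTED)
        return 0;               /* a real failure, not a missing wide API */

    ZeroMemory(&wca, sizeof wca);
    wca.lpfnWndProc = MainWndProc;
    wca.hInstance = inst;
    wca.hCursor = LoadCursor(NULL, IDC_ARROW);
    wca.hbrBackground = (HBRUSH)(COLOR_WINDOW + 1);
    wca.lpszClassName = "SketchMainWindow";
    return RegisterClassA(&wca);
}

/* Set the title from the wide string, or from the ANSI one on Win9x. */
BOOL set_title(HWND hwnd, const WCHAR *wide, const char *ansi)
{
    if (SetWindowTextW(hwnd, wide))
        return TRUE;
    if (GetLastError() != ERROR_CALL_NOT_IMPLEMENTED)
        return FALSE;
    return SetWindowTextA(hwnd, ansi);
}
#endif /* _WIN32 */
```

The caller is assumed to keep both a wide and a code page version of the title around, so that `set_title` has something to fall back to.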

              --
              Glenn Maynard
            • Bram Moolenaar
              Message 6 of 7, May 8 4:45 AM
                Glenn Maynard wrote:

                > On Thu, May 06, 2004 at 06:26:01PM +0200, Bram Moolenaar wrote:
                > > > Apparently some, but not all, code paths that should use "Wide"
                > > > versions of the Win32 API have now been converted from ANSI to Wide.
                > > > There is an interesting discrepancy left, however. In the following
                > > > I'm using "åäö.txt" as a filename.
                >
                > If you want to test wide characters, you need to use characters that
                > aren't included in your ANSI codepage, eg. "日本語".
                >
                > > Those are latin1 characters, in case someone was wondering. non-latin1
                > > characters still have the problem that they appear as question marks in
                > > the title for me. Don't know how to solve that.
                >
                > You have to create a wide window class.
                >
                > Instead of setting up WNDCLASS and calling RegisterClass, first set up
                > a WNDCLASSW and call RegisterClassW.
                >
                > If it fails with ERROR_CALL_NOT_IMPLEMENTED, set up the regular WNDCLASS
                > as usual (for Win9x).
                >
                > Finally, at the bottom of your WndProc, call DefWindowProcW instead of
                > DefWindowProcA if RegisterClassW was used.
                >
                > This makes the window capable of displaying Unicode text in the titlebar;
                > otherwise, even if you pass Unicode data in, the low-level internal stuff
                > that actually draws the text will just print "?".
                >
                > That should be enough to make SetWindowTextW work. (Of course, a
                > fallback on SetWindowTextA on ERROR_CALL_NOT_IMPLEMENTED is also needed.)

                Great, that is the hint I needed. I'll try searching for a bit of
                example code, especially for handling the errors.

                > It's not a lot of work, but as there are plenty of other places that don't
                > use wide system calls, I didn't bother fixing it.

                Generally using utf-8 for 'encoding' is a good thing to do in the long
                term. I'm trying to remove all disadvantages, so that using utf-8 will
                become the generic solution to problems with encodings, on all systems
                in all environments.

                --
                hundred-and-one symptoms of being an internet addict:
                179. You wonder why your household garbage can doesn't have an
                "empty recycle bin" button.

              • Glenn Maynard
                Message 7 of 7, May 8 2:55 PM
                  On Sat, May 08, 2004 at 01:45:13PM +0200, Bram Moolenaar wrote:
                  > Great, that is the hint I needed. I'll try searching for a bit of
                  > example code, especially for handling the errors.

                  http://zewt.org/~glenn/window.c

                  is the code I wrote for PuTTY (see lines 450 and 2569). It hasn't been
                  integrated upstream, though, so it doesn't have wide testing.

                  > Generally using utf-8 for 'encoding' is a good thing to do in the long
                  > term. I'm trying to remove all disadvantages, so that using utf-8 will
                  > become the generic solution to problems with encodings, on all systems
                  > in all environments.

                  I agree, of course--I've wanted UTF-8 to become the default internal encoding
                  for Vim in Windows for a while.

                  (In this case, this isn't really a disadvantage of UTF-8, though; ACP strings
                  do work ...)

                  --
                  Glenn Maynard