Filename encodings under Win32

  • Camillo Särs
    Message 1 of 29, Oct 10, 2003
      Hi,

      (vim 6.2, WinXP)

      If I use the UTF-8 encoding, and enter non-ASCII characters in filenames,
      they also use the UTF-8 encoding. That's clearly wrong on Win32. It's
      equally clearly right on most Unixes.

      The Windows APIs come in two flavors - "ANSI" and "Unicode". The former
      requires filenames to be in the correct codepage, the latter expects native
      Unicode (UCS-2).

      To avoid a lot of codepage mess, I would suggest that the "right way" to
      fix this would be to internally convert all strings passed to the Windows
      API into Unicode, and then of course to call the Unicode versions of the
      functions. The alternative would be to call the ANSI versions, which would
      be plain silly. Firstly, because they only cover the current codepage, and
      secondly because internally NT converts those strings to Unicode anyway.
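
      As a rough sketch of that idea (not Vim's actual code -- the helper name
      open_utf8_filename is made up, and buffer sizes and error handling are
      simplified), converting a UTF-8 filename with MultiByteToWideChar() and
      calling the Unicode flavor of an API would look something like:

      #include <windows.h>

      /* Open a file whose name is kept internally in UTF-8, using the
       * Unicode ("W") API instead of the ANSI ("A") one. */
      HANDLE open_utf8_filename(const char *name_utf8)
      {
          WCHAR wname[MAX_PATH];

          /* Convert the UTF-8 name to native Unicode (UCS-2/UTF-16). */
          if (MultiByteToWideChar(CP_UTF8, 0, name_utf8, -1,
                                  wname, MAX_PATH) == 0)
              return INVALID_HANDLE_VALUE;

          return CreateFileW(wname, GENERIC_READ, FILE_SHARE_READ,
                             NULL, OPEN_EXISTING, 0, NULL);
      }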

      I'm not sure how much work this would actually be, but until this is
      implemented, Unicode support on Win32 remains partially broken. For many
      users, using us-ascii only in filenames is not a problem, but for those who
      need special characters and want utf-8, this is really a big issue.

      Am I right in my diagnosis, or have I overlooked something essential?

      Cheers,
      Camillo
    • Glenn Maynard
      Message 2 of 29, Oct 12, 2003
        Hmm. My reply appears to have vanished without a trace. I'd attached
        os_win32.c (not noticing that it was an extremely oversized source
        file--over 100k); I'm reposting with the source file linked. It's
        strange that I didn't see any kind of rejection notice.

        On Fri, Oct 10, 2003 at 03:16:27PM +0300, Camillo Särs wrote:
        > To avoid a lot of codepage mess, I would suggest that the "right way" to
        > fix this would be to internally convert all strings passed to the Windows
        > API into Unicode, and then of course to call the Unicode versions of the
        > functions. The alternative would be to call the ANSI versions, which would
        > be plain silly. Firstly, because they only cover the current codepage, and
        > secondly because internally NT converts those strings to Unicode anyway.

        Most Unicode versions of functions aren't available in 9x, and codepage-
        encoded strings must continue to work correctly.
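
        (To make that concrete -- this is a guess at the shape of such a
        check, not necessarily how the attached os_win32.c does it -- the
        code first has to decide at runtime whether the wide API is usable:

        #include <windows.h>

        /* On Win9x/ME the high bit of GetVersion() is set; the wide ("W")
         * file functions are only reliably usable on NT-based systems. */
        static int system_has_unicode(void)
        {
            return (GetVersion() & 0x80000000) == 0;
        }

        and then fall back to the ANSI ("A") calls, with codepage-encoded
        strings, whenever that returns false.)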

        > I'm not sure how much work this would actually be, but until this is
        > implemented, Unicode support on Win32 remains partially broken. For many
        > users, using us-ascii only in filenames is not a problem, but for those who
        > need special characters and want utf-8, this is really a big issue.

        You can always use characters that are in your system encoding. Western
        (CP1252) users can always use ñ, for example. NT can handle Unicode
        filenames, but 9x can't.

        Of course, Vim should handle Unicode for filenames (and message boxes,
        and so on). Not doing so is hardly up to the quality of Vim's i18n
        support. However, there's not enough demand for it, so it just hasn't
        been done yet.

        I wrote some code to change the filesystem layer to handle Unicode a
        while back, including ANSI fallbacks. I didn't bother spending the time
        to get it cleaned, tested and applied, because so few other programs
        support this (and, as Vim patch turnaround time is understandably long,
        I had more important patches to work on). For example, an Ogg or MP3 with
        Japanese in the filename simply can't be loaded by Winamp in Windows, at
        all, unless you're in a Japanese codepage.

        I found some old copy of this code and attached it[1]. If you want to cvs
        diff it, I believe it's from CVS rev 1.60. I don't know what state this
        was in, but it should give you an idea of what needs to be done.

        I've always wanted the default internal encoding of Vim to be UTF-8 in
        Windows. This is one thing that would need to be done to do that, along
        with all other Windows API interactions. (I've heard of printing
        problems, too, but I don't know about those as I never print.)

        [1] http://zewt.org/~glenn/os_win32.c
        Search for "system_has_unicode".

        --
        Glenn Maynard
      • Tony Mechelynck
          Message 3 of 29, Oct 12, 2003
          Glenn Maynard <glenn@...> wrote:
          [...]
          > I've always wanted the default internal encoding of Vim to be UTF-8 in
          > Windows. This is one thing that would need to be done to do that,
          > along with all other Windows API interactions. (I've heard of
          > printing
          > problems, too, but I don't know about those as I never print.)
          [...]

          There are more problems than just printing.

          As long as 'fileencoding', 'printencoding' and (most important)
          'termencoding' default (when empty) to whatever is the current value of
          'encoding', the latter must not (IMHO) be set to UTF-8 by default.

          (Let's spell it out) In my humble opinion, Vim should require as little
          "tuning" as possible to handle the language interfaces the same way as the
          operating system does, and this means that, when the user sets nothing else
          in his startup and configuration files, keyboard input, printer output and
          file creation should default to whatever is set in the locale.

          If the user wants to handle Unicode files, it is quite possible to set gvim
          to do it, even in Win98 systems like mine; but this requires, among other
          things, storing the previous value of 'encoding' into 'termencoding' because
          the user cannot, by a mere snap of the fingers, change his keyboard input
          from some national encoding to Unicode. Similarly, on systems where the
          'printencoding' option is recognised, the user is not always able to change
          how the printer will react to output strings, and therefore that setting
          must also be preserved, unless of course one decides to always send fonts as
          bitmaps, and to convert them to bitmaps in gvim itself, which I don't think
          desirable.

          For all these reasons, I believe that the setting of the various encodings
          used by (g)vim (namely, 'encoding', 'fileencoding', 'termencoding' and
          'printencoding', as well as a possible 8-bit encoding at the end of
          'fileencodings') should, as I believe they already do, default directly or
          indirectly to whatever is set in the locale, and that a possible switchover
          to Unicode should be left to the voluntary and reasoned choice of the user.
          A few days ago, I sent (to the vim-at-vim.org mailing list) a snippet of
          code (to be used as a vim script, or part of one) for such switchover, in
          response to an inquiry by some Swedish user, and it seems to have proven
          satisfactory; I may publish it at vim-online some day soon, if I don't
          forget.

          Best regards,
          Tony
          mailto:antoine.mechelynck@...
          http://users.skynet.be/antoine.mechelynck/
        • Glenn Maynard
            Message 4 of 29, Oct 12, 2003
            On Sun, Oct 12, 2003 at 10:44:05PM +0200, Tony Mechelynck wrote:
            > As long as 'fileencoding', 'printencoding' and (most important)
            > 'termencoding' default (when empty) to whatever is the current value of
            > 'encoding', the latter must not (IMHO) be set to UTF-8 by default.
            >
            > (Let's spell it out) In my humble opinion, Vim should require as little
            > "tuning" as possible to handle the language interfaces the same way as the
            > operating system does, and this means that, when the user sets nothing else
            > in his startup and configuration files, keyboard input, printer output and
            > file creation should default to whatever is set in the locale.

            This is a trivial fix, which I already proposed many months ago: the
            defaults in Windows should be the results of

            exe "set fileencodings=ucs-bom,utf-8,cp" . getacp() . ",latin1"
            exe "set fileencoding=cp" . getacp()

            and now adding:

            exe "set printencoding=cp" . getacp()

            Note that "getacp" is a function in a patch I sent which was lost or
            forgotton: return the ANSI codepage.

            (A slightly safer default would be to remove "utf-8" from the search, to
            prevent false matches.) I haven't found any problems with this; it's been
            my default for a long time and I actively edit UTF-8 and CP932 files.
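
            (For reference, getacp() would be little more than a wrapper
            around the Win32 GetACP() call -- a sketch, since the actual
            patch isn't shown here:

            #include <windows.h>
            #include <stdio.h>

            int main(void)
            {
                /* GetACP() returns the ANSI codepage number, e.g. 1252 on
                 * Western systems or 932 on Japanese ones. */
                printf("cp%u\n", GetACP());
                return 0;
            }

            so that "set fileencoding=cp" . getacp() yields e.g. "cp1252".)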

            > If the user wants to handle Unicode files, it is quite possible to set gvim
            > to do it, even in Win98 systems like mine; but this requires, among other
            > things, storing the previous value of 'encoding' into 'termencoding' because
            > the user cannot, by a mere snap of the fingers, change his keyboard input
            > from some national encoding to Unicode.

            The input in a Windows window is well-defined; "termencoding" should not
            even be needed in Windows. Depending on which messages are trapped, the
            input is always in the ANSI codepage or Unicode.

            However, if it's being used anyway for some reason, then the solution is
            the same:

            exe "set termencoding=cp" . getacp()

            The only reason I know of not to set "encoding" to "utf-8" is that Vim
            doesn't do proper conversions for Win32 calls.

            > used by (g)vim (namely, 'encoding', 'fileencoding', 'termencoding' and
            > 'printencoding', as well as a possible 8-bit encoding at the end of
            > 'fileencodings') should, as I believe they already do, default directly or
            > indirectly to whatever is set in the locale, and that a possible switchover
            > to Unicode should be left to the voluntary and reasoned choice of the user.

            Switching "encoding" to "utf-8" should be transparent, once proper
            conversions for win32 calls are in place. Regular users don't care
            about what encoding their editor uses internally, any more than they
            care about what type of data structures they use.

            On the other hand, if utf-8 internally is fully supported, then utf-8
            can be the *only* internal encoding--which would make the rendering
            code much simpler and more robust. I remember finding lots of little
            errors in the renderer (e.g. underlining glitches for double-width
            characters) that went away with utf-8, and I don't think Vim renders
            correctly at all if e.g. "encoding" is set to "cp1252" and the ACP
            is CP932 (needs a double conversion).

            --
            Glenn Maynard
          • Tony Mechelynck
              Message 5 of 29, Oct 12, 2003
              Glenn Maynard <glenn@...> wrote:
              > On Sun, Oct 12, 2003 at 10:44:05PM +0200, Tony Mechelynck wrote:
              > > As long as 'fileencoding', 'printencoding' and (most important)
              > > 'termencoding' default (when empty) to whatever is the current
              > > value of 'encoding', the latter must not (IMHO) be set to UTF-8 by
              > > default.
              > >
              > > (Let's spell it out) In my humble opinion, Vim should require as
              > > little "tuning" as possible to handle the language interfaces the
              > > same way as the operating system does, and this means that, when
              > > the user sets nothing else in his startup and configuration files,
              > > keyboard input, printer output and file creation should default to
              > > whatever is set in the locale.
              >
              > This is a trivial fix, which I already proposed many months ago: the
              > defaults in Windows should be the results of
              >
              > exe "set fileencodings=ucs-bom,utf-8,cp" . getacp() . ",latin1"
              > exe "set fileencoding=cp" . getacp()
              >
              > and now adding:
              >
              > exe "set printencoding=cp" . getacp()
              >
              > Note that "getacp" is a function in a patch I sent which was lost or
              > forgotton: return the ANSI codepage.
              >
              > (A slightly safer default would be to remove "utf-8" from the search,
              > to prevent false matches.) I haven't found any problems with this;
              > it's been
              > my default for a long time and I actively edit UTF-8 and CP932 files.

              Trivial or not, my opinion is that handling files and keypresses as per the
              locale shouldn't be a "fix", it should be the (program) default. The "minor
              fix" consists of making Unicode the (user's) default by means of a config
              setting; but see below about that.
              >
              > > If the user wants to handle Unicode files, it is quite possible to
              > > set gvim to do it, even in Win98 systems like mine; but this
              > > requires, among other things, storing the previous value of
              > > 'encoding' into 'termencoding' because the user cannot, by a mere
              > > snap of the fingers, change his keyboard input from some national
              > > encoding to Unicode.
              >
              > The input in a Windows window is well-defined; "termencoding" should
              > not
              > even be needed in Windows. Depending on which messages are trapped,
              > the input is always in the ANSI codepage or Unicode.

              Sorry, but it is. AFAIK, leaving 'termencoding' empty when switching
              'encoding' over from something else to Unicode produces dysfunctions in the
              keyboard for all users whose actual keyboard encoding is other than 7-bit
              ASCII -- roughly speaking, for all users with a keyboard for a language
              other than English (even Dutchmen like Bram need, as a minimum, the
              "lowercase e with diaeresis", which is over 128, and therefore receives a
              different representation in UTF-8 and in other encodings -- the codepoint
              number may be the same but it is not represented identically). That's why the
              lines

              if &termencoding == ""
              let &termencoding = &encoding
              endif

              have been put in my script set_utf8.vim (newly uploaded to vim.online),
              before the actual switch of 'encoding' to utf-8. Thanks to this, any
              accented keys (and my own keyboard has a lot of them) go on working
              identically (i.e., transparently) after the switchover as they did before.
              Of course, making utf-8 the Vim default for 'encoding' would break the above
              code, with (AFAIK) no possibility of repair in mainline Vim (which hasn't
              got the getacp() function -- and don't talk to me about a patch, I don't
              want to use other than standard binaries; for one thing, I don't have a
              compiler and I don't want to get one: messing about with nonstandard
              compilations is definitely not my cup of tea). It would break it, I mean,
              unless the Vim default for 'termencoding' would change from the empty string
              (i.e. use whatever is the current global Vim 'encoding' at the time a key is
              pressed) to the user's locale (as found in $LANG at startup). But let's keep
              things simple, not break existing scripts, reduce Bram and other people's
              workloads, and keep Vim's handling of encodings as it is (the only change
              I'd like to see is to add a functioning 'printencoding' option to Windows
              versions of gvim, even though they don't print through PostScript).
              >
              > However, if it's being used anyway for some reason, then the solution
              > is
              > the same:
              >
              > exe "set termencoding=cp" . getacp()
              >
              > The only reason I know of not to set "encoding" to "utf-8" is that Vim
              > doesn't do proper conversions for Win32 calls.

              Users who only edit files in a single 8 bit encoding don't need to bother
              about Unicode. For others, it is a useful choice, but I maintain that it
              should remain a choice, and, if the locale set in the operating system is
              not a Unicode one, it should IMHO remain a conscious choice (or at least a
              voluntary one, that need not stay conscious once it has been written into
              the vimrc).
              >
              > > used by (g)vim (namely, 'encoding', 'fileencoding', 'termencoding'
              > > and 'printencoding', as well as a possible 8-bit encoding at the
              > > end of 'fileencodings') should, as I believe they already do,
              > > default directly or indirectly to whatever is set in the locale,
              > > and that a possible switchover to Unicode should be left to the
              > > voluntary and reasoned choice of the user.
              >
              > Switching "encoding" to "utf-8" should be transparent, once proper
              > conversions for win32 calls are in place. Regular users don't care
              > about what encoding their editor uses internally, any more than they
              > care about what type of data structures they use.
              >
              > On the other hand, if utf-8 internally is fully supported, then utf-8
              > can be the *only* internal encoding--which would make the rendering
              > code much simpler and more robust. I remember finding lots of little
              > errors in the renderer (e.g. underlining glitches for double-width
              > characters) that went away with utf-8, and I don't think Vim renders
              > correctly at all if e.g. "encoding" is set to "cp1252" and the ACP
              > is CP932 (needs a double conversion).
              >
              > --
              > Glenn Maynard

              UTF-8 is fully supported (well, almost fully: characterwise
              bidirectionality, a Unicode property, isn't supported) internally by
              multi-byte versions of gvim, but switching over "transparently" from
              "locale-oriented" to "Unicode-oriented" working requires careful attention
              to several options, foremost of which are 'termencoding' and
              'fileencodings'. To help the ordinary Vim user make that switchover
              "transparently" without (as we say in French) "getting his feet caught in
              the carpet", I uploaded a few minutes ago a new script called set_utf8.vim :
              go see it at http://vim.sourceforge.net/scripts/script.php?script_id=789 .
              With it and a Unicode-enabled version of Vim (with no need for any special
              patches), switching over from one's national locale to Unicode becomes a
              one-liner (you may call it a "trivial fix"). The idea of that script is to
              work as "transparently" as possible, e.g., to avoid messing up the existing
              keyboard's or (if possible) printer's interpretation of accented characters.

              Regards,
              Tony.
            • Glenn Maynard
                Message 6 of 29, Oct 12, 2003
                On Mon, Oct 13, 2003 at 02:41:25AM +0200, Tony Mechelynck wrote:
                > Trivial or not, my opinion is that handling files and keypresses as per the
                > locale shouldn't be a "fix", it should be the (program) default. The "minor
                > fix" consists of making Unicode the (user's) default by means of a config
                > setting; but see below about that.

                My suggestion was that these be the default settings in Windows, not be
                settings that the user has to fix.

                > Sorry, but it is. AFAIK, leaving 'termencoding' empty when switching
                > 'encoding' over from something else to Unicode produces dysfunctions in the
                > keyboard for all users whose actual keyboard encoding is other than 7-bit
                > ASCII -- roughly speaking, for all users with a keyboard for a language
                > other than English (even Dutchmen like Bram need, as a minimum, the
                > "lowercase e with diaeresis", which is over 128, and therefore receives a
                > different representation in UTF-8 and in other encodings -- the codepoint
                > number may be the same but it is not represented identically). That's why the
                > lines

                This sounds like a bug. The input from Windows is always in the system
                encoding (ACP) or Unicode. So, either termencoding should be ignored,
                or (if someone actually has a real use for changing it in Windows) it should
                default to the appropriate codepage, as I suggested.

                > code, with (AFAIK) no possibility of repair in mainline Vim (which hasn't
                > got the getacp() function -- and don't talk to me about a patch, I don't
                > want to use other than standard binaries; for one thing, I don't have a

                Um, the entire purpose of a patch is for it to be integrated into
                mainline Vim.

                However, the "code" I showed was just to demonstrate what I believe the
                defaults should look like. They'd actually be set in the source, not as
                Vim commands. The "getacp()" call only makes it *possible* to do that
                with Vim commands (which is useful itself).

                > Users who only edit files in a single 8 bit encoding don't need to bother
                > about Unicode. For others, it is a useful choice, but I maintain that it
                > should remain a choice, and, if the locale set in the operating system is
                > not a Unicode one, it should IMHO remain a conscious choice (or at least a
                > voluntary one, that need not stay conscious once it has been written into
                > the vimrc).

                Users, for the most part, don't care what the internal representation
                is. Many users don't even know what an encoding is (and shouldn't have
                to). I've seen little reason for UTF-8 to not eventually be the default
                internal encoding for Vim in Windows, once the remaining issues are
                resolved.

                The only interesting, fundamental reason I've seen is memory usage: UTF-8
                uses more memory for many languages.

                > UTF-8 is fully supported (well, almost fully: characterwise
                > bidirectionality, a Unicode property, isn't supported) internally by

                Not quite. It won't convert from UTF-8 to the ACP or Unicode when
                calling Windows API functions. For example, if I open files with
                kanji in the filename and enc=utf-8, the title bar has <12><34> garbage
                in it. Minimally, this should convert the string to CP932.

                In any case, I'm not about to crusade for this. I'm mostly interested in
                seeing the bugs where functionality is broken when enc=utf-8 be fixed,
                such as the title bar issue. I'd like to be able to say "use enc=utf-8
                internally and it'll fix your problems", which I can't--because it
                introduces new ones.

                --
                Glenn Maynard
              • Tony Mechelynck
                  Message 7 of 29, Oct 12, 2003
                  Glenn Maynard <glenn@...> wrote:
                  > On Mon, Oct 13, 2003 at 02:41:25AM +0200, Tony Mechelynck wrote:
                  > > Trivial or not, my opinion is that handling files and keypresses as
                  > > per the locale shouldn't be a "fix", it should be the (program)
                  > > default. The "minor fix" consists of making Unicode the (user's)
                  > > default by means of a config setting; but see below about that.
                  >
                  > My suggestion was that these be the default settings in Windows, not
                  > be settings that the user has to fix.

                  I understood you as meaning that the program-default setting should be
                  Unicode. I beg to differ, however. Or maybe I misunderstood what you were
                  saying. And whatever the program-default settings, Vim should (IMHO) work in
                  as constant a manner as possible across all platforms.
                  >
                  [...]
                  > > Sorry, but it is. AFAIK, leaving 'termencoding' empty when switching
                  > > 'encoding' over from something else to Unicode produces
                  > > dysfunctions in the keyboard for all users whose actual keyboard
                  > > encoding is other than 7-bit ASCII -- roughly speaking, for all
                  > > users with a keyboard for a language other than English (even
                  > > Dutchmen like Bram need, as a minimum, the "lowercase e with
                  > > diaeresis", which is over 128, and therefore receives a different
                  > > representation in UTF-8 and in other encodings -- the codepoint
                  > > number may be the same but it is not represented identically).
                  > > That's why the lines
                  >
                  > This sounds like a bug. The input from Windows is always in the
                  > system encoding (ACP) or Unicode. So, either termencoding should be
                  > ignored,
                  > or (if someone actually has a real use for changing it in Windows) it
                  > should default to the appropriate codepage, as I suggested.

                  It doesn't sound like a bug to me, but like a misunderstanding between
                  Windows and Vim, as they suddenly aren't "speaking the same language"
                  anymore. Let's spell out what I mean with an example:

                  Let's say I press a "lowercase e with acute accent" (by far the most
                  frequent accented letter in French, my mother language). On my keyboard it's
                  the unshifted 2 key above the alphabet keys, but that doesn't matter much.
                  Under (let's say) latin1 locale, Windows makes the byte 0xE9 available to
                  gvim. The latter (in Insert mode and with latin1 'encoding') writes an
                  e-acute into the buffer I'm currently editing. This is correct behaviour.

                  Now let's say I change 'encoding' to "utf-8". With 'termencoding' left empty
                  (the default), gvim now suddenly expects the keyboard to be sending UTF-8
                  byte sequences (because an empty 'termencoding' means it takes the same
                  value as whatever is the current value of 'encoding'). Windows, however, is
                  not aware of any changes. It still sends 0xE9 for e-acute. Vim sees this,
                  and since it is a valid header byte for a 3-byte UTF-8 sequence, it expects
                  2 bytes in the range 0x80-0xBF following it. When they are not forthcoming,
                  Vim puts the 0xE9 in the buffer, interprets it as invalid, and displays it
                  as <E9>.

                  However, if I take the precaution of first saving the older 'encoding' in
                  'termencoding', then I may change 'encoding' to UTF-8 with no ill effects:
                  gvim still expects latin1 from the keyboard, and when it reads 0xE9, it
                  correctly interprets it as e-acute, and represents it internally as the
                  UTF-8 byte sequence 0xC3 0xA9, which represents the codepoint U+00E9 "LATIN
                  SMALL LETTER E WITH ACUTE".

                  Note: My W98 system can set a variety of "national keyboards" -- I can even
                  type Arabic in WordPad -- but they're a hassle because there is no
                  correspondence between what is printed on the keys of my Belgian AZERTY
                  keyboard and what those "national keyboards" send. At least, with Vim's
                  keymaps, I can design any number of keymaps to suit me, and, for instance,
                  map the Russian deh or the Arabic daal to the Latin D key, which makes sense
                  to me but does not necessarily correspond to where Russian or Arabic people
                  expect their D key to be. AFAIK I cannot choose Unicode as the "national
                  keyboard" (and, in fact, I don't need to, since it's easier for me to keep
                  Windows set to French language with Belgian AZERTY keyboard, and let gvim
                  handle non-Latin encodings by means of keymaps, digraphs, and/or the
                  i_CTRL-V_digit capability).
                  >
                  > > code, with (AFAIK) no possibility of repair in mainline Vim (which
                  > > hasn't got the getacp() function -- and don't talk to me about a
                  > > patch, I don't want to use other than standard binaries; for one
                  > > thing, I don't have a
                  >
                  > Um, the entire purpose of a patch is for it to be integrated into
                  > mainline Vim.
                  >
                  > However, the "code" I showed was just to demonstrate what I believe
                  > the defaults should look like. They'd actually be set in the source,
                  > not as
                  > Vim commands. The "getacp()" call only makes it *possible* to do that
                  > with Vim commands (which is useful itself).

                  It may be useful in itself; but until and unless it is indeed (as you
                  suggest) incorporated in mainline Vim source (a possibility towards which
                  I'm not averse as long as it doesn't break something else), it "doesn't
                  exist" from where I sit.
                  >
                  > > Users who only edit files in a single 8 bit encoding don't need to
                  > > bother about Unicode. For others, it is a useful choice, but I
                  > > maintain that it should remain a choice, and, if the locale set in
                  > > the operating system is not a Unicode one, it should IMHO remain a
                  > > conscious choice (or at least a voluntary one, that need not stay
                  > > conscious once it has been written into the vimrc).
                  >
                  > Users, for the most part, don't care what the internal representation
                  > is. Many users don't even know what an encoding is (and shouldn't
                  > have
                  > to). I've seen little reason for UTF-8 to not eventually be the
                  > default internal encoding for Vim in Windows, once the remaining
                  > issues are
                  > resolved.
                  >
                  > The only interesting, fundamental reason I've seen is memory usage:
                  > UTF-8 uses more memory for many languages.

                  Indeed. The difference is virtually nil for English; it is small but nonzero
                  for other Latin-alphabet languages, it approaches 1 to 2 for other-alphabet
                  languages like Greek or Russian (a little less than that because of spaces,
                  commas, full stops, etc.); I don't know the ratio for languages like Hindi
                  (with Nagari script) or Chinese (hanzi).
                  >
                  > > UTF-8 is fully supported (well, almost fully: characterwise
                  > > bidirectionality, a Unicode property, isn't supported) internally by
                  >
                  > Not quite. It won't convert from UTF-8 to the ACP or Unicode when
                  > calling Windows API functions. For example, if I open files with
                  > kanji in the filename and enc=utf-8, the title bar has <12><34>
                  > garbage
                  > in it. Minimally, this should convert the string to CP932.
                  >
                  > In any case, I'm not about to crusade for this. I'm mostly
                  > interested in seeing the bugs where functionality is broken when
                  > enc=utf-8 be fixed,
                  > such as the title bar issue. I'd like to be able to say "use
                  > enc=utf-8 internally and it'll fix your problems", which I
                  > can't--because it
                  > introduces new ones.
                  >
                  > --
                  > Glenn Maynard

                  I see. My script won't fix the problems caused by kanji in filenames
                  (personally I tend to shy away from anything other than us-ascii in
                  filenames anyway; I have, however, some e-acutes in filenames automatically
                  generated by Windows) but if you look at it, you'll see that it will make
                  Unicode use easier (with, IMHO, little hassle and good transparency) for the
                  average user of currently existing out-of-the-box multibyte versions of Vim.
                  Having kanji in filenames display correctly on the titlebar (and, why not,
                  on the status bar too) should be a separate fix, which ought to have no
                  (positive or negative) influence on the workings of my script.

                  By the way: what do you mean by ACP? The currently "active code page" maybe?

                  Hm. Your "kanji in filenames" issue makes me think: could that be related to
                  the fact that my Netscape 7 cannot properly handle Cyrillic letters between
                  <title></title> HTML tags (what sits there displays on the title bar, and
                  anything out-of-the-way is accepted but doesn't display properly, IIRC not
                  even with a <meta> tag specifying that the page is in UTF-8) but can show
                  them with no problems in body text, for instance between <H1></H1> (where
                  the title could appear again, this time to be displayed on top of the text
                  inside the browser window)? But this paragraph may be drifting off-topic.

                  Best regards,
                  Tony.
                • Glenn Maynard
                    Message 8 of 29, Oct 12, 2003
                    On Mon, Oct 13, 2003 at 05:21:04AM +0200, Tony Mechelynck wrote:
                    > I understood you as meaning that the program-default setting should be
                    > Unicode. I beg to differ, however. Or maybe I misunderstood what you were
                    > saying. And whatever the program-default settings, Vim should (IMHO) work in
                    > as constant a manner as possible across all platforms.

                    I believe that the *internal* encoding ("encoding") can, if the various
                    bugs are fixed, reasonably be UTF-8, unless there's outcry about memory
                    usage. I agree that it's very important that keyboard input, file
                    reading and writing, and so on operate in the ACP by default.

                    > Now let's say I change 'encoding' to "utf-8". With 'termencoding' left empty
                    > (the default), gvim now suddenly expects the keyboard to be sending UTF-8
                    > byte sequences (because an empty 'termencoding' means it takes the same
                    > value as whatever is the current value of 'encoding'). Windows, however, is

                    Right: I believe this is poor behavior for Windows. Windows input is
                    always in the ACP[1], and if it's not, it should always be possible to find
                    out what it is. (That is, I don't know exactly what Windows does if you
                    have multiple keyboard mappings and change languages, but it shouldn't
                    require special changing of tenc.)

                    For example, Vim always expects data from the IME in the encoding it
                    sends (Unicode). termencoding is not used. If I set tenc=cp1252, I
                    can still enter Japanese kanji with the IME--Vim knows that data is
                    always in the same format, and handles it correctly, even though it's
                    not CP1252. Keyboard input is the same: the encoding should always
                    be predictable.

                    (I don't know if anyone is using tenc in Windows to do weird things;
                    I can't think of any practical use for intentionally setting tenc to
                    a value that doesn't match the ACP.)

                    > It may be useful in itself; but until and unless it is indeed (as you
                    > suggest) incorporated in mainline Vim source (a possibility towards which
                    > I'm not averse as long as it doesn't break something else), it "doesn't
                    > exist" from where I sit.

                    That's nice, but not relevant. :) Again, I wasn't suggesting anyone
                    use the Vim script I supplied, but only using it to demonstrate what the
                    internal defaults could be.

                    > Indeed. The difference is virtually nil for English; it is small but nonzero
                    > for other Latin-alphabet languages, it approaches 1 to 2 for other-alphabet
                    > languages like Greek or Russian (a little less than that because of spaces,
                    > commas, full stops, etc.); I don't know the ratio for languages like Hindi
                    > (with Nagari script) or Chinese (hanzi).

                    The penalty is about 50% for CJK languages (two byte encodings become
                    three byte sequences).

                    > By the way: what do you mean by ACP? The currently "active code page" maybe?

                    ANSI codepage. It's the system codepage, set in the "regional settings"
                    control panel (or whatever; MS changes the control panels weekly). It's
                    the codepage that "*A" (ANSI) functions expect (which are the ones Vim
                    uses, for the most part). Essentially, the ACP is to Windows 9x as
                    "encoding" is to Vim. In NT, everything is UCS-16 internally--or
                    is it UTF-16?--and the "*A" functions convert to and from the ACP.

                    In a sense, MS did with NT what I wish Vim would do--standardize on Unicode
                    internally, to make the internals simpler, in a way that is transparent
                    to users.

                    > Hm. Your "kanji in filenames" issue makes me think: could that be related to
                    > the fact that my Netscape 7 cannot properly handle Cyrillic letters between
                    > <title></title> HTML tags (what sits there displays on the title bar, and
                    > anything out-of-the-way is accepted but doesn't display properly, IIRC not
                    > even with a <meta> tag specifying that the page is in UTF-8) but can show
                    > them with no problems in body text, for instance between <H1></H1> (where
                    > the title could appear again, this time to be displayed on top of the text
                    > inside the browser window)? But this paragraph may be drifting off-topic.

                    It's related, but not exactly the same.

                    Vim's problem with titlebars is that it's not converting titlebar
                    strings to the ACP. ("桜.txt" shows up as <8d><f7>.txt, and 8df7
                    looks like the CP932 value of 桜; I'm not entirely sure how that's
                    happening and haven't looked at the code.) Fixing this will allow
                    displaying characters in the ANSI codepage: a system set to Japanese
                    will be able to display Kanji, but not Arabic.

                    For displaying full Unicode, it needs to test if Unicode is available,
                    create a Unicode window (instead of an ANSI window), and set the title
                    with the corresponding wide function. This isn't too hard, but it does
                    take more work and a great deal more testing (to make sure it doesn't
                    break anything in 9x). This would be nice, but it's above and beyond
                    "don't break anything in UTF-8 that works in the normal ANSI codepage".

                    Whoops. I just tried saving "桜.txt", and ended up with "(garbage)÷.txt".
                    That explains the "<8d><f7>.txt". Looks like file saving isn't working
                    right when enc=utf-8. This is a much more serious bug, but not one I'm
                    up to fixing right now, as, like you, I rarely edit files with non-ASCII
                    characters in the filename. (I'm still using 6.1, though, so this might
                    well be fixed.)

                    [1] or in Unicode in NT if you use the correct Windows messages, but I
                    don't recall which of those work in 9x (probably none)

                    --
                    Glenn Maynard
                  • Tony Mechelynck
                      Message 9 of 29, Oct 12, 2003
                      Glenn Maynard <glenn@...> wrote:
                      > On Mon, Oct 13, 2003 at 05:21:04AM +0200, Tony Mechelynck wrote:
                      > > I understood you as meaning that the program-default setting should
                      > > be Unicode. I beg to differ, however. Or maybe I misunderstood what
                      > > you were saying. And whatever the program-default settings, Vim
                      > > should (IMHO) work in as constant a manner as possible across all
                      > > platforms.
                      >
                      > I believe that the *internal* encoding ("encoding") can, if the
                      > various
                      > bugs are fixed, reasonably be UTF-8, unless there's outcry about
                      > memory usage. I agree that it's very important that keyboard input,
                      > file
                      > reading and writing, and so on operate in the ACP by default.

                      so, IIUC, if we want to keep keyboard input, printer output, and file
                      creation to operate by default according to the geographic locale, then one
                      thing that I can see is that 'termencoding' cannot default to empty (as it
                      can when 'encoding' defaults to the encoding defined by $LANG), it must
                      default to the keyboard's national encoding. Similarly for 'printencoding'
                      (where present and functioning), for the global side of 'fileencoding', and
                      for the non-Unicode part of 'fileencodings', which could then for instance
                      be set by default to "ucs-bom,utf-8,cp932" if cp932 is the "national"
                      encoding as defined by the Windows country settings.
                      >
                      > > Now let's say I change 'encoding' to "utf-8". With 'termencoding'
                      > > left empty (the default), gvim now suddenly expects the keyboard to
                      > > be sending UTF-8 byte sequences (because an empty 'termencoding'
                      > > means it takes the same value as whatever is the current value of
                      > > 'encoding'). Windows, however, is
                      >
                      > Right: I believe this is poor behavior for Windows. Windows input is
                      > always in the ACP[1], and if it's not, it should always be possible
                      > to find out what it is. (That is, I don't know exactly what Windows
                      > does if you
                      > have multiple keyboard mappings and change languages, but it shouldn't
                      > require special changing of tenc.)

                      WordPad is somehow able to detect it "on the fly" when I change the setting
                      of the "international keyboard" feature. AFAIK, Vim isn't, so it's simpler
                      not to touch that feature when working with Vim. OTOH, as long as
                      'termencoding' is nonempty and consistent with what the keyboard driver is
                      sending to the program, the internal 'encoding' of gvim can be changed to
                      anything compatible with what I'm doing, and in particular to UTF-8, which
                      ought to be compatible with everything (within limits: I mustn't set
                      'fileencoding' to latin1, for instance, if I've typed kanji into the
                      buffer).
                      >
                      > For example, Vim always expects data from the IME in the encoding it
                      > sends (Unicode). termencoding is not used. If I set tenc=cp1252, I
                      > can still enter Japanese kanji with the IME--Vim knows that data is
                      > always in the same format, and handles it correctly, even though it's
                      > not CP1252. Keyboard input is the same: the encoding should always
                      > be predictable.

                      I see. I think I have Windows' Global IME installed, but I don't know how to
                      use it -- how, for instance, to input an East-Asian ideogram, of which I
                      know the shape, and maybe the meaning or part of it, but not the sound. For
                      "ordinary" text input, or for keymapped text input, Vim interprets the keys
                      coming from the keyboard driver in the light of the current 'termencoding'.
                      >
                      > (I don't know if anyone is using tenc in Windows to do weird things;
                      > I can't think of any practical use for intentionally setting tenc to
                      > a value that doesn't match the ACP.)

                      Neither can I. That's why it shouldn't stay empty if and when 'encoding' is
                      changed away from the ACP.
                      >
                      > > It may be useful in itself; but until and unless it is indeed (as
                      > > you suggest) incorporated in mainline Vim source (a possibility
                      > > towards which I'm not averse as long as it doesn't break something
                      > > else), it "doesn't exist" from where I sit.
                      >
                      > That's nice, but not relevant. :) Again, I wasn't suggesting anyone
                      > use the Vim script I supplied, but only using it to demonstrate what
                      > the internal defaults could be.
                      > [...]
                      > > Indeed. The difference is virtually nil for English; it is small
                      > > but nonzero for other Latin-alphabet languages, it approaches 1 to
                      > > 2 for other-alphabet languages like Greek or Russian (a little less
                      > > than that because of spaces, commas, full stops, etc.); I don't
                      > > know the ratio for languages like Hindi (with Nagari script) or
                      > > Chinese (hanzi).
                      >
                      > The penalty is about 50% for CJK languages (two byte encodings become
                      > three byte sequences).
                      >
                      > > By the way: what do you mean by ACP? The currently "active code
                      > > page" maybe?
                      >
                      > ANSI codepage. It's the system codepage, set in the "regional
                      > settigs" control panel (or whatever; MS changes the control panels
                      > weekly). It's
                      > the codepage that "*A" (ANSI) functions expect (which are the ones Vim
                      > uses, for the most part). Essentially, the ACP is to Windows 9x as
                      > "encoding" is to Vim. In NT, everything is UCS-16 internally--or
                      > is it UTF-16?--and the "*A" functions convert to and from the ACP.

                      You can call it UCS-2 or UTF-16. I've been told there are a few differences
                      between the two, but IIUC they won't show themselves if you limit yourself
                      to valid codepoints not higher than U+FFFF.
                      >
                      > In a sense, MS did with NT what I wish Vim would do--standardize on
                      > Unicode internally, to make the internals simpler, in a way that is
                      > transparent
                      > to users.
                      >
                      > > Hm. Your "kanji in filenames" issue makes me think: could that be
                      > > related to the fact that my Netscape 7 cannot properly handle
                      > > Cyrillic letters between <title></title> HTML tags (what sits there
                      > > displays on the title bar, and anything out-of-the-way is accepted
                      > > but doesn't display properly, IIRC not even with a <meta> tag
                      > > specifying that the page is in UTF-8) but can show them with no
                      > > problems in body text, for instance between <H1></H1> (where the
                      > > title could appear again, this time to be displayed on top of the
                      > > text inside the browser window)? But this paragraph may be drifting
                      > > off-topic.
                      >
                      > It's related, but not exactly the same.
                      >
                      > Vim's problem with titlebars is that it's not converting titlebar
                      > strings to the ACP. ("桜.txt" shows up as <8d><f7>.txt, and 8df7
                      > looks like the CP932 value of 桜; I'm not entirely sure how that's
                      > happening and haven't looked at the code.) Fixing this will allow
                      > displaying characters in the ANSI codepage: a system set to Japanese
                      > will be able to display Kanji, but not Arabic.

                      ...and a system set (like mine) to a Latin codepage will be able to display
                      French (with its accents), but not Russian. That sheds some light on what I
                      experienced.
                      >
                      > For displaying full Unicode, it needs to test if Unicode is available,
                      > create a Unicode window (instead of an ANSI window), and set the title
                      > with the corresponding wide function. This isn't too hard, but it
                      > does
                      > take more work and a great deal more testing (to make sure it doesn't
                      > break anything in 9x). This would be nice, but it's above and beyond
                      > "don't break anything in UTF-8 that works in the normal ANSI
                      > codepage".

                      ...and it would probably add quite some lines of code for cross-platform
                      compatibility, since not every platform offers a full Unicode interface.
                      >
                      > Whoops. I just tried saving "桜.txt", and ended up with
                      > "(garbage)÷.txt". That explains the "<8d><f7>.txt". Looks like file
                      > saving isn't working
                      > right when enc=utf-8. This is a much more serious bug, but not one
                      > I'm
                      > up to fixing right now, as, like you, I rarely edit files with
                      > non-ASCII characters in the filename. (I'm still using 6.1, though,
                      > so this might
                      > well be fixed.)

                      Can you create that filename with Notepad.exe (Save As) or cmd.exe (copy NUL
                      filename.txt)? If not, then Vim is no worse than at least some native
                      Microsoft applications. I suppose you know (but I'm repeating) that a
                      full-featured gvim distribution for Win32 (currently gvim.exe 6.2.96 plus
                      runtime files as of 13 Sep 2003) is available from Steve Hall at
                      http://cream.sourceforge.net/vim.html . It's the most recent gvim
                      distribution for Windows known to me, with what I regard as quite a
                      user-friendly installer. It is also a "standard" gvim, not a "special Cream"
                      gvim, notwithstanding its hosting location. (And it's the one I'm using,
                      which doesn't say much, except that I can attest that I have found it to
                      work the way the help files say it should. Of course I haven't tested every
                      possible little thing though.) Finally, if it happens in the future as it
                      did in the past, Steve will continue to generate updated gvim builds from
                      time to time, and the above-mentioned page will be updated accordingly.
                      >
                      > [1] or in Unicode in NT if you use the correct Windows messages, but I
                      > don't recall which of those work in 9x (probably none)
                      >
                      > --
                      > Glenn Maynard

                      Best regards,
                      Tony.
                    • Glenn Maynard
                        Message 10 of 29, Oct 12, 2003
                        On Mon, Oct 13, 2003 at 07:28:24AM +0200, Tony Mechelynck wrote:
                        > so, IIUC, if we want to keep keyboard input, printer output, and file
                        > creation to operate by default according to the geographic locale, then one
                        > thing that I can see is that 'termencoding' cannot default to empty (as it
                        > can when 'encoding' defaults to the encoding defined by $LANG), it must
                        > default to the keyboard's national encoding. Similarly for 'printencoding'
                        > (where present and functioning), for the global side of 'fileencoding', and
                        > for the non-Unicode part of 'fileencodings', which could then for instance
                        > be set by default to "ucs-bom,utf-8,cp937" if cp937 is the "national"
                        > encoding as defined by the Windows country settings.

                        That's what I was suggesting originally, I just wasn't clear enough.

                        > You can call it UCS-2 or UTF-16. I've been told there are a few differences
                        > between the two, but IIUC they won't show themselves if you limit yourself
                        > to valid codepoints not higher than U+FFFF.

                        (Right, but the difference is significant, so I just wanted to make it
                        clear that I wasn't being precise.)

                        > ...and it would probably add quite some lines of code for cross-platform
                        > compatibility, since not every platform offers a full Unicode interface.

                        Vim already has the necessary code to convert between UTF-8 and the ACP,
                        without adding any dependencies like iconv.
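
Since Win32 has no direct UTF-8 <-> ACP conversion, the round trip has
to go through UTF-16; expressed with the system converters rather than
Vim's own tables, it would look something like this (a sketch, untested,
with minimal error handling):

    #include <windows.h>
    #include <stdlib.h>

    /* Convert a UTF-8 string to the active codepage. Returns NULL when
     * the string contains characters the codepage cannot represent
     * (the lossy case this thread keeps running into). */
    char *utf8_to_acp(const char *utf8)
    {
        int wlen, alen;
        WCHAR *wbuf;
        char *abuf;
        BOOL lost = FALSE;

        /* First hop: UTF-8 -> UTF-16. */
        wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        wbuf = (WCHAR *)malloc(wlen * sizeof(WCHAR));
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wbuf, wlen);

        /* Second hop: UTF-16 -> ACP; 'lost' is set when a default
         * character had to be substituted. */
        alen = WideCharToMultiByte(CP_ACP, 0, wbuf, -1, NULL, 0, NULL, &lost);
        abuf = (char *)malloc(alen);
        WideCharToMultiByte(CP_ACP, 0, wbuf, -1, abuf, alen, NULL, &lost);
        free(wbuf);
        if (lost)
        {
            free(abuf);
            return NULL;    /* e.g. 桜 under CP1252 */
        }
        return abuf;
    }

A NULL result would be the cue to warn the user, rather than quietly
writing a mangled name.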

                        > Can you create that filename with Notepad.exe (Save As) or cmd.exe (copy NUL
                        > filename.txt)? If not, then Vim is no worse than at least some native
                        > Microsoft applications. I suppose you know (but I'm repeating) that a

                        I can create it with notepad, and any other native graphical app that is
                        packaged with Windows. (I can also create files with filenames in any
                        language; Windows-native apps in NT are completely Unicode-based.)

I can also create it with Vim if encoding is set to CP932; this only
happens when enc=utf-8.

                        --
                        Glenn Maynard
                      • Camillo Särs
Message 11 of 29, Oct 13, 2003
                          Glenn Maynard wrote:
                          >>Can you create that filename with Notepad.exe (Save As) or cmd.exe (copy NUL
                          >>filename.txt)? If not, then Vim is no worse than at least some native
                          >>Microsoft applications. I suppose you know (but I'm repeating) that a
                          >
                          > I can create it with notepad, and any other native graphical app that is
                          > packaged with Windows. (I can also create files with filenames in any
                          > language; Windows-native apps in NT are completely Unicode-based.)

                          Correct. Additionally, you can always enter any unicode character code
                          directly from the keyboard. All that is needed is the numeric keypad in
                          numlock mode and the Alt key. This does not seem to work with Vim.

> I can also create it with Vim if encoding is set to CP932; this only
> happens when enc=utf-8.

                          That's what I noted as well. Basically vim works "ok" if you set the
                          termencoding and encoding to your codepage. However, you don't get UTF-8
                          support that way. Things break down on Windows when you use UTF-8 as your
                          encoding, as vim seems to use incorrect APIs.

                          And as noted, Win9x/ME are different. I'm only concerned with NT-based
                          Windows here, as that's where you can expect Unicode support to work.

                          To summarize:
                          - Vim on NT does not work well with unicode/utf-8.
                          - The fixes are fairly straightforward (use Unicode API, UTF-8 internally)
- Win9x needs to work in codepage mode, but that's already supported

                          Camillo
                          --
                          Camillo Särs <+ged+@...> ** Aim for the impossible and you
                          <http://www.iki.fi/+ged> ** will achieve the improbable.
                          PGP public key available **
                        • Glenn Maynard
Message 12 of 29, Oct 13, 2003
                            On Mon, Oct 13, 2003 at 10:24:01AM +0300, Camillo Särs wrote:
                            > That's what I noted as well. Basically vim works "ok" if you set the
                            > termencoding and encoding to your codepage. However, you don't get UTF-8
                            > support that way. Things break down on Windows when you use UTF-8 as your
                            > encoding, as vim seems to use incorrect APIs.

                            They don't break down, they're just imperfect.

                            > And as noted, Win9x/ME are different. I'm only concerned with NT-based
                            > Windows here, as that's where you can expect Unicode support to work.

                            Vim should support UTF-8 in 9x, too.

                            > - Vim on NT does not work well with unicode/utf-8.

                            It works well for many uses; I use enc=utf-8 exclusively, to edit files
                            in both UTF-8 (with characters well beyond CP1242 and CP932) and other
                            encodings.

                            > - The fixes are fairly straightforward (use Unicode API, UTF-8 internally)
                            > - Win9x need to work in cp mode, but that's already supported

                            No, convert between UTF-8 and the ACP and use the ANSI API calls. This
                            will make enc=utf-8 work in both 9x and NT.

                            Using Unicode calls when available is useful (eg. to display non-ACP
                            text in the titlebar), but that's "new feature" territory, not "bugfix".
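
For what it's worth, the titlebar part would be small; something like
this (a sketch, untested, fixed-size buffers for brevity):

    #include <windows.h>

    /* Set the window title from a UTF-8 string: wide API on the NT
     * family, best-effort ACP fallback on 9x. */
    void set_title_utf8(HWND hwnd, const char *title_utf8)
    {
        WCHAR wtitle[256];

        MultiByteToWideChar(CP_UTF8, 0, title_utf8, -1, wtitle, 256);
        if (GetVersion() < 0x80000000)      /* high bit clear: NT family */
            SetWindowTextW(hwnd, wtitle);   /* full Unicode titlebar */
        else
        {
            char atitle[256];

            /* 9x: squeeze through the ACP, losing what doesn't fit. */
            WideCharToMultiByte(CP_ACP, 0, wtitle, -1, atitle, 256,
                                NULL, NULL);
            SetWindowTextA(hwnd, atitle);
        }
    }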

                            --
                            Glenn Maynard
                          • Bram Moolenaar
Message 13 of 29, Oct 13, 2003
                              Glenn Maynard wrote:

                              > On Sun, Oct 12, 2003 at 10:44:05PM +0200, Tony Mechelynck wrote:
                              > > As long as 'fileencoding', 'printencoding' and (most important)
                              > > 'termencoding' default (when empty) to whatever is the current value of
                              > > 'encoding', the latter must not (IMHO) be set to UTF-8 by default.
                              > >
                              > > (Let's spell it out) In my humble opinion, Vim should require as little
                              > > "tuning" as possible to handle the language interfaces the same way as the
                              > > operating system does, and this means that, when the user sets nothing else
                              > > in his startup and configuration files, keyboard input, printer output and
                              > > file creation should default to whatever is set in the locale.
                              >
                              > This is a trivial fix, which I already proposed many months ago: the
                              > defaults in Windows should be the results of
                              >
                              > exe "set fileencodings=ucs-bom,utf-8,cp" . getacp() . ",latin1"
                              > exe "set fileencoding=cp" . getacp()
                              >
                              > and now adding:
                              >
                              > exe "set printencoding=cp" . getacp()

The default that Vim starts with is 'encoding' set to the active
codepage and 'fileencodings' set to "ucs-bom". This means it falls back
to 'encoding' when there is no BOM. That should work almost the same
way as what you give here, but without the explicit use of the codepage
name. When the user sets 'encoding' the other ones follow. In your
example the user has to set all three options.

Perhaps setting 'termencoding' can be omitted if we can use the Unicode
functions for keyboard input. Perhaps someone can figure out how to do
this properly. And make sure the input methods still work!
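
Something along these lines, perhaps (a rough sketch, untested, and it
deliberately ignores the input-method question):

    #include <windows.h>

    /* Stand-in for pushing a character onto Vim's input queue. */
    static void feed_input_utf16(WCHAR c) { (void)c; }

    /* If the window class is registered with RegisterClassW (NT only),
     * WM_CHAR delivers UTF-16 code units instead of ACP bytes, so the
     * keyboard no longer depends on the active codepage. */
    LRESULT CALLBACK unicode_wnd_proc(HWND hwnd, UINT msg,
                                      WPARAM wParam, LPARAM lParam)
    {
        if (msg == WM_CHAR)
        {
            /* Surrogate halves for characters beyond U+FFFF arrive
             * as two consecutive WM_CHAR messages. */
            feed_input_utf16((WCHAR)wParam);
            return 0;
        }
        return DefWindowProcW(hwnd, msg, wParam, lParam);
    }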

                              > Note that "getacp" is a function in a patch I sent which was lost or
                              > forgotton: return the ANSI codepage.

                              Can't recall that patch. I generally give OS-specific additions a low
                              priority.
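
(The core of such a function would be little more than a wrapper around
GetACP(); a sketch, since the patch itself isn't in this thread:)

    #include <windows.h>
    #include <stdio.h>

    /* Build an encoding name like "cp1252" from the ANSI codepage. */
    void get_acp_name(char *buf, size_t buflen)
    {
        _snprintf(buf, buflen, "cp%u", (unsigned)GetACP());
    }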

                              > Switching "encoding" to "utf-8" should be transparent, once proper
                              > conversions for win32 calls are in place. Regular users don't care
                              > about what encoding their editor uses internally, any more than they
                              > care about what type of data structures they use.

                              The problem still is that conversion from and to UTF-8 is not
                              transparent. Especially when editing files with an unknown encoding.

                              > On the other hand, if utf-8 internally is fully supported, then utf-8
                              > can be the *only* internal encoding--which would make the rendering
                              > code much simpler and more robust. I remember finding lots of little
                              > errors in the renderer (eg. underlining glitches for double-width
                              > characters) that went away with utf-8, and I don't think Vim renders
                              > correctly at all if eg. "encoding" is set to "cp1242" and the ACP
                              > is CP932 (needs a double conversion).

UTF-8 is already fully supported in Vim. There may be a few glitches in
the conversions though. The clipboard also still doesn't work 100%.

                              --
                              hundred-and-one symptoms of being an internet addict:
                              182. You may not know what is happening in the world, but you know
                              every bit of net-gossip there is.

                              /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                              /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                              \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                              \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                            • Camillo Särs
Message 14 of 29, Oct 13, 2003
                                Glenn Maynard wrote:
                                > They don't break down, they're just imperfect.

                                Well, if I can't write a filename the way I need to write it, I have a
                                problem. Fortunately this is mostly theoretic for me, but for some users
                                resorting to plain us-ascii is not a possibility. These mails are more an
                                attempt at getting vim to work better than to improve my life. After all,
                                I believe in contributing when I can, if only by highlighting problems and
                                proposing solutions.

                                > Vim should support UTF-8 in 9x, too.

                                Of course, but with the necessary restrictions. Displaying unicode is a
                                problem, as is entering filenames. Those functions are restricted to the
                                ACP on Win9x.

                                >>- Vim on NT does not work well with unicode/utf-8.
                                >
                                > It works well for many uses; I use enc=utf-8 exclusively, to edit files
                                > in both UTF-8 (with characters well beyond CP1242 and CP932) and other
                                > encodings.

                                Yes, editing is not the problem. It's the system calls that cause the
                                trouble, as we have established.

                                >>- The fixes are fairly straightforward (use Unicode API, UTF-8 internally)
                                >>- Win9x need to work in cp mode, but that's already supported
                                >
                                > No, convert between UTF-8 and the ACP and use the ANSI API calls. This
                                > will make enc=utf-8 work in both 9x and NT.

No, it will not. You would then restrict NT users to their local code
page only, and that's almost "reverting to DOS". On Win9x we need to
stick to the ACP, but on NT I don't see any reason not to go Unicode.
Also, the UTF-8 to UCS-2 mapping is quick and straightforward, with few
hidden catches. Mapping utf-8 to ACP is tricky and lossy.
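
To show how mechanical the BMP part of the mapping is (a sketch that
skips validation of the continuation bytes):

    /* Decode one UTF-8 sequence to a UCS-2 code point; returns the
     * number of bytes consumed, or -1 for sequences outside UCS-2. */
    int utf8_decode_ucs2(const unsigned char *p, unsigned short *out)
    {
        if (p[0] < 0x80)
        {
            *out = p[0];
            return 1;
        }
        if ((p[0] & 0xE0) == 0xC0)
        {
            *out = ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
            return 2;
        }
        if ((p[0] & 0xF0) == 0xE0)
        {
            *out = ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6)
                                         | (p[2] & 0x3F);
            return 3;
        }
        /* 4-byte sequences need surrogates: the UCS-2 vs. UTF-16
         * difference mentioned earlier in the thread. */
        return -1;
    }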

                                Also, the code you had implemented already used the "W" APIs correctly. I
                                don't understand why you would now advocate dropping widechar and unicode
                                support.

                                > Using Unicode calls when available is useful (eg. to display non-ACP
                                > text in the titlebar), but that's "new feature" territory, not "bugfix".

                                It is a bugfix. Currently, when using UTF-8 on WinNT, vim is broken in (at
                                least) the following regards:

                                - Opening non-ascii filenames, regardless of codepage
                                å.txt internally becomes <e5>.txt

                                - Saving filenames
                                å.txt is saved in UTF-8 format (Ã¥.txt) and displayed incorrectly in
                                title bar

                                - The default termencoding should be set intelligently, UTF-8 as
                                termencoding breaks input of non-ascii.

                                - The default fileencoding breaks when "going UTF-8", most probably a
                                better behavior would be to default to the ACP always.

                                - Also, my vim (6.2) defaults to "latin1", not my current codepage. That
                                would indicate that the ACP detection does not work.

                                OK, the list above sounds like whining, but earlier I did suggest that the
                                fixes are fairly straightforward.

                                On WinNT, vim should use unicode apis, essentially benefitting
                                automatically from NT native Unicode. This only involves one additional
                                encoding/decoding step before calling the apis.
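
Concretely, the extra step is just a widening conversion in front of the
wide call; something like this (a sketch, untested, error handling
omitted):

    #include <windows.h>
    #include <stdio.h>

    /* Open a file whose name is UTF-8, by way of UTF-16 (NT's native
     * string encoding). NT only: 9x has no wide file functions. */
    FILE *utf8_fopen_nt(const char *name_utf8, const char *mode)
    {
        WCHAR wname[MAX_PATH];
        WCHAR wmode[8];

        MultiByteToWideChar(CP_UTF8, 0, name_utf8, -1, wname, MAX_PATH);
        MultiByteToWideChar(CP_ACP, 0, mode, -1, wmode, 8);
        return _wfopen(wname, wmode);
    }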

On Win9x, vim should use ANSI apis. The only thing missing is again the
encoding/decoding, although it's trickier with the ANSI apis. There are
many cases where a user would enter UTF-8 stuff that doesn't smoothly
convert to the current CP. I think vim's current code should detect that
easily.

                                Camillo
                                --
                                Camillo Särs <+ged+@...> ** Aim for the impossible and you
                                <http://www.iki.fi/+ged> ** will achieve the improbable.
                                PGP public key available **
                              • Bram Moolenaar
Message 15 of 29, Oct 13, 2003
                                  Camillo wrote:

                                  > > Vim should support UTF-8 in 9x, too.
                                  >
                                  > Of course, but with the necessary restrictions. Displaying unicode is a
                                  > problem, as is entering filenames. Those functions are restricted to the
                                  > ACP on Win9x.

                                  On Windows NT/XP there are also restrictions, especially when using
                                  non-NTFS filesystems. There was a discussion about this in the Linux
                                  UTF-8 maillist a long time ago. There was no good universal solution
                                  for handling filenames that they could come up with.

                                  Vim could use Unicode functions for accessing files, but this will be a
                                  huge change. Requires lots of testing. Main problem is when 'encoding'
                                  is not a Unicode encoding, then conversions need to be done, which may
                                  fail.

                                  If you use filenames that cannot be represented in the active codepage,
                                  you probably have problems with other programs. Thus sticking with the
                                  active codepage functions isn't too bad. But then Vim needs to convert
                                  from 'encoding' to the active codepage!

                                  > It is a bugfix. Currently, when using UTF-8 on WinNT, vim is broken in (at
                                  > least) the following regards:
                                  >
                                  > - Opening non-ascii filenames, regardless of codepage
                                  > å.txt internally becomes <e5>.txt
                                  >
                                  > - Saving filenames
                                  > å.txt is saved in UTF-8 format (Ã¥.txt) and displayed incorrectly in
                                  > title bar

                                  The file names are handled as byte strings. Thus so long as you use the
                                  right bytes it should work. Problem is when you are typing/editing with
                                  a different encoding from the active codepage.

                                  > - The default termencoding should be set intelligently, UTF-8 as
                                  > termencoding breaks input of non-ascii.

                                  Why would 'termencoding' be "utf-8"? This won't work, unless you are
                                  using an xterm on MS-Windows. The default 'termencoding' is empty,
                                  which means 'encoding' is used. There is no better default. When you
                                  change 'encoding' you might have to change 'termencoding' as well, but
                                  this depends on your situation.

                                  > - The default fileencoding breaks when "going UTF-8", most probably a
                                  > better behavior would be to default to the ACP always.

                                  'fileencoding' is set when reading a file. Perhaps you mean
                                  'fileencodings'? This one needs to be tweaked by the user, because it
                                  depends on what kind of files you edit. Main problem is that an ASCII
                                  file can be any encoding, Vim can't detect what it is, thus the user has
                                  to specify what he wants Vim to do with it.

                                  > - Also, my vim (6.2) defaults to "latin1", not my current codepage. That
                                  > would indicate that the ACP detection does not work.

                                  Where does it use "latin1"? Not in 'encoding', I suppose.

                                  > OK, the list above sounds like whining, but earlier I did suggest that the
                                  > fixes are fairly straightforward.

Mostly it's quite a bit more complicated. Different users have
different situations; it is hard to think of solutions that work for
most people.

                                  > On WinNT, vim should use unicode apis, essentially benefitting
                                  > automatically from NT native Unicode. This only involves one additional
                                  > encoding/decoding step before calling the apis.

                                  The problem is that conversions to/from Unicode only work when you know
                                  the encoding of the text you are converting. The encoding isn't always
                                  known. Vim sometimes uses "latin1", so that you at least get 8-bit
                                  clean editing, even though the actual encoding is unknown.

                                  > On Win9x, vim should use ANSI apis. The only thing missing is again the
                                  > encoding/decoding, although it's trickier with the ANSI apis. There are
> many cases where a user would enter UTF-8 stuff that doesn't smoothly
                                  > convert to the current CP. I think vim's current code should detect that
                                  > easily.

                                  You can use a few Unicode functions on Win9x, we already do. I don't
                                  see a reason to change this.

                                  --
                                  I'm in shape. Round IS a shape.

                                  /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                  /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                                  \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                  \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                                • Camillo Särs
Message 16 of 29, Oct 13, 2003
                                    Bram Moolenaar wrote:
                                    > On Windows NT/XP there are also restrictions, especially when using
                                    > non-NTFS filesystems.

Right, I forgot about those. AFAIK, the functions do not fail silently in
those cases, so it's just (yet) more work. Essentially, file names then
come from a restricted charset (code page limits).

                                    > There was a discussion about this in the Linux UTF-8 maillist a
                                    > long time ago. There was no good universal solution
                                    > for handling filenames that they could come up with.

                                    I bet. For many systems, the current behavior is adequate even if
                                    technically speaking wrong. I'm not trying to propose a universal
                                    solution, I'm just advocating the view that on win32, vim should do the
                                    "windows thing" with unicode/utf-8.

                                    > Vim could use Unicode functions for accessing files, but this will be a
                                    > huge change.

                                    Why so? The code earlier in this thread probably did much of what is
                                    needed. It also involved numerous other changes, which I ignored. I'm not
                                    being nosy, I'm just curious why this would be a "huge change". It's not
                                    the file contents we are getting at, it's the filenames (and the GUI).

Also note that when using the native code page as the encoding (read:
latin1), using the ANSI functions does work as expected. So the fixes
would only need to concern the UTF-8 encoding, if you get picky. :)

                                    > Requires lots of testing.

That's unicode for you. However, deriving a decent test set using
available unicode test files should be a fairly straightforward thing.

                                    > Main problem is when 'encoding' is not a Unicode encoding, then conversions
                                    > need to be done, which may fail.

But what I assume you are doing now is even worse, isn't it? Essentially
you are feeding some user-selected encoding to functions that require
ANSI characters. How's that for "a lot of testing"?

Conversions from almost any encoding to unicode should work. I would not
expect major trouble there. And note that if the conversion from the
encoding to unicode fails, I expect that the current usage would fail even
more severely. And there haven't been reports of that, have there?

                                    There certainly are tricky encodings that could cause problems. However,
                                    I'm mostly concerned with the basic use case of utf-8 and
                                    "fileencodings=ucs-bom,utf-8,latin1". This under a code page of cp1252.

                                    > If you use filenames that cannot be represented in the active codepage,
                                    > you probably have problems with other programs.

                                    But I have filenames that can be represented in the active code page
                                    (å.txt), but which get encoded into incompatible UTF-8 characters!

                                    > Thus sticking with the active codepage functions isn't too bad.

                                    If it worked that way, but it doesn't. Setting "encoding=utf-8" changes
                                    that behavior - only us-ascii is usable in filenames.

                                    > But then Vim needs to convert from 'encoding' to the active codepage!

That would help most users. Including me. But it would not be the
"ultimate" solution to unicode on win32, as it would still cause trouble
with characters outside the codepage. As I see it, the easiest fix is
actually using the unicode-api, as there are fewer (or no) conversion
failures that way.

                                    > The file names are handled as byte strings. Thus so long as you use the
                                    > right bytes it should work. Problem is when you are typing/editing with
                                    > a different encoding from the active codepage.

                                    My point exactly! :)

                                    > Why would 'termencoding' be "utf-8"? This won't work, unless you are
                                    > using an xterm on MS-Windows.

Yeah, but that's what you get if you just blindly do "set encoding=utf-8".
Took me a while to figure that one out. I need to do "set
termencoding=cp1252" first, or use "let &termencoding = &encoding". Not
exactly transparent to non-experts.

                                    > The default 'termencoding' is empty, which means 'encoding' is used.
                                    > There is no better default.

                                    On Windows, I'd say "detect active code page" is the right choice.

                                    > When you change 'encoding' you might have to change 'termencoding' as
                                    > well, but this depends on your situation.

                                    As noted above, that's the unintuitive behavior I was getting at. A
                                    windows user, knowing that unicode is the native charset, does a "set
                                    encoding=utf-8" and expects things to work. They don't, but depending on
                                    the language, it may take a while before a non-ascii character is entered.

                                    >>- The default fileencoding breaks when "going UTF-8", most probably a
                                    >>better behavior would be to default to the ACP always.
                                    >
                                    > 'fileencoding' is set when reading a file. Perhaps you mean
                                    > 'fileencodings'? This one needs to be tweaked by the user, because it
                                    > depends on what kind of files you edit. Main problem is that an ASCII
                                    > file can be any encoding, Vim can't detect what it is, thus the user has
                                    > to specify what he wants Vim to do with it.

                                    Yes, I was unclear. Let me elaborate, although this point is rather
                                    exotic, and you can safely ignore me. :)

                                    When setting "encoding=utf-8", any new files will suddenly be utf-8 as
                                    well. For "ordinary" windows users, this may not be the desired result.
                                    What I was getting at was that *perhaps* the default fileencoding should be
                                    "cp####" in this case, unless the user explicitly sets it to something else
                                    (presumably utf-8). Before you object, yes, that's silly.

                                    Why use "encoding=utf-8" if you still want to create new files as ANSI?
                                    Well, quite a few windows applications don't do UTF-8. But using UTF-8
                                    internally still allows users to *transparently* edit existing
                                    unicode/utf-8 files without conversions.

                                    Anyway, I digress. This thought of mine was not that bright. Just forget it.

                                    >>- Also, my vim (6.2) defaults to "latin1", not my current codepage. That
                                    >>would indicate that the ACP detection does not work.
                                    >
                                    > Where does it use "latin1"? Not in 'encoding', I suppose.

                                    Yes. Without a _vimrc, I get:
                                    encoding=latin1
                                    fileencodings=ucs-bom
                                    termencoding=

                                    Thus changing the encoding only has funny effects.

                                    > Mostly it's quite more complicated. Different users have different
                                    > situations, it is hard to think of solutions that work for most people.

                                    Well, if you decide to make the unicode implementation work as it should,
                                    most people should be able to get what they want. It might involve a bit
                                    of tweaking, but nothing more.

                                    > The problem is that conversions to/from Unicode only work when you know
                                    > the encoding of the text you are converting. The encoding isn't always
                                    > known. Vim sometimes uses "latin1", so that you at least get 8-bit
                                    > clean editing, even though the actual encoding is unknown.

                                    I claim that on Windows, you should always have a good idea of the
                                    encoding. It's either explicitly set by the user, "cp####", or unicode.
                                    Windows has good support for converting ANSI to unicode, so this should be
                                    a non-issue. And again, as this is about non-UTF-8 data, you already have
                                    this problem anyway, because you are calling the ANSI functions with the
                                    "unknown" data. That it works should prove my point. ;-)

                                    But in the universal case, I agree with you.

>>On Win9x, vim should use ANSI apis. The only thing missing is again the
>>encoding/decoding, although it's trickier with the ANSI apis. There are
>>many cases where a user would enter UTF-8 stuff that doesn't smoothly
>>convert to the current CP. I think vim's current code should detect that
>>easily.
                                    >
                                    > You can use a few Unicode functions on Win9x, we already do. I don't
                                    > see a reason to change this.

                                    Sorry, I didn't want to imply that. I agree that we should stick to the
                                    unicode functions that are supported on Win9x, and only revert to ANSI
                                    "when forced".

                                    Camillo
                                    --
                                    Camillo Särs <+ged+@...> ** Aim for the impossible and you
                                    <http://www.iki.fi/+ged> ** will achieve the improbable.
                                    PGP public key available **
                                  • Bram Moolenaar
Message 17 of 29, Oct 13, 2003
                                      Camillo wrote:

                                      > > Vim could use Unicode functions for accessing files, but this will be a
                                      > > huge change.
                                      >
                                      > Why so? The code earlier in this thread probably did much of what is
                                      > needed. It also involved numerous other changes, which I ignored. I'm not
                                      > being nosy, I'm just curious why this would be a "huge change". It's not
                                      > the file contents we are getting at, it's the filenames (and the GUI).

                                      Because every fopen(), stat() etc. will have to be changed.

                                      > Also note that when using the native code page as the encoding (read:
> latin1), using the ANSI functions does work as expected. So the fixes would
                                      > only need to concern the UTF-8 encoding, if you get picky. :)

                                      This only means extra work, since an "if (encoding == ...)" has to be
                                      added to select between the traditional file access method and the
                                      Unicode method.
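
In outline, every such primitive grows a wrapper like this (a sketch;
enc_is_utf8() is a made-up stand-in, and utf8_fopen_nt() is the widening
sketch from earlier in the thread, not Vim's real names):

    #include <windows.h>
    #include <stdio.h>

    extern int enc_is_utf8(void);   /* stand-in: 'encoding' == "utf-8"? */
    extern FILE *utf8_fopen_nt(const char *name, const char *mode);

    /* Select between the traditional byte-string path and the Unicode
     * path: the "if (encoding == ...)" described above. */
    FILE *vim_fopen_sketch(const char *name, const char *mode)
    {
        if (enc_is_utf8() && GetVersion() < 0x80000000)  /* utf-8 on NT */
            return utf8_fopen_nt(name, mode);    /* Unicode method */
        return fopen(name, mode);                /* traditional method */
    }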

                                      > > Requires lots of testing.
                                      >
                                      > That's unicode for you. However, deriving a decent test set using
> available unicode test files should be a fairly straightforward thing.

                                      No, it's actually impossible to test this automatically. It involves
                                      creating various Win32 environments with code page settings, network
                                      filesystems and installed libraries. Only end-user tests can discover
                                      the real problems.

                                      > > Main problem is when 'encoding' is not a Unicode encoding, then conversions
                                      > > need to be done, which may fail.
                                      >
> But what I assume you are doing now is even worse, isn't it? Essentially
> you are feeding some user-selected encoding to functions that require
> ANSI characters. How's that for "a lot of testing"?

                                      The currently used functions work fine for accessing existing files.
                                      It's only when typing a new name or when displaying the name that
                                      problems may occur.

                                      > Conversions from almost any encoding to unicode should work. I would not
                                      > expect major trouble there. And note that if the conversion from the
                                      > encoding to unicode fails, I expect that the current usage would fail even
> more severely. And there haven't been reports of that, have there?

                                      Main problem is that sometimes we don't know what the encoding is. In
                                      that situation you can treat the filename as a sequence of bytes in most
                                      places, but conversion is impossible. This happens more often than you
                                      would expect. Put a floppy disk or CD into your computer...

                                      There is also the situation that Vim uses the active codepage, but the
                                      file is actually in another encoding that could not be detected. Then
                                      doing "gf" on a filename will work if you don't do conversion, but it
                                      will fail if you try converting with the wrong encoding in mind.

                                      > > Thus sticking with the active codepage functions isn't too bad.
                                      >
                                      > If it worked that way, but it doesn't. Setting "encoding=utf-8" changes
                                      > that behavior - only us-ascii is usable in filenames.

                                      I don't see why. You can use a file selector to open any file and write
                                      it back under the same name. Vim doesn't need to know the encoding of
                                      the filename that way.

                                      If you type a file name in utf-8 it won't work properly, thus you have
                                      to use another method to obtain the file name. It's clumsy, I know.

                                      > > But then Vim needs to convert from 'encoding' to the active codepage!
                                      >
                                      > That would help most users. Including me. But it would not be the
                                      > "ultimate" solution to unicode on win32, as it would still cause trouble
                                      > with characters outside the codepage. As I see it, the easiest fix is
> actually using the unicode-api, as there are fewer (or no) conversion
                                      > failures that way.

                                      As said above, this only works if we are 100% sure of what encoding the
                                      text (filename) is in, and we don't always know that.

                                      > > Why would 'termencoding' be "utf-8"? This won't work, unless you are
                                      > > using an xterm on MS-Windows.
                                      >
                                      > Yeah, but that's what you get if you just blindly do "set encoding=utf-8".
                                      > Took me a while to figure that one out. I need to do "set
                                      > termencoding=cp1252" first, or the "let &termencoding = &encoding". Not
                                      > exactly transparent to non-experts.

                                      Setting 'encoding' is full of side effects. There is a clear warning in
                                      the docs about this.

                                      > > The default 'termencoding' is empty, which means 'encoding' is used.
                                      > > There is no better default.
                                      >
                                      > On Windows, I'd say "detect active code page" is the right choice.

                                      I remember this was proposed before, I can't remember why we didn't do
                                      it this way. Windows is different here, since we can find out what the
                                      active codepage is. On Unix it's not that clear (e.g., depends on what
                                      options the xterm was started with). Consistency between systems is
                                      preferred.

                                      > >>- Also, my vim (6.2) defaults to "latin1", not my current codepage. That
                                      > >>would indicate that the ACP detection does not work.
                                      > >
                                      > > Where does it use "latin1"? Not in 'encoding', I suppose.
                                      >
                                      > Yes. Without a _vimrc, I get:
                                      > encoding=latin1
                                      > fileencodings=ucs-bom
                                      > termencoding=
                                      >
                                      > Thus changing the encoding only has funny effects.

                                      Your active codepage must be latin1 then. Vim gets the default from the
                                      active codepage.

                                      --
                                      hundred-and-one symptoms of being an internet addict:
                                      192. Your boss asks you to "go fer" coffee and you come up with 235 FTP sites.

                                      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                      /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                                      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                      \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                                    • Camillo Särs
Message 18 of 29, Oct 13, 2003
                                        Bram Moolenaar wrote:
                                        > Because every fopen(), stat() etc. will have to be changed.

                                        Right. You're not using Windows apis, of course. But to do things
                                        correctly, you would have to make sure that the fopen() etc.
                                        implementations [in Windows] either convert the strings they receive or
                                        only are called with valid Windows file names. Converting internally may
                                        be risky, because you'd need a way to convey the encoding into the functions.

                                        > Main problem is that sometimes we don't know what the encoding is.

                                        On Windows? I would disagree here. Any filesystem mounted by Windows
                                        should be mounted in a way that adheres to Windows naming conventions.
                                        We're not discussing file contents here.

                                        > In that situation you can treat the filename as a sequence of bytes in most
                                        > places, but conversion is impossible. This happens more often than you
                                        > would expect. Put a floppy disk or CD into your computer...

So why convert it? :) The current display/saving problems stem from the
fact that the file name is interpreted as UTF-8, an encoding which Windows
does not recognize for file names or strings.

                                        > There is also the situation that Vim uses the active codepage, but the
                                        > file is actually in another encoding that could not be detected. Then
                                        > doing "gf" on a filename will work if you don't do conversion, but it
                                        > will fail if you try converting with the wrong encoding in mind.

AFAIK, Windows will internally convert the path into Unicode if you call
the ANSI function. Thus if gf succeeds as you describe, it should succeed
if you use the unicode api as well. In both cases an 8-bit binary string
undergoes "cp2unicode" conversion.

                                        > I don't see why. You can use a file selector to open any file and write
                                        > it back under the same name.

                                        Uhm. Thanks. I'm so used to using :edit and :view that this alternative
                                        hadn't even crossed my mind.

                                        > If you type a file name in utf-8 it won't work properly, thus you have
                                        > to use another method to obtain the file name. It's clumsy, I know.

It's a workaround, at least. But my title bar is still a mess.

                                        > As said above, this only works if we are 100% sure of what encoding the
                                        > text (filename) is in, and we don't always know that.

                                        We should be sure. And *if* we get it wrong, the user should be able to
                                        correct it.

                                        > I remember this was proposed before, I can't remember why we didn't do
                                        > it this way. Windows is different here, since we can find out what the
                                        > active codepage is. On Unix it's not that clear (e.g., depends on what
                                        > options the xterm was started with). Consistency between systems is
                                        > preferred.

I would disagree on consistency here. On windows, the encoding is either
ANSI or unicode, or else it has been explicitly set to something known.
And as long as we know the encoding, let's use it.

                                        > Your active codepage must be latin1 then. Vim gets the default from the
                                        > active codepage.

                                        My code page is cp1252. It's not latin1 (iso-8859-1). In practice, both
                                        are 8-bit-raw.

                                        Camillo
                                        --
                                        Camillo Särs <+ged+@...> ** Aim for the impossible and you
                                        <http://www.iki.fi/+ged> ** will achieve the improbable.
                                        PGP public key available **
                                      • Tony Mechelynck
Message 19 of 29, Oct 13, 2003
                                          Bram Moolenaar <Bram@...> wrote:
                                          > Camillo wrote:
                                          [...]
                                          > > - The default termencoding should be set intelligently, UTF-8 as
                                          > > termencoding breaks input of non-ascii.
                                          >
                                          > Why would 'termencoding' be "utf-8"? This won't work, unless you are
                                          > using an xterm on MS-Windows. The default 'termencoding' is empty,
                                          > which means 'encoding' is used. There is no better default. When you
                                          > change 'encoding' you might have to change 'termencoding' as well, but
                                          > this depends on your situation.
                                          [...]

                                          Glenn Maynard wants 'encoding' to default to "utf-8" regardless of the
                                          active codepage. IMHO this would require 'termencoding' to default, not to
                                          the empty string, but to what is currently the default 'encoding', namely
the active codepage. Such a change in the 'termencoding' default would (again,
                                          IMHO) be a GoodThing anyway, since it would allow the keyboard to go on
                                          working whether or not the user alters 'encoding'. Of course it is already
                                          possible to do

if &termencoding == ""
  let &termencoding = &encoding
endif

but wouldn't it make it easier for the user (more user-friendly) to have
                                          'termencoding' default to the ACP not implicitly (&termencoding == "" and
                                          'encoding' set to the ACP) but explicitly (by defaulting 'termencoding' to a
                                          nonempty value representing the active codepage)? -- And it would make the
                                          above "if" statement unnecessary but not harmful, so existing scripts should
                                          not be broken.

                                          Regards,
                                          Tony.
                                        • Tony Mechelynck
Message 20 of 29, Oct 13, 2003
                                            Camillo Särs <ged@...> wrote:
                                            > Bram Moolenaar wrote:
                                            [...]
> > Why would 'termencoding' be "utf-8"? This won't work, unless you are
> > using an xterm on MS-Windows.
                                            >
                                            > Yeah, but that's what you get if you just blindly do "set
                                            > encoding=utf-8". Took me a while to figure that one out. I need to
                                            > do "set termencoding=cp1252" first, or the "let &termencoding =
                                            > &encoding". Not exactly transparent to non-experts.

                                            Took me some figuring too. A few hours ago I uploaded my solution to
                                            vim-online (set_utf8.vim,
                                            http://vim.sourceforge.net/scripts/script.php?script_id=789 ). I hope it
                                            will make it transparent to non-experts. Yet I still believe that defaulting
                                            'termencoding' to the locale's charset would be better than leaving it
                                            empty -- and such a change wouldn't break the above-mentioned script; you're
                                            welcome to look at its source.
                                            >
                                            > > The default 'termencoding' is empty, which means 'encoding' is used.
                                            > > There is no better default.
                                            >
                                            > On Windows, I'd say "detect active code page" is the right choice.
                                            >
                                            > > When you change 'encoding' you might have to change 'termencoding'
                                            > > as well, but this depends on your situation.
                                            >
                                            > As noted above, that's the unintuitive behavior I was getting at. A
                                            > windows user, knowing that unicode is the native charset, does a "set
                                            > encoding=utf-8" and expects things to work. They don't, but
                                            > depending on the language, it may take a while before a non-ascii
                                            > character is entered.
                                            [...]

                                            Regards,
                                            Tony.
                                          • Bram Moolenaar
                                             ... A file name may appear in a file (e.g., a list of files in a README file). And I don't know what happens with file names on removable media (e.g., a CD).
                                            Message 21 of 29 , Oct 13, 2003
                                            • 0 Attachment
                                              Camillo wrote:

                                              > > Main problem is that sometimes we don't know what the encoding is.
                                              >
                                              > On Windows? I would disagree here. Any filesystem mounted by Windows
                                              > should be mounted in a way that adheres to Windows naming conventions.
                                              > We're not discussing file contents here.

                                              A file name may appear in a file (e.g., a list of files in a README
                                              file). And I don't know what happens with file names on removable media
                                              (e.g., a CD). Probably depends on the file system it contains. And
                                              networked file systems is another problem.

                                              > > In that situation you can treat the filename as a sequence of bytes in most
                                              > > places, but conversion is impossible. This happens more often than you
                                              > > would expect. Put a floppy disk or CD into your computer...
                                              >
                                              > So why convert it? :) The current display/saving problems stem from the
                                              > fact that the file name is interpreted as UTF-8, a coding which Windows
                                              > does not recognize for file names or strings.

                                              We need to locate places where the encoding is different from what a
                                              system function expects. There are still a few things that need to be
                                              fixed.

                                              > > There is also the situation that Vim uses the active codepage, but the
                                              > > file is actually in another encoding that could not be detected. Then
                                              > > doing "gf" on a filename will work if you don't do conversion, but it
                                              > > will fail if you try converting with the wrong encoding in mind.
                                              >
                                              > AFAIK, Windows will internally convert the path into Unicode if you call
                                              > the ANSI function. Thus if gf succeeds as you describe, it should succeed
                                              > if you use the unicode api as well. In both cases a 8-bit binary string
                                              > undergoes "cp2unicode" conversion.

                                              If Vim defaults to the active codepage then conversion to Unicode would
                                              do the same as using the ANSI function. Thus it's only a problem when
                                              'encoding' is different from the active codepage. And when 'encoding'
                                              is a Unicode variant we can use the "W" functions. Still, this means
                                              all fopen() and stat() calls must be adjusted. When 'encoding' is not
                                              the active codepage we could either leave the file name untranslated (as
                                              it's now) or convert it to Unicode. Don't know which one would work
                                              best...
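
                                              In code form, the choice being weighed is roughly this (a sketch with
                                              invented names, not Vim source; it merely restates the cases above):

                                              /* How a file name held in 'encoding' reaches the file system. */
                                              enum fname_strategy { FN_RAW, FN_WIDE, FN_TO_ACP };

                                              enum fname_strategy pick_strategy(int enc_is_acp, int enc_is_unicode)
                                              {
                                                  if (enc_is_acp)
                                                      return FN_RAW;      /* bytes already match the ANSI calls */
                                                  if (enc_is_unicode)
                                                      return FN_WIDE;     /* convert to UTF-16, use "W" functions */
                                                  return FN_TO_ACP;       /* or FN_RAW: the open question above */
                                              }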

                                              > > Your active codepage must be latin1 then. Vim gets the default from the
                                              > > active codepage.
                                              >
                                              > My code page is cp1252. It's not latin1 (iso-8859-1). In practice, both
                                              > are 8-bit-raw.

                                              cp1252 and latin1 are not identical, but for practical use they can be
                                              handled as the same encoding. Vim indeed uses this as the "raw" 8-bit
                                              encoding that avoids messing up your characters when you don't know what
                                              encoding it actually is.

                                              --
                                              hundred-and-one symptoms of being an internet addict:
                                              194. Your business cards contain your e-mail and home page address.

                                              /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                              /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                                              \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                              \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                                            • Glenn Maynard
                                               Note that I've upgraded, and I'm not having problems with files saving incorrectly in enc=utf-8. The remaining problems are mostly cosmetic, except for not
                                              Message 22 of 29 , Oct 13, 2003
                                              • 0 Attachment
                                                Note that I've upgraded, and I'm not having problems with files saving
                                                incorrectly in enc=utf-8. The remaining problems are mostly cosmetic,
                                                except for not being able to ":w 漢字.txt" with the ACP being Japanese.

                                                On Mon, Oct 13, 2003 at 02:25:04PM +0200, Bram Moolenaar wrote:
                                                > Because every fopen(), stat() etc. will have to be changed.

                                                I don't think handling Unicode in filenames is worth it in Windows. It
                                                takes so much work that the only applications I know of that support it
                                                are ones that are compiled as native Unicode apps. The only exception
                                                I've seen is FB2k.

                                                It's certainly useful to be able to have multilingual filenames, but
                                                Windows makes it so hard that people really wanting to do that probably
                                                need a new OS.

                                                > I don't see why. You can use a file selector to open any file and write
                                                > it back under the same name. Vim doesn't need to know the encoding of
                                                > the filename that way.

                                                Consider the case where a filename in NT contains illegal data, eg. an
                                                invalid two-byte SJIS sequence. When you call NT ANSI system calls, it
                                                converts the buffers you pass it to WCHAR. That conversion would fail.

                                                Are you worried about not being able to open files off eg. a slightly
                                                corrupt/malformed floppy disc containing filenames that won't convert
                                                cleanly? That seems no worse than not being able to use non-ACP
                                                filenames. If that works, it seems a poor trade for not being able to
                                                enter non-ASCII filenames in utf-8. ":w 漢字.txt" responding with
                                                '"漢字.txt" [New]' and writing the filename correctly seems pretty
                                                fundamental, for Japanese users on Japanese systems, and that doesn't
                                                work with enc=utf-8.

                                                > I remember this was proposed before, I can't remember why we didn't do
                                                > it this way. Windows is different here, since we can find out what the
                                                > active codepage is. On Unix it's not that clear (e.g., depends on what
                                                > options the xterm was started with). Consistency between systems is
                                                > preferred.

                                                Windows and Unix handle encodings fundamentally differently, so complete
                                                consistency means one or the other system not working as well. It seems
                                                like "consistency to a fault". :)

                                                Here's what I see, though: Windows APIs are always giving ACP or Unicode
                                                data. Vim honors that for some code paths: input methods, copying to
                                                and from the system clipboard. It ignores it and uses Unix paradigms
                                                for others: filenames, most other ANSI calls.

                                                The former, in my experience, work consistently; I can enter text with
                                                the IME in both UTF-8 and CP932, and copy and paste reliably. The
                                                latter do not: entered filenames don't work, non-ASCII text in the
                                                titlebar shows <ab> hex values.

                                                --
                                                Glenn Maynard
                                              • Camillo Särs
                                                 ... Floppies, CDs and network file systems are all mounted by Windows, and some translation of file names happens. AFAIK, you should be able to access all
                                                Message 23 of 29 , Oct 13, 2003
                                                • 0 Attachment
                                                  Bram Moolenaar wrote:
                                                  > A file name may appear in a file (e.g., a list of files in a README
                                                  > file). And I don't know what happens with file names on removable media
                                                  > (e.g., a CD). Probably depends on the file system it contains. And
                                                  > networked file systems is another problem.

                                                   Floppies, CDs and network file systems are all mounted by Windows, and
                                                  "some" translation of file names happens. AFAIK, you should be able to
                                                  access all files on such file systems using WindowsNT naming conventions.
                                                  The file names may not be exactly what you anticipated, but they are
                                                  guaranteed to stay constant.

                                                  > We need to locate places where the encoding is different from what a
                                                  > system function expects. There are still a few things that need to be
                                                  > fixed.

                                                  Yup. As I'm not familiar with the vim sources, I don't know how much work
                                                  this would mean in reality. However, the set of functions is or should be
                                                  known, and fairly limited.

                                                  > When 'encoding' is not the active codepage we could either leave
                                                  > the file name untranslated (as it's now) or convert it to Unicode.
                                                  > Don't know which one would work best...

                                                   Me neither. But I think that a conversion to Unicode should be "fairly"
                                                   straightforward, as it is what NT does natively anyway. This leads me to
                                                  think that Vim should do the conversion, as it knows the encoding. Or
                                                  let's say, it thinks it knows it. :)

                                                  Cheers,
                                                  Camillo
                                                  --
                                                  Camillo Särs <+ged+@...> ** Aim for the impossible and you
                                                  <http://www.iki.fi/+ged> ** will achieve the improbable.
                                                  PGP public key available **
                                                • Bram Moolenaar
                                                  ... So, what you suggest is to keep using the ordinary file system functions. But we must make sure that the file name is then in the active codepage
                                                  Message 24 of 29 , Oct 14, 2003
                                                  • 0 Attachment
                                                    Glenn Maynard wrote:

                                                    > On Mon, Oct 13, 2003 at 02:25:04PM +0200, Bram Moolenaar wrote:
                                                    > > Because every fopen(), stat() etc. will have to be changed.
                                                    >
                                                    > I don't think handling Unicode in filenames is worth it in Windows. It
                                                    > takes so much work that the only applications I know of that support it
                                                    > are ones that are compiled as native Unicode apps. The only exception
                                                    > I've seen is FB2k.
                                                    >
                                                    > It's certainly useful to be able to have multilingual filenames, but
                                                    > Windows makes it so hard that people really wanting to do that probably
                                                    > need a new OS.

                                                    So, what you suggest is to keep using the ordinary file system
                                                    functions. But we must make sure that the file name is then in the
                                                    active codepage encoding. When obtaining the file name with a system
                                                    function (e.g., a directory listing or file browser) it will already be
                                                    in that encoding. But when the user types a file name it's in the
                                                    encoding specified with 'encoding'. This means we would need to convert
                                                    the file name from 'encoding' to the active codepage at some point.
                                                    And the reverse conversion is needed when using a filename as a text
                                                    string, e.g., for "%p and in the window title.

                                                    This is still complicated, but probably requires less changes than using
                                                    Unicode functions for all file access. I only foresee trouble when
                                                    'encoding' is set to a non-Unicode codepage different from the active
                                                    codepage and using a filename that contains non-ASCII characters.
                                                    Perhaps this situation is too weird to take into account?
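
                                                     A rough sketch of that conversion step (a hypothetical helper, not
                                                     existing Vim code; it assumes 'encoding' is UTF-8 and goes through
                                                     UTF-16 so that WideCharToMultiByte can report when the active
                                                     codepage cannot represent a character):

                                                     #include <windows.h>

                                                     /* Convert a UTF-8 file name to the active codepage.
                                                      * Returns 0 on success, -1 when the name is invalid UTF-8
                                                      * or cannot be represented in the ACP without loss. */
                                                     static int utf8_to_acp(const char *utf8, char *acp, int acplen)
                                                     {
                                                         WCHAR wbuf[MAX_PATH];
                                                         BOOL lossy = FALSE;

                                                         if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1,
                                                                                 wbuf, MAX_PATH) == 0)
                                                             return -1;
                                                         if (WideCharToMultiByte(CP_ACP, 0, wbuf, -1, acp, acplen,
                                                                                 NULL, &lossy) == 0 || lossy)
                                                             return -1;
                                                         return 0;
                                                     }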

                                                    > > I don't see why. You can use a file selector to open any file and write
                                                    > > it back under the same name. Vim doesn't need to know the encoding of
                                                    > > the filename that way.
                                                    >
                                                    > Consider the case where a filename in NT contains illegal data, eg. an
                                                    > invalid two-byte SJIS sequence. When you call NT ANSI system calls, it
                                                    > converts the buffers you pass it to WCHAR. That conversion would fail.
                                                    >
                                                    > Are you worried about not being able to open files off eg. a slightly
                                                    > corrupt/malformed floppy disc containing filenames that won't convert
                                                    > cleanly? That seems no worse than not being able to use non-ACP
                                                    > filenames. If that works, it seems a poor trade for not being able to
                                                     > enter non-ASCII filenames in utf-8. ":w 漢字.txt"
                                                     > responding with '"漢字.txt" [New]' and writing the
                                                    > filename correctly seems pretty fundamental, for Japanese users on
                                                    > Japanese systems, and that doesn't work with enc=utf-8.

                                                    Yep, using conversions means failure is possible. And failure mostly
                                                    means the text is in a different encoding than expected. It would take
                                                    some time to figure out how to do this in a way that the user isn't
                                                    confused.

                                                    --
                                                    hundred-and-one symptoms of being an internet addict:
                                                    210. When you get a divorce, you don't care about who gets the children,
                                                    but discuss endlessly who can use the email address.

                                                    /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                                    /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                                                    \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                                    \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                                                  • Camillo Särs
                                                     ... While that may sound attractive at first, I would strongly advise against that solution. I consider it to be a myth that using multilingual filenames on
                                                    Message 25 of 29 , Oct 14, 2003
                                                    • 0 Attachment
                                                      Bram Moolenaar wrote:
                                                      > Glenn Maynard wrote:
                                                      >>It's certainly useful to be able to have multilingual filenames, but
                                                      >>Windows makes it so hard that people really wanting to do that probably
                                                      >>need a new OS.
                                                      >
                                                      > So, what you suggest is to keep using the ordinary file system
                                                      > functions. But we must make sure that the file name is then in the
                                                      > active codepage encoding.

                                                       While that may sound attractive at first, I would strongly advise against
                                                       that solution. I consider it to be a myth that using multilingual
                                                       filenames on Windows is hard. Under NT, it should be a breeze for any
                                                      application that is even slightly Unicode-aware. When you decide to make
                                                      changes in Vim, it makes sense to look to the future and try to go the
                                                      "Unicode" way. XP Home Edition is gaining ground - fast.

                                                      Win9x is a mess, because it's just a version of DOS on hormones, and thus
                                                      is solidly entrenched in the single code page per application world. Using
                                                      the current code page should suffice there, though.

                                                      > This is still complicated, but probably requires less changes than using
                                                      > Unicode functions for all file access.

                                                      Why? I don't get it. You don't need to use Unicode functions for anything
                                                      except stuff that accepts strings. The current implementation is wrong,
                                                      because it feeds "encoding" text to ANSI functions. If you change it, I
                                                      don't see why doing a conversion to Unicode would be any different than a
                                                       conversion to ANSI, other than the fact that converting to ANSI is riskier.

                                                      <http://www.microsoft.com/globaldev/> contains a lot of useful info. Quote:

                                                      "All Win32 APIs that take a text argument either as an input or output
                                                      variable have been provided with a generic function prototype and two
                                                      definitions: a version that is based on code pages or ANSI (called "A") to
                                                       handle code page-based text argument and a wide version (called "W") to
                                                      handle Unicode."
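
                                                       Concretely, a tiny sketch -- both functions exist, the file names are
                                                       just examples:

                                                       #include <windows.h>

                                                       /* The same operation through both halves of an A/W pair: the W
                                                        * call takes the name as UTF-16, the A call as ACP bytes. */
                                                       void demo(void)
                                                       {
                                                           DeleteFileW(L"\u6f22\u5b57.txt");  /* Unicode name (Kanji) */
                                                           DeleteFileA("kanji.txt");          /* codepage-based name */
                                                       }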

                                                       For 9x, you might be interested in the "Microsoft Layer for Unicode".

                                                      > I only foresee trouble when 'encoding' is set to a non-Unicode
                                                      > codepage different from the active codepage and using
                                                      > a filename that contains non-ASCII characters.
                                                      > Perhaps this situation is too weird to take into account?

                                                      As long as you know the correct code page, you can use Windows APIs to
                                                      convert correctly. They take the code page as an argument.

                                                      Camillo
                                                      --
                                                      Camillo Särs <+ged+@...> ** Aim for the impossible and you
                                                      <http://www.iki.fi/+ged> ** will achieve the improbable.
                                                      PGP public key available **
                                                    • Bram Moolenaar
                                                       ... Vim not only supports Unicode but also many other encodings. If Vim used only Unicode it would be simple, but that's not the situation. And above
                                                      Message 26 of 29 , Oct 14, 2003
                                                      • 0 Attachment
                                                        Camillo wrote:

                                                         > While that may sound attractive at first, I would strongly advise against
                                                         > that solution. I consider it to be a myth that using multilingual
                                                         > filenames on Windows is hard. Under NT, it should be a breeze for any
                                                        > application that is even slightly Unicode-aware. When you decide to make
                                                        > changes in Vim, it makes sense to look to the future and try to go the
                                                        > "Unicode" way. XP Home Edition is gaining ground - fast.

                                                         Vim not only supports Unicode but also many other encodings. If Vim
                                                         used only Unicode it would be simple, but that's not the situation.
                                                        And above that, Vim is also used on many other systems, and we try to
                                                        make it work the same way everywhere.

                                                        > > This is still complicated, but probably requires less changes than using
                                                        > > Unicode functions for all file access.
                                                        >
                                                        > Why? I don't get it. You don't need to use Unicode functions for anything
                                                        > except stuff that accepts strings. The current implementation is wrong,
                                                        > because it feeds "encoding" text to ANSI functions. If you change it, I
                                                        > don't see why doing a conversion to Unicode would be any different than a
                                                        > conversion to ANSI, other than the fact than converting to ANSI is riskier.
                                                        >
                                                        > <http://www.microsoft.com/globaldev/> contains a lot of useful info. Quote:
                                                        >
                                                        > "All Win32 APIs that take a text argument either as an input or output
                                                        > variable have been provided with a generic function prototype and two
                                                        > definitions: a version that is based on code pages or ANSI (called "A") to
                                                         > handle code page-based text argument and a wide version (called "W") to
                                                        > handle Unicode."

                                                        Eh, what happens when I use fopen() or stat()? There is no ANSI or wide
                                                        version of these functions. And certainly not one that also works on
                                                        non-Win32 systems. And when using the wide version conversion needs to
                                                        be done from 'encoding' to Unicode, thus the conversion has to be there
                                                        as well. That's going to be a lot of work (many #ifdefs) and will
                                                        probably introduce new bugs.

                                                         > For 9x, you might be interested in the "Microsoft Layer for Unicode".
                                                        >
                                                        > > I only foresee trouble when 'encoding' is set to a non-Unicode
                                                        > > codepage different from the active codepage and using
                                                        > > a filename that contains non-ASCII characters.
                                                        > > Perhaps this situation is too weird to take into account?
                                                        >
                                                        > As long as you know the correct code page, you can use Windows APIs to
                                                        > convert correctly. They take the code page as an argument.

                                                        As mentioned before, we are not always sure what encoding the text has.
                                                        Conversion is then likely to fail. This especially happens for 8-bit
                                                         encodings; there is no way to automatically check what encoding these
                                                        files are.

                                                        I think we need a smart solution that doesn't attempt to handle all
                                                        situations but works predictably.

                                                        --
                                                        hundred-and-one symptoms of being an internet addict:
                                                        218. Your spouse hands you a gift wrapped magnet with your PC's name
                                                        on it and you accuse him or her of genocide.

                                                        /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                                        /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                                                        \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                                        \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
                                                      • Glenn Maynard
                                                         ... It's not at all a myth if you want code that is 1: portable and 2: works on 9x, too. (If you can deal with nonportable code, you can use Windows's TCHAR
                                                        Message 27 of 29 , Oct 14, 2003
                                                        • 0 Attachment
                                                           > While that may sound attractive at first, I would strongly advise against
                                                           > that solution. I consider it to be a myth that using multilingual
                                                           > filenames on Windows is hard. Under NT, it should be a breeze for any

                                                          It's not at all a myth if you want code that is 1: portable and 2: works
                                                          on 9x, too. (If you can deal with nonportable code, you can use Windows's
                                                          TCHAR mechanism, and if you don't care about anything but NT, you can write
                                                          a UTF-16-only app. Neither of these are the case here, though.)

                                                          It's not "hard", it's just "incredibly annoying".

                                                          On Tue, Oct 14, 2003 at 02:20:27PM +0200, Bram Moolenaar wrote:
                                                          > This is still complicated, but probably requires less changes than using
                                                          > Unicode functions for all file access. I only foresee trouble when
                                                          > 'encoding' is set to a non-Unicode codepage different from the active
                                                          > codepage and using a filename that contains non-ASCII characters.
                                                          > Perhaps this situation is too weird to take into account?

                                                          If "encoding" is not the ACP codepage, then the main problem is that the
                                                          user can enter characters that Vim simply can't put into a filename
                                                          (and in 9x, that the system can't, either).

                                                          I'd just do a conversion, and if the conversion fails, warn appropriately.

                                                          > Eh, what happens when I use fopen() or stat()? There is no ANSI or wide
                                                          > version of these functions. And certainly not one that also works on
                                                          > non-Win32 systems. And when using the wide version conversion needs to
                                                          > be done from 'encoding' to Unicode, thus the conversion has to be there
                                                          > as well. That's going to be a lot of work (many #ifdefs) and will
                                                          > probably introduce new bugs.

                                                          It's not that much work. Windows has _wfopen and _wstat. Vim already
                                                          has those abstracted (mch_fopen, mch_stat), so conversions would only
                                                          happen in one place (and in a place that's intended to be platform-
                                                          specific, mch_*). I believe the code I linked earlier did exactly this.

                                                          The only thing needed is sane error recovery.
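
                                                           A sketch of what that could look like (illustrative, not the actual
                                                           os_win32.c code; enc_utf8 is assumed here to be Vim's existing flag
                                                           for 'encoding' being utf-8):

                                                           #include <windows.h>
                                                           #include <stdio.h>

                                                           extern int enc_utf8;        /* nonzero when 'encoding' is utf-8 */

                                                           FILE *mch_fopen(const char *name, const char *mode)
                                                           {
                                                               if (enc_utf8)
                                                               {
                                                                   WCHAR wname[MAX_PATH], wmode[8];

                                                                   if (MultiByteToWideChar(CP_UTF8, 0, name, -1,
                                                                                           wname, MAX_PATH) > 0
                                                                           && MultiByteToWideChar(CP_UTF8, 0, mode, -1,
                                                                                                  wmode, 8) > 0)
                                                                       return _wfopen(wname, wmode);
                                                                   /* Conversion failed: fall through to the ANSI call,
                                                                    * so behaviour is no worse than today. */
                                                               }
                                                               return fopen(name, mode);
                                                           }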

                                                          > Yep, using conversions means failure is possible. And failure mostly
                                                          > means the text is in a different encoding than expected. It would take
                                                          > some time to figure out how to do this in a way that the user isn't
                                                          > confused.

                                                          Well, bear in mind the non-ACP case that already exists. If I create
                                                          "foo ♡.txt", and try to edit it with Vim, it edits "foo ?.txt" (which
                                                          it can't write, either, since "?" is an invalid character in Windows
                                                          filenames). I'd suggest that editing a file with an invalid character
                                                          (eg. invalid SJIS sequence) behave identically to editing a file with
                                                          a valid character that can't be referenced (eg. "foo ♡.txt").
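
                                                           Strict validation at conversion time would make the two cases fail
                                                           the same way (a sketch; MB_ERR_INVALID_CHARS is the real flag that
                                                           makes MultiByteToWideChar reject malformed input instead of
                                                           guessing, and 932 would be the Japanese ACP, as an example):

                                                           #include <windows.h>

                                                           /* Nonzero when "name" is valid in codepage "cp" and therefore
                                                            * convertible to UTF-16; a NULL output buffer just queries the
                                                            * required length. */
                                                           static int filename_convertible(UINT cp, const char *name)
                                                           {
                                                               return MultiByteToWideChar(cp, MB_ERR_INVALID_CHARS,
                                                                                          name, -1, NULL, 0) > 0;
                                                           }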

                                                          --
                                                          Glenn Maynard
                                                        • Camillo Särs
                                                           ... Agreed. There's no way around that. ... Sounds very promising. It would be really great if it turns out that the changes are fairly minor. That way
                                                          Message 28 of 29 , Oct 14, 2003
                                                          • 0 Attachment
                                                            Glenn Maynard wrote:
                                                            > If "encoding" is not the ACP codepage, then the main problem is that the
                                                            > user can enter characters that Vim simply can't put into a filename
                                                            > (and in 9x, that the system can't, either).
                                                            >
                                                            > I'd just do a conversion, and if the conversion fails, warn appropriately.

                                                            Agreed. There's no way around that.

                                                            > It's not that much work. Windows has _wfopen and _wstat. Vim already
                                                            > has those abstracted (mch_fopen, mch_stat), so conversions would only
                                                            > happen in one place (and in a place that's intended to be platform-
                                                            > specific, mch_*). I believe the code I linked earlier did exactly this.
                                                            >
                                                            > The only thing needed is sane error recovery.

                                                            Sounds very promising. It would be really great if it turns out that the
                                                            changes are fairly minor. That way there's a chance they would get
                                                            implemented. :)

                                                            If you decide to try the proposed changes out, I'm prepared to do some
                                                            testing on a Win32 binary build. Sorry, can't build myself. :(

                                                            Camillo
                                                            --
                                                            Camillo Särs <+ged+@...> ** Aim for the impossible and you
                                                            <http://www.iki.fi/+ged> ** will achieve the improbable.
                                                            PGP public key available **
                                                          • Bram Moolenaar
                                                             ... It's more complicated than that. You can have filenames in the ACP, encoding and Unicode. Filenames are stored in various places inside Vim, which
                                                            Message 29 of 29 , Oct 15, 2003
                                                            • 0 Attachment
                                                              Glenn Maynard wrote:

                                                              > On Tue, Oct 14, 2003 at 02:20:27PM +0200, Bram Moolenaar wrote:
                                                              > > This is still complicated, but probably requires less changes than using
                                                              > > Unicode functions for all file access. I only foresee trouble when
                                                              > > 'encoding' is set to a non-Unicode codepage different from the active
                                                              > > codepage and using a filename that contains non-ASCII characters.
                                                              > > Perhaps this situation is too weird to take into account?
                                                              >
                                                              > If "encoding" is not the ACP codepage, then the main problem is that the
                                                              > user can enter characters that Vim simply can't put into a filename
                                                              > (and in 9x, that the system can't, either).
                                                              >
                                                              > I'd just do a conversion, and if the conversion fails, warn appropriately.

                                                               It's more complicated than that. You can have filenames in the ACP,
                                                              'encoding' and Unicode. Filenames are stored in various places inside
                                                              Vim, which encoding is used for each of them? Obviously, a filename
                                                              stored in buffer text and registers has to use 'encoding'.

                                                              It's less obvious what to use for internal structures, such as
                                                               curbuf->b_ffname. When 'encoding' is a Unicode encoding we can use
                                                               UTF-8, which can be converted to anything else. That also works when the
                                                               active codepage is not Unicode; we can use the wide functions then.

                                                               When 'encoding' is the active codepage (this is the default, so it should
                                                               happen a lot), we can use the active codepage. That avoids conversions
                                                              (which may fail). No need to use wide functions then.

                                                              The real problem is when 'encoding' is not the active codepage and it's
                                                              also not a Unicode encoding. We could simply skip the conversion then.
                                                              That doesn't work properly for non-ASCII characters, but it's how it
                                                              already works right now. The right way would be to convert the file
                                                              name to Unicode and use the wide functions.

                                                              I guess this means all filenames inside Vim are in 'encoding'. Where
                                                              needed, conversion needs to be done from/to Unicode and the wide
                                                              functions are to be used then.

                                                              The main thing to implement now is using the wide functions when
                                                              'encoding' is UTF-8. This only requires a simple conversion between
                                                               UTF-8 and UTF-16. I'll be waiting for a patch...
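
                                                               For what it's worth, that conversion could be as small as this (a
                                                               hypothetical helper, not a tested patch; two-pass size query, the
                                                               caller frees the result):

                                                               #include <windows.h>
                                                               #include <stdlib.h>

                                                               /* Convert a NUL-terminated UTF-8 string to newly allocated
                                                                * UTF-16, or return NULL when the input is invalid. */
                                                               static WCHAR *utf8_to_utf16(const char *utf8)
                                                               {
                                                                   WCHAR *w;
                                                                   int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);

                                                                   if (len <= 0)
                                                                       return NULL;
                                                                   w = (WCHAR *)malloc(len * sizeof(WCHAR));
                                                                   if (w != NULL)
                                                                       MultiByteToWideChar(CP_UTF8, 0, utf8, -1, w, len);
                                                                   return w;
                                                               }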

                                                              --
                                                              hundred-and-one symptoms of being an internet addict:
                                                              231. You sprinkle Carpet Fresh on the rugs and put your vacuum cleaner
                                                              in the front doorway permanently so it always looks like you are
                                                              actually attempting to do something about that mess that has amassed
                                                              since you discovered the Internet.

                                                              /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                                                              /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                                                              \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                                                              \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///