
996 Re: Filename encodings under Win32

  • Tony Mechelynck
    Oct 12, 2003
      Glenn Maynard <glenn@...> wrote:
      > On Mon, Oct 13, 2003 at 02:41:25AM +0200, Tony Mechelynck wrote:
      > > Trivial or not, my opinion is that handling files and keypresses as
      > > per the locale shouldn't be a "fix", it should be the (program)
      > > default. The "minor fix" consists of making Unicode the (user's)
      > > default by means of a config setting; but see below about that.
      > My suggestion was that these be the default settings in Windows, not
      > be settings that the user has to fix.

      I understood you as meaning that the program-default setting should be
      Unicode. I beg to differ, however. Or maybe I misunderstood what you were
      saying. And whatever the program-default settings, Vim should (IMHO) work in
      as constant a manner as possible across all platforms.
      > > Sorry, but it is. AFAIK, leaving 'termencoding' empty when switching
      > > 'encoding' over from something else to Unicode produces
      > > dysfunctions in the keyboard for all users whose actual keyboard
      > > encoding is other than 7-bit ASCII -- roughly speaking, for all
      > > users with a keyboard for a language other than English (even
      > > Dutchmen like Bram need, as a minimum, the "lowercase e with
      > > diaeresis", which is over 128, and therefore receives a different
      > > representation in UTF-8 and in other encodings -- the codepoint
      > > number may be the same but it is not represented identically).
      > > That's why the lines
      > This sounds like a bug. The input from Windows is always in the
      > system encoding (ACP) or Unicode. So, either termencoding should be
      > ignored,
      > or (if someone actually has a real use for changing it in Windows) it
      > should default to the appropriate codepage, as I suggested.

      It doesn't sound like a bug to me, but like a misunderstanding between
      Windows and Vim, as they suddenly aren't "speaking the same language"
      anymore. Let's spell out what I mean with an example:

      Let's say I press a "lowercase e with acute accent" (by far the most
      frequent accented letter in French, my mother language). On my keyboard it's
      the unshifted 2 key above the alphabet keys, but that doesn't matter much.
      Under (let's say) latin1 locale, Windows makes the byte 0xE9 available to
      gvim. The latter (in Insert mode and with latin1 'encoding') writes an
      e-acute into the buffer I'm currently editing. This is correct behaviour.

      Now let's say I change 'encoding' to "utf-8". With 'termencoding' left empty
      (the default), gvim now suddenly expects the keyboard to be sending UTF-8
      byte sequences (because an empty 'termencoding' means it takes the same
      value as whatever is the current value of 'encoding'). Windows, however, is
      not aware of any changes. It still sends 0xE9 for e-acute. Vim sees this,
      and since it is a valid header byte for a 3-byte UTF-8 sequence, it expects
      2 bytes in the range 0x80-0xBF following it. When they are not forthcoming,
      Vim puts the 0xE9 in the buffer, interprets it as invalid, and displays it
      as <E9>.
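
      The byte-level situation described above can be checked with a short
      Python sketch. It only mirrors the UTF-8 rules, not Vim's actual input
      code; the helper name utf8_lead_length is hypothetical and the "key"
      variable merely stands in for what the keyboard delivers.

```python
def utf8_lead_length(byte: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte,
    or 0 if the byte cannot start a sequence."""
    if byte < 0x80:
        return 1          # 0xxxxxxx: plain ASCII
    if 0xC2 <= byte <= 0xDF:
        return 2          # 110xxxxx: one continuation byte expected
    if 0xE0 <= byte <= 0xEF:
        return 3          # 1110xxxx: two continuation bytes expected
    if 0xF0 <= byte <= 0xF4:
        return 4          # 11110xxx: three continuation bytes expected
    return 0              # continuation byte or invalid lead byte


# Assumption: the keyboard delivers e-acute as the single latin1 byte 0xE9.
key = b"\xe9"

# 0xE9 announces a 3-byte sequence, so two more bytes are expected.
expected_length = utf8_lead_length(key[0])

# Decoding the lone byte as UTF-8 fails because those bytes never arrive.
try:
    key.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(expected_length, decoded_ok)
```

      A strict decoder rejects the lone byte outright; Vim, as described
      above, keeps it in the buffer and shows it as <E9> instead.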

      However, if I take the precaution of first saving the older 'encoding' in
      'termencoding', then I may change 'encoding' to UTF-8 with no ill effects:
      gvim still expects latin1 from the keyboard, and when it reads 0xE9, it
      correctly interprets it as e-acute, and represents it internally as the
      UTF-8 byte sequence 0xC3 0xA9, which represents the codepoint U+00E9 "LATIN
      SMALL LETTER E WITH ACUTE".
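
      The conversion that takes place once 'termencoding' holds the old
      value can be sketched in Python as a decode followed by an encode. This
      illustrates the principle only, assuming a latin1 keyboard; the
      variable names are hypothetical and nothing here is Vim's own code.

```python
# Assumption: 'termencoding' is latin1 and 'encoding' is utf-8.
term_encoding = "latin-1"
internal_encoding = "utf-8"

raw_key = b"\xe9"                            # byte delivered by the keyboard
char = raw_key.decode(term_encoding)         # interpreted per 'termencoding'
internal = char.encode(internal_encoding)    # stored per 'encoding'

print(hex(ord(char)), internal.hex())
```

      The codepoint stays the same (0xE9) while its byte representation
      changes from one byte to two.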

      Note: My W98 system can set a variety of "national keyboards" -- I can even
      type Arabic in WordPad -- but they're a hassle because there is no
      correspondence between what is printed on the keys of my Belgian AZERTY
      keyboard and what those "national keyboards" send. At least, with Vim's
      keymaps, I can design any number of keymaps to suit me, and, for instance,
      map the Russian deh or the Arabic daal to the Latin D key, which makes sense
      to me but does not necessarily correspond to where Russian or Arabic people
      expect their D key to be. AFAIK I cannot choose Unicode as the "national
      keyboard" (and, in fact, I don't need to, since it's easier for me to keep
      Windows set to French language with Belgian AZERTY keyboard, and let gvim
      handle non-Latin encodings by means of keymaps, digraphs, and/or the
      i_CTRL-V_digit capability).
      > > code, with (AFAIK) no possibility of repair in mainline Vim (which
      > > hasn't got the getacp() function -- and don't talk to me about a
      > > patch, I don't want to use other than standard binaries; for one
      > > thing, I don't have a
      > Um, the entire purpose of a patch is for it to be integrated into
      > mainline Vim.
      > However, the "code" I showed was just to demonstrate what I believe
      > the defaults should look like. They'd actually be set in the source,
      > not as
      > Vim commands. The "getacp()" call only makes it *possible* to do that
      > with Vim commands (which is useful itself).

      It may be useful in itself; but until and unless it is indeed (as you
      suggest) incorporated in mainline Vim source (a possibility to which I'm
      not averse as long as it doesn't break something else), it "doesn't
      exist" from where I sit.
      > > Users who only edit files in a single 8 bit encoding don't need to
      > > bother about Unicode. For others, it is a useful choice, but I
      > > maintain that it should remain a choice, and, if the locale set in
      > > the operating system is not a Unicode one, it should IMHO remain a
      > > conscious choice (or at least a voluntary one, that need not stay
      > > conscious once it has been written into the vimrc).
      > Users, for the most part, don't care what the internal representation
      > is. Many users don't even know what an encoding is (and shouldn't
      > have
      > to). I've seen little reason for UTF-8 to not eventually be the
      > default internal encoding for Vim in Windows, once the remaining
      > issues are
      > resolved.
      > The only interesting, fundamental reason I've seen is memory usage:
      > UTF-8 uses more memory for many languages.

      Indeed. The difference is virtually nil for English; it is small but nonzero
      for other Latin-alphabet languages; the size ratio approaches 1 to 2 for
      other-alphabet languages like Greek or Russian (a little less than that
      because of spaces, commas, full stops, etc.); I don't know the ratio for
      languages like Hindi (with Nagari script) or Chinese (hanzi).
      > > UTF-8 is fully supported (well, almost fully: characterwise
      > > bidirectionality, a Unicode property, isn't supported) internally by
      > Not quite. It won't convert from UTF-8 to the ACP or Unicode when
      > calling Windows API functions. For example, if I open files with
      > kanji in the filename and enc=utf-8, the title bar has <12><34>
      > garbage
      > in it. Minimally, this should convert the string to CP932.
      > In any case, I'm not about to crusade for this. I'm mostly
      > interested in seeing the bugs where functionality is broken when
      > enc=utf-8 be fixed,
      > such as the title bar issue. I'd like to be able to say "use
      > enc=utf-8 internally and it'll fix your problems", which I
      > can't--because it
      > introduces new ones.
      > --
      > Glenn Maynard

      I see. My script won't fix the problems caused by kanji in filenames
      (personally I tend to shy away from anything other than us-ascii in
      filenames anyway; I have, however, some e-acutes in filenames automatically
      generated by Windows) but if you look at it, you'll see that it will make
      Unicode use easier (with, IMHO, little hassle and good transparency) for the
      average user of currently existing out-of-the-box multibyte versions of Vim.
      Having kanji in filenames display correctly on the titlebar (and, why not,
      on the status bar too) should be a separate fix, which ought to have no
      (positive or negative) influence on the workings of my script.

      By the way: what do you mean by ACP? The currently "active code page" maybe?

      Hm. Your "kanji in filenames" issue makes me think: could that be related to
      the fact that my Netscape 7 cannot properly handle Cyrillic letters between
      <title></title> HTML tags (what sits there displays on the title bar, and
      anything out-of-the-way is accepted but doesn't display properly, IIRC not
      even with a <meta> tag specifying that the page is in UTF-8) but can show
      them with no problems in body text, for instance between <H1></H1> (where
      the title could appear again, this time to be displayed on top of the text
      inside the browser window)? But this paragraph may be drifting off-topic.

      Best regards,