996 Re: Filename encodings under Win32
Oct 12, 2003

Glenn Maynard <glenn@...> wrote:
> On Mon, Oct 13, 2003 at 02:41:25AM +0200, Tony Mechelynck wrote:
> > Trivial or not, my opinion is that handling files and keypresses as
> > per the locale shouldn't be a "fix", it should be the (program)
> > default. The "minor fix" consists of making Unicode the (user's)
> > default by means of a config setting; but see below about that.
> My suggestion was that these be the default settings in Windows, not
> be settings that the user has to fix.

I understood you as meaning that the program-default setting should be
Unicode. I beg to differ, however. Or maybe I misunderstood what you were
saying. And whatever the program-default settings, Vim should (IMHO) work in
as constant a manner as possible across all platforms.
> > Sorry, but it is. AFAIK, leaving 'termencoding' empty when switching
> > 'encoding' over from something else to Unicode produces
> > dysfunctions in the keyboard for all users whose actual keyboard
> > encoding is other than 7-bit ASCII -- roughly speaking, for all
> > users with a keyboard for a language other than English (even
> > Dutchmen like Bram need, as a minimum, the "lowercase e with
> > diaeresis", which is over 128, and therefore receives a different
> > representation in UTF-8 and in other encodings -- the codepoint
> > number may be the same but it is not represented identically).
> > That's why the lines
> This sounds like a bug. The input from Windows is always in the
> system encoding (ACP) or Unicode. So, either termencoding should be
> or (if someone actually has a real use for changing it in Windows) it
> should default to the appropriate codepage, as I suggested.

It doesn't sound like a bug to me, but like a misunderstanding between Windows
and Vim, as they suddenly aren't "speaking the same language" anymore. Let's
spell out what I mean with an example:
Let's say I press a "lowercase e with acute accent" (by far the most
frequent accented letter in French, my mother language). On my keyboard it's
the unshifted 2 key above the alphabet keys, but that doesn't matter much.
Under (let's say) latin1 locale, Windows makes the byte 0xE9 available to
gvim. The latter (in Insert mode and with latin1 'encoding') writes an
e-acute into the buffer I'm currently editing. This is correct behaviour.
Now let's say I change 'encoding' to "utf-8". With 'termencoding' left empty
(the default), gvim now suddenly expects the keyboard to be sending UTF-8
byte sequences (because an empty 'termencoding' means it takes the same
value as whatever is the current value of 'encoding'). Windows, however, is
not aware of any changes. It still sends 0xE9 for e-acute. Vim sees this,
and since it is a valid header byte for a 3-byte UTF-8 sequence, it expects
2 bytes in the range 0x80-0xBF following it. When they are not forthcoming,
Vim puts the 0xE9 in the buffer, interprets it as invalid, and displays it
However, if I take the precaution of first saving the older 'encoding' in
'termencoding', then I may change 'encoding' to UTF-8 with no ill effects:
gvim still expects latin1 from the keyboard, and when it reads 0xE9, it
correctly interprets it as e-acute, and represents it internally as the
UTF-8 byte sequence 0xC3 0xA9, which represents the codepoint U+00E9 "LATIN
SMALL E WITH ACUTE".
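
The byte-level behaviour described above can be checked outside Vim. What follows is only a minimal Python sketch, not Vim's actual input code, and the variable names are illustrative assumptions. It shows that 0xE9 is a valid lead byte of a 3-byte UTF-8 sequence but not a complete sequence by itself, and that reading it as latin1 first yields the UTF-8 bytes 0xC3 0xA9.

```python
# Byte that Windows delivers for e-acute under a latin1 keyboard encoding.
key_byte = b"\xe9"

# 0xE9 is 1110 1001 in binary: the 1110xxxx prefix marks the lead byte
# of a 3-byte UTF-8 sequence, so two continuation bytes (0x80-0xBF)
# would have to follow.
is_three_byte_lead = (key_byte[0] & 0xF0) == 0xE0

# Reading the lone byte as UTF-8 (empty 'termencoding' while 'encoding'
# is utf-8) fails because the sequence is incomplete.
try:
    key_byte.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Reading it as latin1 ('termencoding' set to latin1) gives U+00E9,
# which is then stored internally as the UTF-8 bytes 0xC3 0xA9.
char = key_byte.decode("latin-1")
internal = char.encode("utf-8")

print(is_three_byte_lead, decodes_as_utf8, hex(ord(char)), internal.hex())
```

The two decode calls stand in for the two possible values of 'termencoding'; only the second one round-trips the keypress correctly.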
Note: My W98 system can set a variety of "national keyboards" -- I can even
type Arabic in WordPad -- but they're a hassle because there is no
correspondence between what is printed on the keys of my Belgian AZERTY
keyboard and what those "national keyboards" send. At least, with Vim's
keymaps, I can design any number of keymaps to suit me, and, for instance,
map the Russian deh or the Arabic daal to the Latin D key, which makes sense
to me but does not necessarily correspond to where Russian or Arabic people
expect their D key to be. AFAIK I cannot choose Unicode as the "national
keyboard" (and, in fact, I don't need to, since it's easier for me to keep
Windows set to French language with Belgian AZERTY keyboard, and let gvim
handle non-Latin encodings by means of keymaps, digraphs, and/or the
> > code, with (AFAIK) no possibility of repair in mainline Vim (which
> > hasn't got the getacp() function -- and don't talk to me about a
> > patch, I don't want to use other than standard binaries; for one
> > thing, I don't have a
> Um, the entire purpose of a patch is for it to be integrated into
> mainline Vim.
> However, the "code" I showed was just to demonstrate what I believe
> the defaults should look like. They'd actually be set in the source,
> not as Vim commands. The "getacp()" call only makes it *possible* to
> do that with Vim commands (which is useful itself).

It may be useful in itself; but until and unless it is indeed (as you
suggest) incorporated in mainline Vim source (a possibility towards which
I'm not averse as long as it doesn't break something else), it "doesn't
exist" from where I sit.
> > Users who only edit files in a single 8 bit encoding don't need to
> > bother about Unicode. For others, it is a useful choice, but I
> > maintain that it should remain a choice, and, if the locale set in
> > the operating system is not a Unicode one, it should IMHO remain a
> > conscious choice (or at least a voluntary one, that need not stay
> > conscious once it has been written into the vimrc).
> Users, for the most part, don't care what the internal representation
> is. Many users don't even know what an encoding is (and shouldn't
> have to). I've seen little reason for UTF-8 to not eventually be the
> default internal encoding for Vim in Windows, once the remaining
> issues are
> The only interesting, fundamental reason I've seen is memory usage:
> UTF-8 uses more memory for many languages.

Indeed. The difference is virtually nil for English; it is small but nonzero
for other Latin-alphabet languages; it approaches 1 to 2 for other-alphabet
languages like Greek or Russian (a little less than that because of spaces,
commas, full stops, etc.); I don't know the ratio for languages like hindi
(with nagari script) or Chinese (hanzi).
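
These ratios can be estimated by encoding the same text both ways and comparing byte counts. The following is a rough Python sketch; the sample phrases and the choice of 8-bit code pages (latin1, cp1251, ISO 8859-7) are assumptions made for illustration, not measurements on real files.

```python
# Sample phrases with the 8-bit encoding each one is compared against.
# Both the phrases and the code pages are illustrative assumptions.
samples = {
    "English": ("file name", "latin-1"),
    "French": ("le système est prêt", "latin-1"),
    "Russian": ("кодировка файла", "cp1251"),
    "Greek": ("όνομα αρχείου", "iso8859_7"),
}


def utf8_ratio(text, legacy):
    """Return (size in UTF-8) / (size in the legacy 8-bit encoding)."""
    return len(text.encode("utf-8")) / len(text.encode(legacy))


ratios = {lang: utf8_ratio(text, enc) for lang, (text, enc) in samples.items()}
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.2f}")
```

Spaces and punctuation stay at one byte in UTF-8, which is why the Russian and Greek figures come out a little under 2.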
> > UTF-8 is fully supported (well, almost fully: characterwise
> > bidirectionality, a Unicode property, isn't supported) internally by
> Not quite. It won't convert from UTF-8 to the ACP or Unicode when
> calling Windows API functions. For example, if I open files with
> kanji in the filename and enc=utf-8, the title bar has <12><34>
> in it. Minimally, this should convert the string to CP932.
> In any case, I'm not about to crusade for this. I'm mostly
> interested in seeing the bugs where functionality is broken when
> enc=utf-8 be fixed, such as the title bar issue. I'd like to be able
> to say "use enc=utf-8 internally and it'll fix your problems", which I
> can't--because it introduces new ones.
> Glenn Maynard

I see. My script won't fix the problems caused by kanji in filenames
(personally I tend to shy away from anything other than us-ascii in
filenames anyway; I have, however, some e-acutes in filenames automatically
generated by Windows) but if you look at it, you'll see that it will make
Unicode use easier (with, IMHO, little hassle and good transparency) for the
average user of currently existing out-of-the-box multibyte versions of Vim.
Having kanji in filenames display correctly on the titlebar (and, why not,
on the status bar too) should be a separate fix, which ought to have no
(positive or negative) influence on the workings of my script.
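
For reference, the conversion asked for in the title bar case (UTF-8 internally, CP932 when handing the string to Windows on a Japanese system) amounts to a decode followed by a re-encode. Here is a minimal Python sketch of that step only; the filename and the helper name to_codepage are hypothetical, and this is not how Vim itself is implemented.

```python
# Hypothetical filename containing kanji, held as UTF-8 bytes
# (as it would be internally with enc=utf-8).
internal_utf8 = "漢字.txt".encode("utf-8")


def to_codepage(utf8_bytes, codepage):
    """Re-encode UTF-8 bytes into the given code page.

    Characters the code page cannot represent are replaced with '?'
    instead of raising an error.
    """
    return utf8_bytes.decode("utf-8").encode(codepage, errors="replace")


title_cp932 = to_codepage(internal_utf8, "cp932")
print(internal_utf8.hex(), title_cp932.hex())
```

Each kanji takes three bytes in UTF-8 and two in CP932, so passing the UTF-8 bytes through unconverted gives the system bytes it cannot interpret as the intended characters.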
By the way: what do you mean by ACP? The currently "active code page" maybe?
Hm. Your "kanji in filenames" issue makes me think: could that be related to
the fact that my Netscape 7 cannot properly handle Cyrillic letters between
<title></title> HTML tags (what sits there displays on the title bar, and
anything out-of-the-way is accepted but doesn't display properly, IIRC not
even with a <meta> tag specifying that the page is in UTF-8) but can show
them with no problems in body text, for instance between <H1></H1> (where
the title could appear again, this time to be displayed on top of the text
inside the browser window)? But this paragraph may be drifting off-topic.