Re: Filename encodings under Win32
Oct 13, 2003

Bram Moolenaar wrote:
> On Windows NT/XP there are also restrictions, especially when using
> non-NTFS filesystems.

Right, I forgot about those. AFAIK, the functions do not fail silently in
those cases, so it's just (yet) more work. Essentially, file names then
come from a restricted charset (code page limits).
> There was a discussion about this in the Linux UTF-8 maillist a
> long time ago. There was no good universal solution
> for handling filenames that they could come up with.

I bet. For many systems, the current behavior is adequate even if
technically speaking wrong. I'm not trying to propose a universal
solution, I'm just advocating the view that on win32, vim should do the
"windows thing" with unicode/utf-8.
> Vim could use Unicode functions for accessing files, but this will be a
> huge change.

Why so? The code earlier in this thread probably did much of what is
needed. It also involved numerous other changes, which I ignored. I'm not
being nosy, I'm just curious why this would be a "huge change". It's not
the file contents we are getting at, it's the filenames (and the GUI).

Also note that when using the native code page as the encoding (read:
latin1), using the ANSI functions does work as expected. So the fixes would
only need to concern the UTF-8 encoding, if you get picky. :)
> Requires lots of testing.

That's unicode for you. However, deriving a decent test set from
available unicode test files should be a fairly straightforward thing.
> Main problem is when 'encoding' is not a Unicode encoding, then conversions
> need to be done, which may fail.

But what I assume you are doing now is even worse, isn't it? Essentially
you are feeding some user-selected encoding to functions that require
ANSI characters. How's that for "a lot of testing"?

Conversions from almost any encoding to unicode should work. I would not
expect major trouble there. And note that if the conversion from the
encoding to unicode fails, I expect that the current usage would fail even
more severely. And there haven't been reports of that, have there?

There certainly are tricky encodings that could cause problems. However,
I'm mostly concerned with the basic use case of utf-8 and
"fileencodings=ucs-bom,utf-8,latin1", all under a code page of cp1252.
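The conversion chain I have in mind can be sketched as follows. This is not Vim code; it uses Python's codecs as a stand-in for the Win32 MultiByteToWideChar/WideCharToMultiByte functions, and assumes cp1252 as the active code page, as in my setup:

```python
# Sketch of the basic conversion chain: cp1252 bytes -> unicode -> utf-8
# and back.  Python codecs stand in for the Win32 conversion functions.

ansi_bytes = b"\xe5.txt"             # "å.txt" as cp1252 bytes

# cp1252 -> unicode: every byte the code page defines converts cleanly
name = ansi_bytes.decode("cp1252")
assert name == "\u00e5.txt"          # "å.txt"

# unicode -> utf-8: never fails; this is what 'encoding=utf-8' stores
utf8_bytes = name.encode("utf-8")
assert utf8_bytes == b"\xc3\xa5.txt"

# and back again, losslessly, for any character inside the code page
assert utf8_bytes.decode("utf-8").encode("cp1252") == ansi_bytes
```

For the basic cp1252 case, every step is a total, reversible mapping; that is why I don't expect major trouble there.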
> If you use filenames that cannot be represented in the active codepage,
> you probably have problems with other programs.

But I have filenames that can be represented in the active code page
(å.txt), but which get encoded into incompatible UTF-8 characters!
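To illustrate the failure mode (a sketch in Python, assuming cp1252 as the active code page): the filename is held internally as UTF-8 bytes, but an ANSI ("A") file function interprets those bytes as cp1252, so it sees a different name entirely:

```python
# "å.txt" stored internally as UTF-8: the å becomes two bytes
utf8_name = "å.txt".encode("utf-8")
assert utf8_name == b"\xc3\xa5.txt"

# An ANSI file API reads those same bytes as cp1252 and sees mojibake,
# so the file created/opened is not "å.txt" at all
seen_by_ansi_api = utf8_name.decode("cp1252")
assert seen_by_ansi_api == "\u00c3\u00a5.txt"   # "Ã¥.txt"
```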
> Thus sticking with the active codepage functions isn't too bad.

If it worked that way, but it doesn't. Setting "encoding=utf-8" changes
that behavior - only us-ascii is usable in filenames.
> But then Vim needs to convert from 'encoding' to the active codepage!

That would help most users, including me. But it would not be the
"ultimate" solution to unicode on win32, as it would still cause trouble
with characters outside the codepage. As I see it, the easiest fix is
actually using the unicode API, as there are fewer (or no) conversion
failures that way.
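The asymmetry is easy to demonstrate (again a Python sketch, with cp1252 assumed as the active code page): converting *to* unicode always succeeds, while converting *into* the code page fails for any character outside it, which is exactly where the "convert to the active codepage" approach breaks down:

```python
# A filename with characters outside cp1252 (Polish ż, ł)
name = "zażółć.txt"

# unicode -> utf-8, as a W-API path would use: always succeeds
assert len(name.encode("utf-8")) > len(name)

# unicode -> active code page, as an A-API path needs: fails here
try:
    name.encode("cp1252")
    converted = True
except UnicodeEncodeError:
    converted = False
assert not converted
```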
> The file names are handled as byte strings. Thus so long as you use the
> right bytes it should work. Problem is when you are typing/editing with
> a different encoding from the active codepage.

My point exactly! :)
> Why would 'termencoding' be "utf-8"? This won't work, unless you are
> using an xterm on MS-Windows.

Yeah, but that's what you get if you just blindly do "set encoding=utf-8".
Took me a while to figure that one out. I need to do "set
termencoding=cp1252" first, or use "let &termencoding = &encoding". Not
exactly transparent to non-experts.
> The default 'termencoding' is empty, which means 'encoding' is used.
> There is no better default.

On Windows, I'd say "detect the active code page" is the right choice.
> When you change 'encoding' you might have to change 'termencoding' as
> well, but this depends on your situation.

As noted above, that's the unintuitive behavior I was getting at. A
windows user, knowing that unicode is the native charset, does a "set
encoding=utf-8" and expects things to work. They don't, but depending on
the language, it may take a while before a non-ascii character is entered.
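A sketch of why the breakage shows up late: with 'encoding=utf-8' and 'termencoding' left empty, the keyboard still delivers cp1252 bytes, but they get interpreted as UTF-8. ASCII bytes are valid UTF-8, so everything seems fine until the first non-ascii key:

```python
# ASCII input: the cp1252 bytes happen to be valid UTF-8, so nothing breaks
assert b"hello.txt".decode("utf-8") == "hello.txt"

# First non-ASCII key: 0xE5 is "å" in cp1252, but an invalid,
# truncated lead byte in UTF-8
try:
    b"\xe5".decode("utf-8")
    ok = True
except UnicodeDecodeError:
    ok = False
assert not ok
```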
>> - The default fileencoding breaks when "going UTF-8"; most probably a
>> better behavior would be to default to the ACP always.
>
> 'fileencoding' is set when reading a file. Perhaps you mean
> 'fileencodings'? This one needs to be tweaked by the user, because it
> depends on what kind of files you edit. Main problem is that an ASCII
> file can be any encoding, Vim can't detect what it is, thus the user has
> to specify what he wants Vim to do with it.

Yes, I was unclear. Let me elaborate, although this point is rather
exotic, and you can safely ignore me. :)

When setting "encoding=utf-8", any new files will suddenly be utf-8 as
well. For "ordinary" windows users, this may not be the desired result.
What I was getting at was that *perhaps* the default fileencoding should be
"cp####" in this case, unless the user explicitly sets it to something else
(presumably utf-8). Before you object: yes, that's silly. Why use
"encoding=utf-8" if you still want to create new files as ANSI? Well, quite
a few windows applications don't do UTF-8. But using UTF-8 internally still
allows users to *transparently* edit existing unicode/utf-8 files without
conversions.

Anyway, I digress. This thought of mine was not that bright. Just forget it.
>> - Also, my vim (6.2) defaults to "latin1", not my current codepage. That
>> would indicate that the ACP detection does not work.
>
> Where does it use "latin1"? Not in 'encoding', I suppose.

Yes. Without a _vimrc, I get "encoding=latin1". Thus changing the encoding
only has funny effects.
> Mostly it's quite a bit more complicated. Different users have different
> situations, it is hard to think of solutions that work for most people.

Well, if you decide to make the unicode implementation work as it should,
most people should be able to get what they want. It might involve a bit
of tweaking, but nothing more.
> The problem is that conversions to/from Unicode only work when you know
> the encoding of the text you are converting. The encoding isn't always
> known. Vim sometimes uses "latin1", so that you at least get 8-bit
> clean editing, even though the actual encoding is unknown.

I claim that on Windows, you should always have a good idea of the
encoding. It's either explicitly set by the user, "cp####", or unicode.
Windows has good support for converting ANSI to unicode, so this should be
a non-issue. And again, as this is about non-UTF-8 data, you already have
this problem anyway, because you are calling the ANSI functions with the
"unknown" data. That it works should prove my point. ;-)

But in the universal case, I agree with you.
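The "8-bit clean" point about latin1 can be sketched like this (Python again; note that Python's strict cp1252 codec rejects the undefined bytes, whereas the Windows conversion functions may map them anyway):

```python
# latin1 defines all 256 byte values, so decoding never fails, even when
# the real encoding of the text is unknown; that is the 8-bit clean
# fallback described above.
for b in range(256):
    bytes([b]).decode("latin1")      # never raises

# A real code page is not 8-bit clean: cp1252 leaves a few bytes
# (e.g. 0x81) undefined, so strict decoding can fail.
try:
    b"\x81".decode("cp1252")
    clean = True
except UnicodeDecodeError:
    clean = False
assert not clean
```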
>> On Win9x, vim should use ANSI apis. The only thing missing is again the
>> encoding/decoding, although it's trickier with the ANSI apis. There are
>> many cases where a user would enter UTF-8 stuff that doesn't smoothly
>> convert to the current CP. I think vim's current code should detect that.
>
> You can use a few Unicode functions on Win9x, we already do. I don't
> see a reason to change this.

Sorry, I didn't want to imply that. I agree that we should stick to the
unicode functions that are supported on Win9x, and only revert to ANSI
where necessary.
Camillo Särs <+ged+@...> ** Aim for the impossible and you
<http://www.iki.fi/+ged> ** will achieve the improbable.
PGP public key available **