Re: Filename encodings under Win32

Camillo Särs
Oct 13, 2003

      Bram Moolenaar wrote:
      > On Windows NT/XP there are also restrictions, especially when using
      > non-NTFS filesystems.

      Right, I forgot about those. AFAIK, the functions do not fail silently in
      those cases, so it's just (yet) more work. Essentially, file names then
      come from a restricted charset (code page limits).

      > There was a discussion about this in the Linux UTF-8 maillist a
      > long time ago. There was no good universal solution
      > for handling filenames that they could come up with.

      I bet. For many systems, the current behavior is adequate even if
      technically speaking wrong. I'm not trying to propose a universal
      solution, I'm just advocating the view that on win32, vim should do the
      "windows thing" with unicode/utf-8.

      > Vim could use Unicode functions for accessing files, but this will be a
      > huge change.

      Why so? The code earlier in this thread probably did much of what is
      needed. It also involved numerous other changes, which I ignored. I'm not
      being nosy, I'm just curious why this would be a "huge change". It's not
      the file contents we are getting at, it's the filenames (and the GUI).
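
      To be concrete, here's roughly what I have in mind. Just a sketch, and
      the helper name is mine, not anything from the vim sources:

          #include <windows.h>

          /* Hypothetical helper (not vim code): open a file whose name is
           * a UTF-8 byte string by going through the wide-char API instead
           * of the ANSI one. */
          static HANDLE utf8_open(const char *utf8name)
          {
              WCHAR wname[MAX_PATH];

              /* First convert the UTF-8 name to UTF-16. */
              if (MultiByteToWideChar(CP_UTF8, 0, utf8name, -1,
                                      wname, MAX_PATH) == 0)
                  return INVALID_HANDLE_VALUE;

              return CreateFileW(wname, GENERIC_READ, FILE_SHARE_READ,
                                 NULL, OPEN_EXISTING,
                                 FILE_ATTRIBUTE_NORMAL, NULL);
          }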

      Also note that when using the native code page as the encoding (read:
      latin1), the ANSI functions do work as expected. So the fixes would
      only need to concern the UTF-8 encoding, if you get picky. :)

      > Requires lots of testing.

      That's unicode for you. However, deriving a decent test set from
      available unicode test files should be a fairly straightforward thing.

      > Main problem is when 'encoding' is not a Unicode encoding, then conversions
      > need to be done, which may fail.

      But what I assume you are doing now is even worse, isn't it? Essentially
      you are feeding some user-selected encoding to functions that expect
      ANSI strings. How's that for "a lot of testing"?

      Conversions from almost any encoding to unicode should work. I would not
      expect major trouble there. And note that if the conversion from the
      encoding to unicode fails, I expect that the current usage would fail
      even more severely. And there haven't been reports of that, have there?
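
      A failed conversion doesn't even have to be silent. Something like this
      would do; MB_ERR_INVALID_CHARS makes the call fail on bytes that are
      invalid in the source codepage (again just my sketch):

          #include <windows.h>

          /* Sketch: convert a byte string in a known codepage to UTF-16,
           * failing loudly on invalid input bytes instead of silently
           * producing garbage. Returns the number of WCHARs, 0 on failure. */
          static int to_unicode(UINT codepage, const char *in,
                                WCHAR *out, int outlen)
          {
              return MultiByteToWideChar(codepage, MB_ERR_INVALID_CHARS,
                                         in, -1, out, outlen);
          }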

      There certainly are tricky encodings that could cause problems. However,
      I'm mostly concerned with the basic use case of utf-8 and
      "fileencodings=ucs-bom,utf-8,latin1", all under a code page of cp1252.

      > If you use filenames that cannot be represented in the active codepage,
      > you probably have problems with other programs.

      But I have filenames that can be represented in the active code page
      (å.txt), but which get encoded into incompatible UTF-8 characters!
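
      To spell out the bytes, assuming cp1252 as the active codepage:

          #include <stdio.h>

          /* Illustration: the same name "å.txt" as raw bytes in each
           * encoding. Handing the UTF-8 bytes to an ANSI (A-) function
           * makes Windows read them as cp1252, i.e. as "Ã¥.txt". */
          int main(void)
          {
              const unsigned char cp1252[] = { 0xE5, '.', 't', 'x', 't' };
              const unsigned char utf8[] = { 0xC3, 0xA5, '.', 't', 'x', 't' };

              printf("cp1252: %u bytes\n", (unsigned)sizeof(cp1252));
              printf("utf-8:  %u bytes\n", (unsigned)sizeof(utf8));
              return 0;
          }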

      > Thus sticking with the active codepage functions isn't too bad.

      It would, if it worked that way, but it doesn't. Setting
      "encoding=utf-8" changes that behavior - only us-ascii remains usable
      in filenames.

      > But then Vim needs to convert from 'encoding' to the active codepage!

      That would help most users. Including me. But it would not be the
      "ultimate" solution to unicode on win32, as it would still cause trouble
      with characters outside the codepage. As I see it, the easiest fix is
      actually using the unicode API, as there are fewer (or no) conversion
      failures that way.
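
      And the lossy case is at least detectable. A sketch, assuming the text
      is already in UTF-16:

          #include <windows.h>

          /* Sketch of the 'encoding' -> active codepage conversion:
           * usedDefault tells you a character could not be represented in
           * the codepage (the "trouble" case). Returns the number of bytes
           * written, or 0 on failure or lossy conversion. */
          static int to_acp(const WCHAR *in, char *out, int outlen)
          {
              BOOL usedDefault = FALSE;
              int n = WideCharToMultiByte(CP_ACP, 0, in, -1, out, outlen,
                                          NULL, &usedDefault);

              return (n == 0 || usedDefault) ? 0 : n;
          }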

      > The file names are handled as byte strings. Thus so long as you use the
      > right bytes it should work. Problem is when you are typing/editing with
      > a different encoding from the active codepage.

      My point exactly! :)

      > Why would 'termencoding' be "utf-8"? This won't work, unless you are
      > using an xterm on MS-Windows.

      Yeah, but that's what you get if you just blindly do "set
      encoding=utf-8". Took me a while to figure that one out. I need to do
      "set termencoding=cp1252", or do "let &termencoding = &encoding" before
      changing 'encoding'. Not exactly transparent to non-experts.

      > The default 'termencoding' is empty, which means 'encoding' is used.
      > There is no better default.

      On Windows, I'd say "detect active code page" is the right choice.
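
      And the detection is nearly a one-liner. Something like this (my
      sketch, not vim code):

          #include <stdio.h>
          #include <windows.h>

          /* Sketch: build a "cp####" name from the active codepage, which
           * is what a detected Windows default for 'termencoding' could
           * look like. */
          int main(void)
          {
              char term_enc[16];

              sprintf(term_enc, "cp%u", GetACP());  /* e.g. "cp1252" */
              printf("detected: %s\n", term_enc);
              return 0;
          }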

      > When you change 'encoding' you might have to change 'termencoding' as
      > well, but this depends on your situation.

      As noted above, that's the unintuitive behavior I was getting at. A
      windows user, knowing that unicode is the native charset, does a "set
      encoding=utf-8" and expects things to work. They don't, but depending on
      the language, it may take a while before a non-ascii character is entered.

      >>- The default fileencoding breaks when "going UTF-8", most probably a
      >>better behavior would be to default to the ACP always.
      >
      > 'fileencoding' is set when reading a file. Perhaps you mean
      > 'fileencodings'? This one needs to be tweaked by the user, because it
      > depends on what kind of files you edit. Main problem is that an ASCII
      > file can be any encoding, Vim can't detect what it is, thus the user has
      > to specify what he wants Vim to do with it.

      Yes, I was unclear. Let me elaborate, although this point is rather
      exotic, and you can safely ignore me. :)

      When setting "encoding=utf-8", any new files will suddenly be utf-8 as
      well. For "ordinary" windows users, this may not be the desired result.
      What I was getting at was that *perhaps* the default fileencoding should be
      "cp####" in this case, unless the user explicitly sets it to something else
      (presumably utf-8). Before you object, yes, that's silly.

      Why use "encoding=utf-8" if you still want to create new files as ANSI?
      Well, quite a few windows applications don't do UTF-8. But using UTF-8
      internally still allows users to *transparently* edit existing
      unicode/utf-8 files without conversions.

      Anyway, I digress. This thought of mine was not that bright. Just forget it.

      >>- Also, my vim (6.2) defaults to "latin1", not my current codepage. That
      >>would indicate that the ACP detection does not work.
      >
      > Where does it use "latin1"? Not in 'encoding', I suppose.

      Yes. Without a _vimrc, I get:
      encoding=latin1
      fileencodings=ucs-bom
      termencoding=

      Thus changing the encoding only has funny effects.

      > Mostly it's quite more complicated. Different users have different
      > situations, it is hard to think of solutions that work for most people.

      Well, if you decide to make the unicode implementation work as it should,
      most people should be able to get what they want. It might involve a bit
      of tweaking, but nothing more.

      > The problem is that conversions to/from Unicode only work when you know
      > the encoding of the text you are converting. The encoding isn't always
      > known. Vim sometimes uses "latin1", so that you at least get 8-bit
      > clean editing, even though the actual encoding is unknown.

      I claim that on Windows, you should always have a good idea of the
      encoding. It's either explicitly set by the user, "cp####", or unicode.
      Windows has good support for converting ANSI to unicode, so this should be
      a non-issue. And again, as this is about non-UTF-8 data, you already have
      this problem anyway, because you are calling the ANSI functions with the
      "unknown" data. That it works should prove my point. ;-)

      But in the universal case, I agree with you.

      >>On Win9x, vim should use ANSI APIs. The only thing missing is again the
      >>encoding/decoding, although it's trickier with the ANSI APIs. There are
      >>many cases where a user would enter UTF-8 stuff that doesn't smoothly
      >>convert to the current CP. I think vim's current code should detect that
      >>easily.
      >
      > You can use a few Unicode functions on Win9x, we already do. I don't
      > see a reason to change this.

      Sorry, I didn't want to imply that. I agree that we should stick to the
      unicode functions that are supported on Win9x, and only revert to ANSI
      "when forced".

      Camillo
      --
      Camillo Särs <+ged+@...> ** Aim for the impossible and you
      <http://www.iki.fi/+ged> ** will achieve the improbable.
      PGP public key available **