
1006 Re: Filename encodings under Win32

  • Bram Moolenaar
    Oct 13, 2003
      Camillo wrote:

      > > Vim could use Unicode functions for accessing files, but this will be a
      > > huge change.
      >
      > Why so? The code earlier in this thread probably did much of what is
      > needed. It also involved numerous other changes, which I ignored. I'm not
      > being nosy, I'm just curious why this would be a "huge change". It's not
      > the file contents we are getting at, it's the filenames (and the GUI).

      Because every fopen(), stat() etc. will have to be changed.

      > Also note that when using the native code page as the encoding (read:
      > latin1), using the ANSI functions does work as expected. So the fixes would
      > only need to concern the UTF-8 encoding, if you get picky. :)

      This only means extra work, since an "if (encoding == ...)" has to be
      added to select between the traditional file access method and the
      Unicode method.
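
A minimal sketch of the kind of dispatch described above, written in Python rather than Vim's C purely for illustration. The names `os_filename` and `open_by_encoding` are hypothetical, and decoding to `str` merely stands in for calling a wide-character file API; this is not how Vim is implemented.

```python
def os_filename(name: bytes, encoding: str):
    """Pick the form of the file name that is handed to the OS.

    Assumption for this sketch: only "utf-8" counts as a Unicode 'encoding'.
    - Unicode 'encoding': decode to str (stand-in for the Unicode API path).
    - Anything else: pass the bytes through untouched (traditional path).
    """
    if encoding.lower() == "utf-8":
        return name.decode("utf-8")
    return name


def open_by_encoding(name: bytes, encoding: str, mode: str = "rb"):
    # Every place that opens a file would have to go through a wrapper
    # like this one, which is where the extra work comes from.
    return open(os_filename(name, encoding), mode)
```

The branch itself is small; the cost is that each file-access call site (open, stat, rename and so on) needs the same treatment.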

      > > Requires lots of testing.
      >
      > That's unicode for you. However, deriving a decent test set using
      > available unicode test files should be a fairly straight-forward thing.

      No, it's actually impossible to test this automatically. It involves
      creating various Win32 environments with code page settings, network
      filesystems and installed libraries. Only end-user tests can discover
      the real problems.

      > > Main problem is when 'encoding' is not a Unicode encoding, then conversions
      > > need to be done, which may fail.
      >
      > But what I assume you are doing now is even worse, isn't it? Essentially
      > you are feeding some user-selected encoding to functions that require
      > ANSI characters. How's that for "a lot of testing"?

      The currently used functions work fine for accessing existing files.
      It's only when typing a new name or when displaying the name that
      problems may occur.

      > Conversions from almost any encoding to unicode should work. I would not
      > expect major trouble there. And note that if the conversion from the
      > encoding to unicode fails, I expect that the current usage would fail even
      > more severely. And there haven't been reports of that, have there?

      Main problem is that sometimes we don't know what the encoding is. In
      that situation you can treat the filename as a sequence of bytes in most
      places, but conversion is impossible. This happens more often than you
      would expect. Put a floppy disk or CD into your computer...

      There is also the situation that Vim uses the active codepage, but the
      file is actually in another encoding that could not be detected. Then
      doing "gf" on a filename will work if you don't do conversion, but it
      will fail if you try converting with the wrong encoding in mind.
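
The failure mode described above can be illustrated with a short Python sketch. The file name bytes and the candidate encodings are made up for the example; the point is only that a wrong guess either fails outright or silently yields a different name, while the unconverted bytes still match the file.

```python
# Bytes as they might come off a disk: "café.txt" written under cp1252.
raw_name = b"caf\xe9.txt"


def try_decode(name: bytes, encoding: str):
    """Return the decoded name, or None if the bytes are invalid in `encoding`."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw_name, "utf-8")     # not valid UTF-8: conversion fails
as_cp437 = try_decode(raw_name, "cp437")    # decodes, but to a different name
as_cp1252 = try_decode(raw_name, "cp1252")  # the right guess recovers the name
untouched = raw_name                        # no conversion: still the same bytes
```

Only the last two forms refer to the file that is actually on disk, and only the last one works without knowing the encoding.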

      > > Thus sticking with the active codepage functions isn't too bad.
      >
      > If it worked that way, but it doesn't. Setting "encoding=utf-8" changes
      > that behavior - only us-ascii is usable in filenames.

      I don't see why. You can use a file selector to open any file and write
      it back under the same name. Vim doesn't need to know the encoding of
      the filename that way.

      If you type a file name in utf-8 it won't work properly, thus you have
      to use another method to obtain the file name. It's clumsy, I know.

      > > But then Vim needs to convert from 'encoding' to the active codepage!
      >
      > That would help most users. Including me. But it would not be the
      > "ultimate" solution to unicode on win32, as it would still cause trouble
      > with characters outside the codepage. As I see it, the easiest fix is
      > actually using the unicode-api, as there are less (or no) conversion
      > failures that way.

      As said above, this only works if we are 100% sure of what encoding the
      text (filename) is in, and we don't always know that.
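
For the opposite direction, converting from 'encoding' to the active codepage, here is a rough Python sketch of the limitation mentioned in the quoted text: characters outside the codepage cannot be represented. The function name `to_codepage` and the choice of cp1252 are assumptions made for the example.

```python
def to_codepage(name: str, codepage: str = "cp1252"):
    """Encode a Unicode file name for a codepage; None if it does not fit."""
    try:
        return name.encode(codepage)
    except UnicodeEncodeError:
        return None


inside = to_codepage("caf\u00e9.txt")               # é exists in cp1252
outside = to_codepage("\u65e5\u672c\u8a9e.txt")     # Japanese text does not
```

Here `inside` is a usable byte string, while `outside` is `None` because no cp1252 byte sequence exists for that name.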

      > > Why would 'termencoding' be "utf-8"? This won't work, unless you are
      > > using an xterm on MS-Windows.
      >
      > Yeah, but that's what you get if you just blindly do "set encoding=utf-8".
      > Took me a while to figure that one out. I need to do "set
      > termencoding=cp1252" first, or the "let &termencoding = &encoding". Not
      > exactly transparent to non-experts.

      Setting 'encoding' is full of side effects. There is a clear warning in
      the docs about this.

      > > The default 'termencoding' is empty, which means 'encoding' is used.
      > > There is no better default.
      >
      > On Windows, I'd say "detect active code page" is the right choice.

      I remember this was proposed before, but I can't remember why we didn't do
      it this way. Windows is different here, since we can find out what the
      active codepage is. On Unix it's not that clear (e.g., depends on what
      options the xterm was started with). Consistency between systems is
      preferred.

      > >>- Also, my vim (6.2) defaults to "latin1", not my current codepage. That
      > >>would indicate that the ACP detection does not work.
      > >
      > > Where does it use "latin1"? Not in 'encoding', I suppose.
      >
      > Yes. Without a _vimrc, I get:
      > encoding=latin1
      > fileencodings=ucs-bom
      > termencoding=
      >
      > Thus changing the encoding only has funny effects.

      Your active codepage must be latin1 then. Vim gets the default from the
      active codepage.
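
As a rough illustration of why Windows differs from Unix here, the sketch below asks the system for its active codepage. On Windows it calls the Win32 function `GetACP()` through `ctypes`; elsewhere it falls back to the locale's preferred encoding, which is only an approximation of what a terminal really uses. The function name `active_codepage` is hypothetical, and this does not claim to mirror Vim's own startup code.

```python
import locale
import sys


def active_codepage() -> str:
    """Best-effort name of the system's active 8-bit encoding."""
    if sys.platform == "win32":
        import ctypes
        # GetACP() returns the ANSI code page number, e.g. 1252.
        return "cp%d" % ctypes.windll.kernel32.GetACP()
    # On Unix there is no single authoritative answer; use the locale setting.
    return locale.getpreferredencoding(False)


print(active_codepage())
```

On a Western European Windows installation this would typically report cp1252, which is close to, but not identical with, latin1.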

      --
      hundred-and-one symptoms of being an internet addict:
      192. Your boss asks you to "go fer" coffee and you come up with 235 FTP sites.

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///