
Re: Filename encodings under Win32

  • Glenn Maynard
    Oct 12, 2003
      On Mon, Oct 13, 2003 at 05:21:04AM +0200, Tony Mechelynck wrote:
      > I understood you as meaning that the program-default setting should be
      > Unicode. I beg to differ, however. Or maybe I misunderstood what you were
      > saying. And whatever the program-default settings, Vim should (IMHO) work in
      > as constant a manner as possible across all platforms.

      I believe that the *internal* encoding ("encoding") can, if the various
      bugs are fixed, reasonably be UTF-8, unless there's outcry about memory
      usage. I agree that it's very important that keyboard input, file
      reading and writing, and so on operate in the ACP by default.
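
      Concretely, "internal UTF-8 with ACP I/O" just means a conversion step at
      the boundary. Something along these lines (a sketch only, not Vim's code;
      the helper name is made up and error handling is omitted):

      #include <windows.h>
      #include <stdlib.h>

      /* Sketch: turn bytes read in the ANSI codepage (GetACP()) into UTF-8
       * for the internal buffer, going through UTF-16 since that's what the
       * Win32 conversion functions speak.  Helper name is made up. */
      char *acp_to_utf8(const char *acp, int acp_len)
      {
          int wlen = MultiByteToWideChar(CP_ACP, 0, acp, acp_len, NULL, 0);
          WCHAR *wide = malloc(wlen * sizeof(WCHAR));
          MultiByteToWideChar(CP_ACP, 0, acp, acp_len, wide, wlen);

          int u8len = WideCharToMultiByte(CP_UTF8, 0, wide, wlen, NULL, 0, NULL, NULL);
          char *utf8 = malloc(u8len + 1);
          WideCharToMultiByte(CP_UTF8, 0, wide, wlen, utf8, u8len, NULL, NULL);
          utf8[u8len] = '\0';

          free(wide);
          return utf8;    /* caller frees */
      }

      Writing goes the other way (UTF-8 back to the ACP), which is where
      characters outside the codepage get lost.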

      > Now let's say I change 'encoding' to "utf-8". With 'termencoding' left empty
      > (the default), gvim now suddenly expects the keyboard to be sending UTF-8
      > byte sequences (because an empty 'termencoding' means it takes the same
      > value as whatever is the current value of 'encoding'). Windows, however, is

      Right: I believe this is poor behavior for Windows. Windows input is
      always in the ACP[1], and if it's not, it should always be possible to find
      out what it is. (That is, I don't know exactly what Windows does if you
      have multiple keyboard mappings and change languages, but it shouldn't
      require special changing of tenc.)
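
      For what it's worth, both codepages are easy to query. Something like this
      (a sketch, assuming the GetKeyboardLayout/GetLocaleInfo route is the right
      one for per-layout codepages) is all it should take:

      #include <windows.h>
      #include <stdio.h>

      /* Sketch: print the system ANSI codepage and the default ANSI codepage
       * of whatever keyboard layout is active in this thread. */
      int main(void)
      {
          char cp[8];
          HKL layout = GetKeyboardLayout(0);   /* current thread's active layout */
          LCID lcid = MAKELCID(LOWORD(layout), SORT_DEFAULT);

          printf("system ACP: %u\n", GetACP());
          if (GetLocaleInfoA(lcid, LOCALE_IDEFAULTANSICODEPAGE, cp, sizeof(cp)))
              printf("keyboard layout's ANSI codepage: %s\n", cp);
          return 0;
      }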

      For example, Vim always expects data from the IME in the encoding it
      sends (Unicode). termencoding is not used. If I set tenc=cp1252, I
      can still enter Japanese kanji with the IME--Vim knows that data is
      always in the same format, and handles it correctly, even though it's
      not CP1252. Keyboard input is the same: the encoding should always
      be predictable.
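
      That is, the IME path already works the way I'm describing for the
      keyboard: the result string is fetched with the wide IMM call, so it
      arrives as UTF-16 no matter what tenc says. Roughly (a sketch, not Vim's
      actual code):

      #include <windows.h>
      #include <imm.h>      /* link with imm32 */
      #include <stdlib.h>

      /* Sketch: on WM_IME_COMPOSITION with GCS_RESULTSTR, read the composed
       * text with the wide API -- it's UTF-16 regardless of the ACP or of
       * any 'termencoding' setting. */
      void on_ime_result(HWND hwnd)
      {
          HIMC imc = ImmGetContext(hwnd);
          if (imc == NULL)
              return;

          LONG bytes = ImmGetCompositionStringW(imc, GCS_RESULTSTR, NULL, 0);
          if (bytes > 0)
          {
              WCHAR *text = malloc(bytes + sizeof(WCHAR));
              ImmGetCompositionStringW(imc, GCS_RESULTSTR, text, bytes);
              text[bytes / sizeof(WCHAR)] = L'\0';
              /* ...convert to 'encoding' and insert into the buffer... */
              free(text);
          }
          ImmReleaseContext(hwnd, imc);
      }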

      (I don't know if anyone is using tenc in Windows to do weird things;
      I can't think of any practical use for intentionally setting tenc to
      a value that doesn't match the ACP.)

      > It may be useful in itself; but until and unless it is indeed (as you
      > suggest) incorporated in mainline Vim source (a possibility towards which
      > I'm not averse as long as it doesn't break something else), it "doesn't
      > exist" from where I sit.

      That's nice, but not relevant. :) Again, I wasn't suggesting that anyone
      use the Vim script I supplied; I was only using it to demonstrate what the
      internal defaults could be.

      > Indeed. The difference is virtually nil for English; it is small but nonzero
      > for other Latin-alphabet languages, it approaches 1 to 2 for other-alphabet
      > languages like Greek or Russian (a little less than that because of spaces,
      > commas, full stops, etc.); I don't know the ratio for languages like Hindi
      > (with Devanagari script) or Chinese (hanzi).

      The penalty is about 50% for CJK languages (two-byte sequences in the
      native encodings become three-byte sequences in UTF-8).
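
      Easy to check: the same kanji cost two bytes each in CP932 and three each
      in UTF-8. A throwaway test (the numbers are the point, not the code):

      #include <windows.h>
      #include <stdio.h>

      /* Sketch: byte cost of the same three kanji in CP932 vs. UTF-8:
       * 6 bytes vs. 9 bytes, i.e. a 50% increase. */
      int main(void)
      {
          const WCHAR kanji[] = L"\u685c\u685c\u685c";   /* three copies of 桜 */
          int n = (int)(sizeof(kanji) / sizeof(WCHAR)) - 1;

          printf("CP932: %d bytes, UTF-8: %d bytes\n",
                 WideCharToMultiByte(932, 0, kanji, n, NULL, 0, NULL, NULL),
                 WideCharToMultiByte(CP_UTF8, 0, kanji, n, NULL, 0, NULL, NULL));
          return 0;
      }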

      > By the way: what do you mean by ACP? The currently "active code page" maybe?

      ANSI codepage. It's the system codepage, set in the "regional settings"
      control panel (or whatever; MS changes the control panels weekly). It's
      the codepage that the "*A" (ANSI) functions expect (which are the ones Vim
      uses, for the most part). Essentially, the ACP is to Windows 9x as
      "encoding" is to Vim. In NT, everything is UCS-2 internally--or
      is it UTF-16?--and the "*A" functions convert to and from the ACP.
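
      In other words, an "*A" entry point is conceptually just a wrapper that
      converts through the ACP and calls the wide version. A simplified sketch,
      not the real kernel32 code:

      #include <windows.h>

      /* Simplified sketch of what an "*A" call amounts to on NT: interpret
       * the caller's bytes in the ANSI codepage and hand UTF-16 to the real,
       * wide implementation. */
      HANDLE CreateFileA_sketch(const char *name_in_acp)
      {
          WCHAR wname[MAX_PATH];
          MultiByteToWideChar(CP_ACP, 0, name_in_acp, -1, wname, MAX_PATH);
          return CreateFileW(wname, GENERIC_READ, FILE_SHARE_READ, NULL,
                             OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
      }

      Anything in the caller's bytes that isn't valid in the ACP gets mangled on
      the way through, which is exactly the filename problem below.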

      In a sense, MS did with NT what I wish Vim would do--standardize on Unicode
      internally, to make the internals simpler, in a way that is transparent
      to users.

      > Hm. Your "kanji in filenames" issue makes me think: could that be related to
      > the fact that my Netscape 7 cannot properly handle Cyrillic letters between
      > <title></title> HTML tags (what sits there displays on the title bar, and
      > anything out-of-the-way is accepted but doesn't display properly, IIRC not
      > even with a <meta> tag specifying that the page is in UTF-8) but can show
      > them with no problems in body text, for instance between <H1></H1> (where
      > the title could appear again, this time to be displayed on top of the text
      > inside the browser window)? But this paragraph may be drifting off-topic.

      It's related, but not exactly the same.

      Vim's problem with titlebars is that it's not converting titlebar
      strings to the ACP. ("桜.txt" shows up as <8d><f7>.txt, and 8DF7 is
      the CP932 (Shift-JIS) value of 桜, not its Unicode value (which is
      U+685C); I'm not entirely sure how that's happening and haven't
      looked at the code.) Fixing this will allow
      displaying characters in the ANSI codepage: a system set to Japanese
      will be able to display Kanji, but not Arabic.
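
      The fix for that case is just a conversion before the ANSI call; something
      like this (made-up helper name, fixed-size buffers, not Vim's code):

      #include <windows.h>

      /* Sketch: with 'encoding' set to utf-8, re-encode the title into the
       * ANSI codepage before handing it to the (ANSI) window.  Characters
       * outside the ACP still come out as '?'. */
      void set_title_acp(HWND hwnd, const char *title_utf8)
      {
          WCHAR wide[256];
          char acp[256];

          MultiByteToWideChar(CP_UTF8, 0, title_utf8, -1, wide, 256);
          WideCharToMultiByte(CP_ACP, 0, wide, -1, acp, sizeof(acp), NULL, NULL);
          SetWindowTextA(hwnd, acp);
      }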

      For displaying full Unicode, it needs to test if Unicode is available,
      create a Unicode window (instead of an ANSI window), and set the title
      with the corresponding wide function. This isn't too hard, but it does
      take more work and a great deal more testing (to make sure it doesn't
      break anything in 9x). This would be nice, but it's above and beyond
      "don't break anything in UTF-8 that works in the normal ANSI codepage".

      Whoops. I just tried saving "桜.txt", and ended up with "(garbage)÷.txt".
      That explains the "<8d><f7>.txt". Looks like file saving isn't working
      right when enc=utf-8. This is a much more serious bug, but not one I'm
      up to fixing right now, as, like you, I rarely edit files with non-ASCII
      characters in the filename. (I'm still using 6.1, though, so this might
      well be fixed.)
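
      The shape of the fix is the same as everywhere else: with enc=utf-8 the
      buffer's filename is UTF-8, so it has to be re-encoded before it reaches
      an ANSI file call, or handed to a wide call directly. A sketch (NT only;
      9x would still need an ACP fallback), not what 6.1 actually does:

      #include <windows.h>
      #include <stdio.h>

      /* Sketch: convert the UTF-8 filename to UTF-16 and open it with the
       * wide CRT call, so the real characters reach the filesystem. */
      FILE *open_for_write(const char *fname_utf8)
      {
          WCHAR wname[MAX_PATH];
          MultiByteToWideChar(CP_UTF8, 0, fname_utf8, -1, wname, MAX_PATH);
          return _wfopen(wname, L"wb");
      }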

      [1] or in Unicode in NT if you use the correct Windows messages, but I
      don't recall which of those work in 9x (probably none)

      --
      Glenn Maynard