
Re: Filename encodings under Win32

  • Bram Moolenaar
    Message 1 of 29, Oct 14, 2003
      Camillo wrote:

      > While that may sound attractive at first, I would strongly dissuade from
      > that solution. I consider it to be a myth that using multilingual
      > filenames on Windows is hard. Under NT, it should be a breeze for any
      > application that is even slightly Unicode-aware. When you decide to make
      > changes in Vim, it makes sense to look to the future and try to go the
      > "Unicode" way. XP Home Edition is gaining ground - fast.

      Vim not only supports Unicode but also many other encodings. If Vim
      used only Unicode it would be simple, but that's not the situation.
      On top of that, Vim is also used on many other systems, and we try to
      make it work the same way everywhere.

      > > This is still complicated, but probably requires fewer changes than using
      > > Unicode functions for all file access.
      >
      > Why? I don't get it. You don't need to use Unicode functions for anything
      > except stuff that accepts strings. The current implementation is wrong,
      > because it feeds "encoding" text to ANSI functions. If you change it, I
      > don't see why doing a conversion to Unicode would be any different than a
      > conversion to ANSI, other than the fact than converting to ANSI is riskier.
      >
      > <http://www.microsoft.com/globaldev/> contains a lot of useful info. Quote:
      >
      > "All Win32 APIs that take a text argument either as an input or output
      > variable have been provided with a generic function prototype and two
      > definitions: a version that is based on code pages or ANSI (called "A") to
      > handle code page-based text argument and a wide version (called "W ") to
      > handle Unicode."

      Eh, what happens when I use fopen() or stat()? There is no ANSI or wide
      version of these functions. And certainly not one that also works on
      non-Win32 systems. And when using the wide version, conversion needs to
      be done from 'encoding' to Unicode, so the conversion has to be there
      as well. That's going to be a lot of work (many #ifdefs) and will
      probably introduce new bugs.

      > For 9x, you might be interested in the "Microsoft Layer for Unicode"
      >
      > > I only foresee trouble when 'encoding' is set to a non-Unicode
      > > codepage different from the active codepage and using
      > > a filename that contains non-ASCII characters.
      > > Perhaps this situation is too weird to take into account?
      >
      > As long as you know the correct code page, you can use Windows APIs to
      > convert correctly. They take the code page as an argument.

      As mentioned before, we are not always sure what encoding the text has.
      Conversion is then likely to fail. This especially happens for 8-bit
      encodings; there is no way to automatically detect which encoding such
      files use.
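      The ambiguity is easy to demonstrate: the same byte value is a valid
      character in more than one 8-bit encoding, so the bytes alone cannot
      tell you which encoding was meant. A minimal sketch in C (the two
      table fragments follow the standard Latin-1 and CP1251 layouts; the
      function names are illustrative, not Vim code):

```c
/* In Latin-1, bytes 0xA0-0xFF map directly to U+00A0-U+00FF,
 * so every byte value is a valid character. */
static unsigned latin1_to_codepoint(unsigned char b)
{
    return b;
}

/* In CP1251, bytes 0xE0-0xFF are the lowercase Cyrillic letters
 * U+0430-U+044F; other ranges are omitted in this sketch. */
static unsigned cp1251_to_codepoint(unsigned char b)
{
    if (b >= 0xE0)
        return 0x0430 + (b - 0xE0);
    return b;   /* ASCII range passes through unchanged */
}
```

      The byte 0xE4, for example, decodes to U+00E4 ('ä') under Latin-1 but
      to U+0434 ('д') under CP1251; both are perfectly plausible text, so no
      automatic check can pick the right one.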

      I think we need a smart solution that doesn't attempt to handle all
      situations but works predictably.

      --
      hundred-and-one symptoms of being an internet addict:
      218. Your spouse hands you a gift wrapped magnet with your PC's name
      on it and you accuse him or her of genocide.

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
    • Glenn Maynard
      Message 2 of 29, Oct 14, 2003
        > While that may sound attractive at first, I would strongly dissuade from
        > that solution. I consider it to be a myth that using multilingual
        > filenames on Windows is hard. Under NT, it should be a breeze for any

        It's not at all a myth if you want code that is 1: portable and 2: works
        on 9x, too. (If you can deal with nonportable code, you can use Windows's
        TCHAR mechanism, and if you don't care about anything but NT, you can write
        a UTF-16-only app. Neither of these is the case here, though.)

        It's not "hard", it's just "incredibly annoying".

        On Tue, Oct 14, 2003 at 02:20:27PM +0200, Bram Moolenaar wrote:
        > This is still complicated, but probably requires fewer changes than using
        > Unicode functions for all file access. I only foresee trouble when
        > 'encoding' is set to a non-Unicode codepage different from the active
        > codepage and using a filename that contains non-ASCII characters.
        > Perhaps this situation is too weird to take into account?

        If "encoding" is not the ACP codepage, then the main problem is that the
        user can enter characters that Vim simply can't put into a filename
        (and in 9x, that the system can't, either).

        I'd just do a conversion, and if the conversion fails, warn appropriately.
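        One way to make "convert, and warn if it fails" concrete is to flag
        any character that has no mapping in the target encoding instead of
        silently substituting '?'. A hypothetical helper (the name and the
        ASCII-only target are illustrative; a real version would map into
        the actual codepage):

```c
#include <stdbool.h>
#include <stddef.h>

/* Convert an array of Unicode code points to a single-byte encoding
 * (plain ASCII here, for illustration).  Unmappable characters become
 * '?' and set *lossy, so the caller can warn the user instead of
 * quietly operating on a mangled filename. */
static bool to_single_byte(const unsigned *cp, size_t n,
                           char *out, bool *lossy)
{
    *lossy = false;
    for (size_t i = 0; i < n; i++) {
        if (cp[i] < 0x80) {
            out[i] = (char)cp[i];
        } else {
            out[i] = '?';       /* no mapping in the target encoding */
            *lossy = true;
        }
    }
    out[n] = '\0';
    return !*lossy;
}
```

        For "foo ♡" (U+2661 has no single-byte equivalent) this produces
        "foo ?" with the lossy flag set, which is exactly the point at which
        the editor should warn rather than proceed.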

        > Eh, what happens when I use fopen() or stat()? There is no ANSI or wide
        > version of these functions. And certainly not one that also works on
        > non-Win32 systems. And when using the wide version, conversion needs to
        > be done from 'encoding' to Unicode, so the conversion has to be there
        > as well. That's going to be a lot of work (many #ifdefs) and will
        > probably introduce new bugs.

        It's not that much work. Windows has _wfopen and _wstat. Vim already
        has those abstracted (mch_fopen, mch_stat), so conversions would only
        happen in one place (and in a place that's intended to be platform-
        specific, mch_*). I believe the code I linked earlier did exactly this.

        The only thing needed is sane error recovery.
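        A sketch of what such a wrapper could look like, assuming 'encoding'
        is UTF-8 on the Win32 branch (the name mch_fopen is Vim's; this body
        is illustrative, not Vim's actual code):

```c
#include <stdio.h>

#ifdef _WIN32
# include <windows.h>
#endif

/* Open a file whose name is in the editor's internal encoding.  On
 * Win32 the name is converted to UTF-16 and the wide API is used;
 * elsewhere the byte string is passed through unchanged. */
FILE *mch_fopen(const char *name, const char *mode)
{
#ifdef _WIN32
    WCHAR wname[MAX_PATH], wmode[16];
    /* CP_UTF8 assumes 'encoding' is UTF-8; other values of 'encoding'
     * would need their own conversion step, and a failed conversion
     * is the point at which to warn the user. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            name, -1, wname, MAX_PATH) == 0)
        return NULL;
    MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16);
    return _wfopen(wname, wmode);
#else
    return fopen(name, mode);   /* non-Win32: no conversion needed */
#endif
}
```

        This keeps every conversion inside the one platform-specific
        function; mch_stat would wrap _wstat the same way.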

        > Yep, using conversions means failure is possible. And failure mostly
        > means the text is in a different encoding than expected. It would take
        > some time to figure out how to do this in a way that the user isn't
        > confused.

        Well, bear in mind the non-ACP case that already exists. If I create
        "foo ♡.txt", and try to edit it with Vim, it edits "foo ?.txt" (which
        it can't write, either, since "?" is an invalid character in Windows
        filenames). I'd suggest that editing a file with an invalid character
        (eg. invalid SJIS sequence) behave identically to editing a file with
        a valid character that can't be referenced (eg. "foo ♡.txt").

        --
        Glenn Maynard
      • Camillo Särs
        Message 3 of 29, Oct 14, 2003
          Glenn Maynard wrote:
          > If "encoding" is not the ACP codepage, then the main problem is that the
          > user can enter characters that Vim simply can't put into a filename
          > (and in 9x, that the system can't, either).
          >
          > I'd just do a conversion, and if the conversion fails, warn appropriately.

          Agreed. There's no way around that.

          > It's not that much work. Windows has _wfopen and _wstat. Vim already
          > has those abstracted (mch_fopen, mch_stat), so conversions would only
          > happen in one place (and in a place that's intended to be platform-
          > specific, mch_*). I believe the code I linked earlier did exactly this.
          >
          > The only thing needed is sane error recovery.

          Sounds very promising. It would be really great if it turns out that the
          changes are fairly minor. That way there's a chance they would get
          implemented. :)

          If you decide to try the proposed changes out, I'm prepared to do some
          testing on a Win32 binary build. Sorry, can't build myself. :(

          Camillo
          --
          Camillo Särs <+ged+@...> ** Aim for the impossible and you
          <http://www.iki.fi/+ged> ** will achieve the improbable.
          PGP public key available **
        • Bram Moolenaar
          Message 4 of 29, Oct 15, 2003
            Glenn Maynard wrote:

            > On Tue, Oct 14, 2003 at 02:20:27PM +0200, Bram Moolenaar wrote:
            > > This is still complicated, but probably requires fewer changes than using
            > > Unicode functions for all file access. I only foresee trouble when
            > > 'encoding' is set to a non-Unicode codepage different from the active
            > > codepage and using a filename that contains non-ASCII characters.
            > > Perhaps this situation is too weird to take into account?
            >
            > If "encoding" is not the ACP codepage, then the main problem is that the
            > user can enter characters that Vim simply can't put into a filename
            > (and in 9x, that the system can't, either).
            >
            > I'd just do a conversion, and if the conversion fails, warn appropriately.

            It's more complicated than that. You can have filenames in the ACP,
            'encoding' and Unicode. Filenames are stored in various places inside
            Vim, which encoding is used for each of them? Obviously, a filename
            stored in buffer text and registers has to use 'encoding'.

            It's less obvious what to use for internal structures, such as
            curbuf->b_ffname. When 'encoding' is a Unicode encoding we can use
            UTF-8, which can be converted to anything else. That also works
            when the active codepage is not Unicode; we can use the wide
            functions then.

            When 'encoding' is the active codepage (this is the default, should
            happen a lot), we can use the active codepage. That avoids conversions
            (which may fail). No need to use wide functions then.

            The real problem is when 'encoding' is not the active codepage and it's
            also not a Unicode encoding. We could simply skip the conversion then.
            That doesn't work properly for non-ASCII characters, but it's how it
            already works right now. The right way would be to convert the file
            name to Unicode and use the wide functions.

            I guess this means all filenames inside Vim are in 'encoding'. Where
            needed, conversion needs to be done from/to Unicode and the wide
            functions are to be used then.

            The main thing to implement now is using the wide functions when
            'encoding' is UTF-8. This only requires a simple conversion between
            UTF-8 and UTF-16. I'll be waiting for a patch...
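            That conversion is indeed small: decode each UTF-8 sequence to a
            code point, then emit one UTF-16 code unit for the BMP or a
            surrogate pair above it. A standalone sketch (it assumes
            well-formed input; a real patch must also reject overlong and
            truncated sequences):

```c
#include <stddef.h>

/* Convert a NUL-terminated UTF-8 string to NUL-terminated UTF-16
 * code units.  Returns the number of code units written (excluding
 * the terminator).  Assumes valid, well-formed UTF-8 input. */
static size_t utf8_to_utf16(const char *in, unsigned short *out)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    while (*s) {
        unsigned cp;
        if (*s < 0x80) {                      /* 1-byte sequence */
            cp = *s++;
        } else if ((*s & 0xE0) == 0xC0) {     /* 2-byte sequence */
            cp = (unsigned)(*s++ & 0x1F) << 6;
            cp |= (unsigned)(*s++ & 0x3F);
        } else if ((*s & 0xF0) == 0xE0) {     /* 3-byte sequence */
            cp = (unsigned)(*s++ & 0x0F) << 12;
            cp |= (unsigned)(*s++ & 0x3F) << 6;
            cp |= (unsigned)(*s++ & 0x3F);
        } else {                              /* 4-byte sequence */
            cp = (unsigned)(*s++ & 0x07) << 18;
            cp |= (unsigned)(*s++ & 0x3F) << 12;
            cp |= (unsigned)(*s++ & 0x3F) << 6;
            cp |= (unsigned)(*s++ & 0x3F);
        }
        if (cp < 0x10000) {                   /* BMP: one code unit */
            out[n++] = (unsigned short)cp;
        } else {                              /* surrogate pair */
            cp -= 0x10000;
            out[n++] = (unsigned short)(0xD800 | (cp >> 10));
            out[n++] = (unsigned short)(0xDC00 | (cp & 0x3FF));
        }
    }
    out[n] = 0;
    return n;
}
```

            The resulting code units can be handed straight to the wide Win32
            functions (which take UTF-16), so this one routine is all the
            conversion the UTF-8 case needs.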

            --
            hundred-and-one symptoms of being an internet addict:
            231. You sprinkle Carpet Fresh on the rugs and put your vacuum cleaner
            in the front doorway permanently so it always looks like you are
            actually attempting to do something about that mess that has amassed
            since you discovered the Internet.

            /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
            /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
            \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
            \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///