
Re: Filename encodings under Win32

  • Bram Moolenaar
    Message 1 of 29, Oct 13, 2003
      Camillo wrote:

      > > Main problem is that sometimes we don't know what the encoding is.
      >
      > On Windows? I would disagree here. Any filesystem mounted by Windows
      > should be mounted in a way that adheres to Windows naming conventions.
      > We're not discussing file contents here.

      A file name may appear in a file (e.g., a list of files in a README
      file). And I don't know what happens with file names on removable media
      (e.g., a CD). Probably depends on the file system it contains. And
      networked file systems are another problem.

      > > In that situation you can treat the filename as a sequence of bytes in most
      > > places, but conversion is impossible. This happens more often than you
      > > would expect. Put a floppy disk or CD into your computer...
      >
      > So why convert it? :) The current display/saving problems stem from the
      > fact that the file name is interpreted as UTF-8, a coding which Windows
      > does not recognize for file names or strings.

      We need to locate places where the encoding is different from what a
      system function expects. There are still a few things that need to be
      fixed.

      > > There is also the situation that Vim uses the active codepage, but the
      > > file is actually in another encoding that could not be detected. Then
      > > doing "gf" on a filename will work if you don't do conversion, but it
      > > will fail if you try converting with the wrong encoding in mind.
      >
      > AFAIK, Windows will internally convert the path into Unicode if you call
      > the ANSI function. Thus if gf succeeds as you describe, it should succeed
      > if you use the Unicode API as well. In both cases an 8-bit binary string
      > undergoes "cp2unicode" conversion.

      If Vim defaults to the active codepage then conversion to Unicode would
      do the same as using the ANSI function. Thus it's only a problem when
      'encoding' is different from the active codepage. And when 'encoding'
      is a Unicode variant we can use the "W" functions. Still, this means
      all fopen() and stat() calls must be adjusted. When 'encoding' is not
      the active codepage we could either leave the file name untranslated (as
      it's now) or convert it to Unicode. Don't know which one would work
      best...

      > > Your active codepage must be latin1 then. Vim gets the default from the
      > > active codepage.
      >
      > My code page is cp1252. It's not latin1 (iso-8859-1). In practice, both
      > are 8-bit-raw.

      cp1252 and latin1 are not identical, but for practical use they can be
      handled as the same encoding. Vim indeed uses this as the "raw" 8-bit
      encoding that avoids messing up your characters when you don't know what
      encoding it actually is.

      --
      hundred-and-one symptoms of being an internet addict:
      194. Your business cards contain your e-mail and home page address.

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
    • Glenn Maynard
      Message 2 of 29, Oct 13, 2003
        Note that I've upgraded, and I'm not having problems with files saving
        incorrectly in enc=utf-8. The remaining problems are mostly cosmetic,
        except for not being able to ":w 漢字.txt" with the ACP being Japanese.

        On Mon, Oct 13, 2003 at 02:25:04PM +0200, Bram Moolenaar wrote:
        > Because every fopen(), stat() etc. will have to be changed.

        I don't think handling Unicode in filenames is worth it in Windows. It
        takes so much work that the only applications I know of that support it
        are ones that are compiled as native Unicode apps. The only exception
        I've seen is FB2k.

        It's certainly useful to be able to have multilingual filenames, but
        Windows makes it so hard that people really wanting to do that probably
        need a new OS.

        > I don't see why. You can use a file selector to open any file and write
        > it back under the same name. Vim doesn't need to know the encoding of
        > the filename that way.

        Consider the case where a filename in NT contains illegal data, eg. an
        invalid two-byte SJIS sequence. When you call NT ANSI system calls, it
        converts the buffers you pass it to WCHAR. That conversion would fail.

        Are you worried about not being able to open files off eg. a slightly
        corrupt/malformed floppy disc containing filenames that won't convert
        cleanly? That seems no worse than not being able to use non-ACP
        filenames. If that works, it seems a poor trade for not being able to
        enter non-ASCII filenames in utf-8. ":w 漢字.txt" responding with
        '"漢字.txt" [New]' and writing the filename correctly seems pretty
        fundamental, for Japanese users on Japanese systems, and that doesn't
        work with enc=utf-8.

        > I remember this was proposed before, I can't remember why we didn't do
        > it this way. Windows is different here, since we can find out what the
        > active codepage is. On Unix it's not that clear (e.g., depends on what
        > options the xterm was started with). Consistency between systems is
        > preferred.

        Windows and Unix handle encodings fundamentally differently, so complete
        consistency means one or the other system not working as well. It seems
        like "consistency to a fault". :)

        Here's what I see, though: Windows APIs are always giving ACP or Unicode
        data. Vim honors that for some code paths: input methods, copying to
        and from the system clipboard. It ignores it and uses Unix paradigms
        for others: filenames, most other ANSI calls.

        The former, in my experience, work consistently; I can enter text with
        the IME in both UTF-8 and CP932, and copy and paste reliably. The
        latter do not: entered filenames don't work, non-ASCII text in the
        titlebar shows <ab> hex values.

        --
        Glenn Maynard
      • Camillo Särs
        Message 3 of 29, Oct 13, 2003
          Bram Moolenaar wrote:
          > A file name may appear in a file (e.g., a list of files in a README
          > file). And I don't know what happens with file names on removable media
          > (e.g., a CD). Probably depends on the file system it contains. And
          > networked file systems is another problem.

          Floppies, CDs, and network file systems are all mounted by Windows,
          and "some" translation of file names happens. AFAIK, you should be
          able to access all files on such file systems using Windows NT naming
          conventions.
          The file names may not be exactly what you anticipated, but they are
          guaranteed to stay constant.

          > We need to locate places where the encoding is different from what a
          > system function expects. There are still a few things that need to be
          > fixed.

          Yup. As I'm not familiar with the vim sources, I don't know how much work
          this would mean in reality. However, the set of functions is or should be
          known, and fairly limited.

          > When 'encoding' is not the active codepage we could either leave
          > the file name untranslated (as it's now) or convert it to Unicode.
          > Don't know which one would work best...

          Me neither. But I think that a conversion to Unicode should be "fairly"
          straightforward, as it is what NT does natively anyway. This leads me to
          think that Vim should do the conversion, as it knows the encoding. Or
          let's say, it thinks it knows it. :)

          Cheers,
          Camillo
          --
          Camillo Särs <+ged+@...> ** Aim for the impossible and you
          <http://www.iki.fi/+ged> ** will achieve the improbable.
          PGP public key available **
        • Bram Moolenaar
          Message 4 of 29, Oct 14, 2003
            Glenn Maynard wrote:

            > On Mon, Oct 13, 2003 at 02:25:04PM +0200, Bram Moolenaar wrote:
            > > Because every fopen(), stat() etc. will have to be changed.
            >
            > I don't think handling Unicode in filenames is worth it in Windows. It
            > takes so much work that the only applications I know of that support it
            > are ones that are compiled as native Unicode apps. The only exception
            > I've seen is FB2k.
            >
            > It's certainly useful to be able to have multilingual filenames, but
            > Windows makes it so hard that people really wanting to do that probably
            > need a new OS.

            So, what you suggest is to keep using the ordinary file system
            functions. But we must make sure that the file name is then in the
            active codepage encoding. When obtaining the file name with a system
            function (e.g., a directory listing or file browser) it will already be
            in that encoding. But when the user types a file name it's in the
            encoding specified with 'encoding'. This means we would need to convert
            the file name from 'encoding' to the active codepage at some point.
            And the reverse conversion is needed when using a filename as a text
            string, e.g., for "%p and in the window title.

            This is still complicated, but probably requires less changes than using
            Unicode functions for all file access. I only foresee trouble when
            'encoding' is set to a non-Unicode codepage different from the active
            codepage and using a filename that contains non-ASCII characters.
            Perhaps this situation is too weird to take into account?

            > > I don't see why. You can use a file selector to open any file and write
            > > it back under the same name. Vim doesn't need to know the encoding of
            > > the filename that way.
            >
            > Consider the case where a filename in NT contains illegal data, eg. an
            > invalid two-byte SJIS sequence. When you call NT ANSI system calls, it
            > converts the buffers you pass it to WCHAR. That conversion would fail.
            >
            > Are you worried about not being able to open files off eg. a slightly
            > corrupt/malformed floppy disc containing filenames that won't convert
            > cleanly? That seems no worse than not being able to use non-ACP
            > filenames. If that works, it seems a poor trade for not being able to
            > enter non-ASCII filenames in utf-8. ":w $B4A;z(B.txt"
            > responding with '"$B4A;z(B.txt" [New]' and writing the
            > filename correctly seems pretty fundamental, for Japanese users on
            > Japanese systems, and that doesn't work with enc=utf-8.

            Yep, using conversions means failure is possible. And failure mostly
            means the text is in a different encoding than expected. It would take
            some time to figure out how to do this in a way that the user isn't
            confused.

            --
            hundred-and-one symptoms of being an internet addict:
            210. When you get a divorce, you don't care about who gets the children,
            but discuss endlessly who can use the email address.

            /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
            /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
            \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
            \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
          • Camillo Särs
            Message 5 of 29, Oct 14, 2003
              Bram Moolenaar wrote:
              > Glenn Maynard wrote:
              >>It's certainly useful to be able to have multilingual filenames, but
              >>Windows makes it so hard that people really wanting to do that probably
              >>need a new OS.
              >
              > So, what you suggest is to keep using the ordinary file system
              > functions. But we must make sure that the file name is then in the
              > active codepage encoding.

              While that may sound attractive at first, I would strongly dissuade from
              that solution. I consider it to be a myth that using multilingual
              filenames on Windows is hard. Under NT, it should be a breeze for any
              application that is even slightly Unicode-aware. When you decide to make
              changes in Vim, it makes sense to look to the future and try to go the
              "Unicode" way. XP Home Edition is gaining ground - fast.

              Win9x is a mess, because it's just a version of DOS on hormones, and thus
              is solidly entrenched in the single code page per application world. Using
              the current code page should suffice there, though.

              > This is still complicated, but probably requires less changes than using
              > Unicode functions for all file access.

              Why? I don't get it. You don't need to use Unicode functions for anything
              except stuff that accepts strings. The current implementation is wrong,
              because it feeds "encoding" text to ANSI functions. If you change it, I
              don't see why doing a conversion to Unicode would be any different than a
              conversion to ANSI, other than the fact that converting to ANSI is riskier.

              <http://www.microsoft.com/globaldev/> contains a lot of useful info. Quote:

              "All Win32 APIs that take a text argument either as an input or output
              variable have been provided with a generic function prototype and two
              definitions: a version that is based on code pages or ANSI (called "A") to
              handle code page-based text argument and a wide version (called "W") to
              handle Unicode."

              For 9x, you might be interested in the "Microsoft Layer for Unicode".

              > I only foresee trouble when 'encoding' is set to a non-Unicode
              > codepage different from the active codepage and using
              > a filename that contains non-ASCII characters.
              > Perhaps this situation is too weird to take into account?

              As long as you know the correct code page, you can use Windows APIs to
              convert correctly. They take the code page as an argument.

              Camillo
              --
              Camillo Särs <+ged+@...> ** Aim for the impossible and you
              <http://www.iki.fi/+ged> ** will achieve the improbable.
              PGP public key available **
            • Bram Moolenaar
              Message 6 of 29, Oct 14, 2003
                Camillo wrote:

                > While that may sound attractive at first, I would strongly dissuade from
                > that solution. I consider it to be a myth that using multilingual
                > filenames on Windows is hard. Under NT, it should be a breeze for any
                > application that is even slightly Unicode-aware. When you decide to make
                > changes in Vim, it makes sense to look to the future and try to go the
                > "Unicode" way. XP Home Edition is gaining ground - fast.

                Vim not only supports Unicode but also many other encodings. When Vim
                would only use Unicode it would be simple, but that's not the situation.
                And above that, Vim is also used on many other systems, and we try to
                make it work the same way everywhere.

                > > This is still complicated, but probably requires less changes than using
                > > Unicode functions for all file access.
                >
                > Why? I don't get it. You don't need to use Unicode functions for anything
                > except stuff that accepts strings. The current implementation is wrong,
                > because it feeds "encoding" text to ANSI functions. If you change it, I
                > don't see why doing a conversion to Unicode would be any different than a
                > conversion to ANSI, other than the fact that converting to ANSI is riskier.
                >
                > <http://www.microsoft.com/globaldev/> contains a lot of useful info. Quote:
                >
                > "All Win32 APIs that take a text argument either as an input or output
                > variable have been provided with a generic function prototype and two
                > definitions: a version that is based on code pages or ANSI (called "A") to
                > handle code page-based text argument and a wide version (called "W") to
                > handle Unicode."

                Eh, what happens when I use fopen() or stat()? There is no ANSI or wide
                version of these functions. And certainly not one that also works on
                non-Win32 systems. And when using the wide version conversion needs to
                be done from 'encoding' to Unicode, thus the conversion has to be there
                as well. That's going to be a lot of work (many #ifdefs) and will
                probably introduce new bugs.

                > For 9x, you might be interested in the "Microsoft Layer for Unicode"
                >
                > > I only foresee trouble when 'encoding' is set to a non-Unicode
                > > codepage different from the active codepage and using
                > > a filename that contains non-ASCII characters.
                > > Perhaps this situation is too weird to take into account?
                >
                > As long as you know the correct code page, you can use Windows APIs to
                > convert correctly. They take the code page as an argument.

                As mentioned before, we are not always sure what encoding the text has.
                Conversion is then likely to fail. This especially happens for 8-bit
                encodings, there is no way to automatically check what encoding these
                files are.

                I think we need a smart solution that doesn't attempt to handle all
                situations but works predictably.

                --
                hundred-and-one symptoms of being an internet addict:
                218. Your spouse hands you a gift wrapped magnet with your PC's name
                on it and you accuse him or her of genocide.

                /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///
              • Glenn Maynard
                Message 7 of 29, Oct 14, 2003
                  > While that may sound attractive at first, I would strongly dissuade from
                  > that solution. I consider it to be a myth that using multilingual
                  > filenames on Windows is hard. Under NT, it should be a breeze for any

                  It's not at all a myth if you want code that is 1: portable and 2: works
                  on 9x, too. (If you can deal with nonportable code, you can use Windows's
                  TCHAR mechanism, and if you don't care about anything but NT, you can write
                  a UTF-16-only app. Neither of these are the case here, though.)

                  It's not "hard", it's just "incredibly annoying".

                  On Tue, Oct 14, 2003 at 02:20:27PM +0200, Bram Moolenaar wrote:
                  > This is still complicated, but probably requires less changes than using
                  > Unicode functions for all file access. I only foresee trouble when
                  > 'encoding' is set to a non-Unicode codepage different from the active
                  > codepage and using a filename that contains non-ASCII characters.
                  > Perhaps this situation is too weird to take into account?

                  If "encoding" is not the ACP codepage, then the main problem is that the
                  user can enter characters that Vim simply can't put into a filename
                  (and in 9x, that the system can't, either).

                  I'd just do a conversion, and if the conversion fails, warn appropriately.

                  > Eh, what happens when I use fopen() or stat()? There is no ANSI or wide
                  > version of these functions. And certainly not one that also works on
                  > non-Win32 systems. And when using the wide version conversion needs to
                  > be done from 'encoding' to Unicode, thus the conversion has to be there
                  > as well. That's going to be a lot of work (many #ifdefs) and will
                  > probably introduce new bugs.

                  It's not that much work. Windows has _wfopen and _wstat. Vim already
                  has those abstracted (mch_fopen, mch_stat), so conversions would only
                  happen in one place (and in a place that's intended to be platform-
                  specific, mch_*). I believe the code I linked earlier did exactly this.

                  The only thing needed is sane error recovery.

                  > Yep, using conversions means failure is possible. And failure mostly
                  > means the text is in a different encoding than expected. It would take
                  > some time to figure out how to do this in a way that the user isn't
                  > confused.

                  Well, bear in mind the non-ACP case that already exists. If I create
                  "foo ♡.txt", and try to edit it with Vim, it edits "foo ?.txt" (which
                  it can't write, either, since "?" is an invalid character in Windows
                  filenames). I'd suggest that editing a file with an invalid character
                  (eg. invalid SJIS sequence) behave identically to editing a file with
                  a valid character that can't be referenced (eg. "foo ♡.txt").

                  --
                  Glenn Maynard
                • Camillo Särs
                  Message 8 of 29, Oct 14, 2003
                    Glenn Maynard wrote:
                    > If "encoding" is not the ACP codepage, then the main problem is that the
                    > user can enter characters that Vim simply can't put into a filename
                    > (and in 9x, that the system can't, either).
                    >
                    > I'd just do a conversion, and if the conversion fails, warn appropriately.

                    Agreed. There's no way around that.

                    > It's not that much work. Windows has _wfopen and _wstat. Vim already
                    > has those abstracted (mch_fopen, mch_stat), so conversions would only
                    > happen in one place (and in a place that's intended to be platform-
                    > specific, mch_*). I believe the code I linked earlier did exactly this.
                    >
                    > The only thing needed is sane error recovery.

                    Sounds very promising. It would be really great if it turns out that the
                    changes are fairly minor. That way there's a chance they would get
                    implemented. :)

                    If you decide to try the proposed changes out, I'm prepared to do some
                    testing on a Win32 binary build. Sorry, can't build myself. :(

                    Camillo
                    --
                    Camillo Särs <+ged+@...> ** Aim for the impossible and you
                    <http://www.iki.fi/+ged> ** will achieve the improbable.
                    PGP public key available **
                  • Bram Moolenaar
                    Message 9 of 29, Oct 15, 2003
                      Glenn Maynard wrote:

                      > On Tue, Oct 14, 2003 at 02:20:27PM +0200, Bram Moolenaar wrote:
                      > > This is still complicated, but probably requires less changes than using
                      > > Unicode functions for all file access. I only foresee trouble when
                      > > 'encoding' is set to a non-Unicode codepage different from the active
                      > > codepage and using a filename that contains non-ASCII characters.
                      > > Perhaps this situation is too weird to take into account?
                      >
                      > If "encoding" is not the ACP codepage, then the main problem is that the
                      > user can enter characters that Vim simply can't put into a filename
                      > (and in 9x, that the system can't, either).
                      >
                      > I'd just do a conversion, and if the conversion fails, warn appropriately.

                      It's more complicated than that. You can have filenames in the ACP,
                      'encoding' and Unicode. Filenames are stored in various places inside
                      Vim, which encoding is used for each of them? Obviously, a filename
                      stored in buffer text and registers has to use 'encoding'.

                      It's less obvious what to use for internal structures, such as
                      curbuf->b_ffname. When 'encoding' is a Unicode encoding we can use
                      UTF-8, that can be converted to anything else. That also works when the
                      active codepage is not Unicode, we can use the wide functions then.

                      When 'encoding' is the active codepage (this is the default, should
                      happen a lot), we can use the active codepage. That avoids conversions
                      (which may fail). No need to use wide functions then.

                      The real problem is when 'encoding' is not the active codepage and it's
                      also not a Unicode encoding. We could simply skip the conversion then.
                      That doesn't work properly for non-ASCII characters, but it's how it
                      already works right now. The right way would be to convert the file
                      name to Unicode and use the wide functions.

                      I guess this means all filenames inside Vim are in 'encoding'. Where
                      needed, conversion needs to be done from/to Unicode and the wide
                      functions are to be used then.

                      The main thing to implement now is using the wide functions when
                      'encoding' is UTF-8. This only requires a simple conversion between
                      UTF-8 and UTF-16. I'll be waiting for a patch...

                      --
                      hundred-and-one symptoms of being an internet addict:
                      231. You sprinkle Carpet Fresh on the rugs and put your vacuum cleaner
                      in the front doorway permanently so it always looks like you are
                      actually attempting to do something about that mess that has amassed
                      since you discovered the Internet.

                      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                      /// Creator of Vim - Vi IMproved -- http://www.Vim.org \\\
                      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
                      \\\ Help AIDS victims, buy here: http://ICCF-Holland.org/click1.html ///