windows and unicode filenames, etc.

  • Glenn Maynard
    Message 1 of 6 , Aug 4, 2002
      Just a quick update on other fixes:

      Editing files with Unicode in the filename that don't fit in the ANSI
      codepage doesn't work. Fixed, except for the browser, and except for
      renaming (since I don't really want to go near win32's mch_rename, but
      it does need fixing.)

      The IME didn't work when encoding=ucs2 (or other Unicode-but-not-utf-8
      encodings.) Fixed: use ucs2_to_penc. This also does away with
      CONV_UCS2_TO_DBCS and CONV_DBCS_TO_UCS2 completely.

      Treat all Windows codepages as DBCS, since there's no difference between
      DBCS and SBCS in Windows (SBCS is just DBCS with no lead-bytes). We
      talked about this, and I've hit more problems due to enc_dbcs not being
      set when encoding is set to an SBCS CP.

      Non-ASCII in the titlebar is problematic. I know the problem and the fix,
      but I haven't done this yet.

      I've added getacp() for Windows, to get the active codepage; this allows
      people to use the enc=utf-8;fencs=ucs-bom,utf-8,cp####,latin1;fenc=cp####
      setup without having to hardcode their codepage (which most people
      probably don't know.) This setup seems to have the desired effect:
      utf-8 internally, current codepage as a higher-priority option for files
      than the ambiguous latin1, and the current codepage as the default file
      encoding.
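
A minimal sketch of how the three option values in that setup could be filled in from a codepage number. The helper name build_enc_options() is hypothetical and not part of Vim; on Windows the number would come from GetACP(), but it is passed in as a parameter here so the sketch stays portable.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not Vim code): format the values for 'encoding',
 * 'fileencodings' and 'fileencoding' for a given ANSI codepage number.
 * Returns 0 on success, -1 if any output buffer is too small.
 */
int build_enc_options(unsigned acp,
                      char *enc, size_t enc_len,
                      char *fencs, size_t fencs_len,
                      char *fenc, size_t fenc_len)
{
    int n;

    /* utf-8 internally */
    n = snprintf(enc, enc_len, "utf-8");
    if (n < 0 || (size_t)n >= enc_len)
        return -1;

    /* current codepage is tried before the ambiguous latin1 */
    n = snprintf(fencs, fencs_len, "ucs-bom,utf-8,cp%u,latin1", acp);
    if (n < 0 || (size_t)n >= fencs_len)
        return -1;

    /* current codepage is the default for new files */
    n = snprintf(fenc, fenc_len, "cp%u", acp);
    if (n < 0 || (size_t)n >= fenc_len)
        return -1;

    return 0;
}
```

With a codepage of 1252, the three strings would be "utf-8", "ucs-bom,utf-8,cp1252,latin1" and "cp1252".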

      I'll probably revert removing the broken Korean stuff and just comment out
      the call for now; I doubt it's needed, but it's not important.

      I won't throw these patches at you yet. Instead, I'll probably be compiling
      these, describing the patches and the problems they fix better, and making
      binaries available to make them more accessible and try to get some testing
      done. Due to the scope of these changes, I suspect you'll want to wait a
      while on this.

      --
      Glenn Maynard
    • Bram Moolenaar
      Message 2 of 6 , Aug 5, 2002
        Glenn Maynard wrote:

        > Just a quick update on other fixes:
        >
        > Editing files with Unicode in the filename that don't fit in the ANSI
        > codepage doesn't work. Fixed, except for the browser, and except for
        > renaming (since I don't really want to go near win32's mch_rename, but
        > it does need fixing.)

        I thought it did work for some DBCS encodings. I did include patches
        for this in the past.

        > The IME didn't work when encoding=ucs2 (or other utf-8-but-not
        > encodings.) Fixed: use ucs2_to_penc. This also does away with
        > CONV_UCS2_TO_DBCS and CONV_DBCS_TO_UCS2 completely.

        You still need to do the conversions, right?

        > Treat all Windows codepages as DBCS, since there's no difference between
        > DBCS and SBCS in Windows (SBCS is just DBCS with no lead-bytes). We
        > talked about this, and I've hit more problems due to enc_dbcs not being
        > set when encoding is set to an SBCS CP.

        This has a big drawback: for DBCS codes finding the start of a character
        is complicated and slow. Don't want to use the same code for single
        byte encodings. There are quite a few other places where DBCS is
        handled much slower.
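
To illustrate the cost being described, here is a rough sketch of finding the start of a character in a double-byte encoding. It is not Vim's code: the lead-byte ranges are the ones commonly documented for cp932 (0x81-0x9F and 0xE0-0xFC), and dbcs_char_start() is a made-up name. Because a trail byte can have the same value as a lead byte, the scan has to begin at the start of the line, whereas a single-byte encoding could return the position itself immediately.

```c
#include <stddef.h>

/*
 * Illustrative lead-byte test for a cp932-like codepage.  For a
 * single-byte codepage this would return 0 for every byte.
 */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/*
 * Return the offset of the first byte of the character containing
 * byte `pos` of `line` (which is `len` bytes long).  The loop walks
 * forward from offset 0, so the cost grows with `pos`.
 */
size_t dbcs_char_start(const unsigned char *line, size_t len, size_t pos)
{
    size_t i = 0;

    while (i < len) {
        /* a lead byte at the very end of the line counts as one byte */
        size_t clen = (is_lead_byte(line[i]) && i + 1 < len) ? 2 : 1;

        if (pos < i + clen)
            return i;
        i += clen;
    }
    return pos;   /* pos is at or past the end of the line */
}
```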

        Isn't it easier to ignore enc_dbcs where the code needs to be used for
        both encodings?

        > Non-ASCII in the titlebar is problematic. I know the problem and the fix,
        > but I haven't done this yet.
        >
        > I've added getacp() for Windows, to get the active codepage; this allows
        > people to use the enc=utf-8;fencs=ucs-bom,utf-8,cp####,latin1;fenc=cp####
        > setup without having to hardcode their codepage (which most people
        > probably don't know.) This setup seems to have the desired effect:
        > utf-8 internally, current codepage as a higher-priority option for files
        > than the ambiguous latin1, and the current codepage as the default file
        > encoding.

        That sounds good, but we should look very carefully for any problems
        with backwards incompatibilities.

        > I'll probably revert removing the broken Korean stuff and just comment out
        > the call for now; I doubt it's needed, but it's not important.

        Still didn't find someone who can tell when the code is really needed?

        > I won't throw these patches at you yet. Instead, I'll probably be compiling
        > these, describing the patches and the problems they fix better, and making
        > binaries available to make them more accessible and try to get some testing
        > done. Due to the scope of these changes, I suspect you'll want to wait a
        > while on this.

        Good, I prefer including tested patches!

        --
        hundred-and-one symptoms of being an internet addict:
        108. While reading a magazine, you look for the Zoom icon for a better
        look at a photograph.

        /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
        /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
        \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
        \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
      • Glenn Maynard
        Message 3 of 6 , Aug 5, 2002
          On Mon, Aug 05, 2002 at 09:09:38PM +0200, Bram Moolenaar wrote:
          > > Editing files with Unicode in the filename that don't fit in the ANSI
          > > codepage doesn't work. Fixed, except for the browser, and except for
          > > renaming (since I don't really want to go near win32's mch_rename, but
          > > it does need fixing.)
          >
          > I thought it did work for some DBCS encodings. I did include patches
          > for this in the past.

          If encoding is set to the current codepage, it'll work: the paths are
          being sent directly to the system routines, unedited, and that's what
          the *A (ANSI) versions expect--the ANSI codepage.

          If encoding is set to anything else (including Unicode), it'll only work
          for ASCII, and will probably do something nonsensical for anything else.

          If encoding is set to the current codepage, it's impossible to represent
          filenames that don't fit in that codepage, too. (I can't edit files
          with Japanese in the filename, since my codepage is US.)

          This code does a penc->wchar conversion for all low-level file functions
          in Windows, and uses the wide version. (If it fails, it falls back on
          the ANSI version, for 9x; it also keeps track of the failure, so it
          doesn't do a needless conversion for every op when on a non-Unicode
          Windows.)
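
The fallback pattern described here can be sketched in portable C roughly as follows. Everything in it is hypothetical: file_op(), the function-pointer types, the OP_* return codes and the stubs stand in for the real wide and ANSI system calls, and treating only an "unsupported" result as the trigger for the fallback is an assumption (on Windows 9x this would presumably correspond to the wide call reporting that it is not implemented).

```c
#include <stddef.h>

enum { OP_OK = 0, OP_ERROR = -1, OP_UNSUPPORTED = -2 };

/* Hypothetical stand-ins for the converter and the two system calls. */
typedef int (*conv_fn)(const char *name, unsigned short *out, size_t outlen);
typedef int (*wide_fn)(const unsigned short *wname);
typedef int (*ansi_fn)(const char *name);

/* Set once the wide call has reported "unsupported", so that later
 * operations skip the needless conversion. */
static int wide_unavailable = 0;

int file_op(const char *name, conv_fn conv, wide_fn wide, ansi_fn ansi)
{
    unsigned short wbuf[260];
    int r;

    if (!wide_unavailable
            && conv(name, wbuf, sizeof wbuf / sizeof wbuf[0]) == 0) {
        r = wide(wbuf);
        if (r != OP_UNSUPPORTED)
            return r;              /* success or an ordinary error */
        wide_unavailable = 1;      /* remember: do not convert again */
    }
    return ansi(name);             /* ANSI fallback */
}

/* Stubs simulating a system without wide calls, with call counters. */
int conv_calls = 0, ansi_calls = 0;

int stub_conv(const char *name, unsigned short *out, size_t outlen)
{
    size_t i;

    conv_calls++;
    if (outlen == 0)
        return -1;
    for (i = 0; name[i] != '\0' && i + 1 < outlen; i++)
        out[i] = (unsigned char)name[i];   /* ASCII-only widening */
    out[i] = 0;
    return 0;
}

int stub_wide(const unsigned short *wname)
{
    (void)wname;
    return OP_UNSUPPORTED;
}

int stub_ansi(const char *name)
{
    (void)name;
    ansi_calls++;
    return OP_OK;
}
```

Calling file_op() twice with these stubs converts the name only on the first call; the second call goes straight to the ANSI stub.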

          (I don't think this is a speed hit: NT is converting everything to
          Unicode eventually anyway.)

          > > The IME didn't work when encoding=ucs2 (or other utf-8-but-not
          > > encodings.) Fixed: use ucs2_to_penc. This also does away with
          > > CONV_UCS2_TO_DBCS and CONV_DBCS_TO_UCS2 completely.
          >
          > You still need to do the conversions, right?

          It needs to convert from WCHAR (which is what we get from the IME) to
          the current encoding. ucs2_to_penc does this. (Maybe I should rename
          that to wchar_to_penc().)
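
As an illustration of the kind of conversion involved, the sketch below handles only the case where 'encoding' is utf-8: it encodes one UCS-2 code unit as UTF-8. It is not the ucs2_to_penc from the patch; the name ucs2_char_to_utf8() is made up, and surrogate pairs and all non-utf-8 target encodings are deliberately left out.

```c
#include <stddef.h>

/*
 * Encode one UCS-2 code unit as UTF-8.  Writes 1 to 3 bytes to `out`
 * (which must have room for 3) and returns the number of bytes written.
 * Surrogates are not treated specially.
 */
size_t ucs2_char_to_utf8(unsigned short c, unsigned char *out)
{
    unsigned int u = c & 0xFFFFu;   /* keep exactly 16 bits */

    if (u < 0x80) {
        out[0] = (unsigned char)u;
        return 1;
    }
    if (u < 0x800) {
        out[0] = (unsigned char)(0xC0 | (u >> 6));
        out[1] = (unsigned char)(0x80 | (u & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
    return 3;
}
```

For example, U+3042 (HIRAGANA LETTER A) comes out as the three bytes E3 81 82.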

          > This has a big drawback: for DBCS codes finding the start of a character
          > is complicated and slow. Don't want to use the same code for single
          > byte encodings. There are quite a few other places where DBCS is
          > handled much slower.
          >
          > Isn't it easier to ignore enc_dbcs where the code needs to be used for
          > both encodings?

          Well, I need to be able to know the codepage if encoding is set to one.
          This is easy if encoding=cp932, for example, but it's less easy if it's
          "2byte-cp932" or something like that.

          Perhaps there should be a single function, win_get_penc_codepage(),
          which does all of that parsing and returns the codepage (or -1 if it's
          not a codepage)?
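
A possible shape for such a function, as a sketch only. Unlike the function proposed above it takes the encoding name as an argument instead of reading the option, so it is given the hypothetical name parse_penc_codepage(). The "2byte-" prefix is the one mentioned in this message; handling "8bit-" the same way is an assumption.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the codepage number named by an encoding string such as
 * "cp932" or "2byte-cp932", or -1 if the string does not name a
 * codepage.
 */
int parse_penc_codepage(const char *enc)
{
    const char *p = enc;
    char *end;
    long cp;

    /* skip an optional width prefix */
    if (strncmp(p, "2byte-", 6) == 0)
        p += 6;
    else if (strncmp(p, "8bit-", 5) == 0)
        p += 5;

    /* require "cp" followed directly by a digit */
    if (strncmp(p, "cp", 2) != 0 || !isdigit((unsigned char)p[2]))
        return -1;

    cp = strtol(p + 2, &end, 10);
    if (*end != '\0' || cp <= 0 || cp > 65535)
        return -1;      /* trailing junk or out of range */
    return (int)cp;
}
```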

          Also, the is_funky_dbcs code in the win32 renderer should use this, too,
          since it needs to do the same thing. (Render with Unicode conversion if
          win_get_penc_codepage() != GetACP(); then is_funky_dbcs can probably go
          away, too, since nothing else uses it.)

          > > I've added getacp() for Windows, to get the active codepage; this allows
          > > people to use the enc=utf-8;fencs=ucs-bom,utf-8,cp####,latin1;fenc=cp####
          > > setup without having to hardcode their codepage (which most people
          > > probably don't know.) This setup seems to have the desired effect:
          > > utf-8 internally, current codepage as a higher-priority option for files
          > > than the ambiguous latin1, and the current codepage as the default file
          > > encoding.
          >
          > That sounds good, but we should look very carefully for any problems
          > with backwards incompatibilities.

          That's why I added getacp: so I can supply a stock set of commands to
          set up this layout. This way it can be tested first without actually
          making it the default.

          > > I'll probably revert removing the broken Korean stuff and just comment out
          > > the call for now; I doubt it's needed, but it's not important.
          >
          > Still didn't find someone who can tell when the code is really needed?

          Can you contact the person named in the code? I can't find him in the
          archives at all. I still suspect it's no longer needed, due to the
          newer IME fixes, and the Korean IME does work for me, but I don't know
          about e.g. older Korean IMEs from 9x. All that code does is poll the IME
          when the cursor blinks, and print whatever's in there on the cursor;
          since the IME displays the character automatically, there's no need for
          this. (But before the new IME code, this may not have worked.)

          I don't know about the weird fake-backslash code. I can see why it was
          wanted: MS Korean fonts actually do apparently have a Won sign on \, which
          I'd imagine Korean users might not want. If you want, I can try to make
          this code work for now, and add an option for this. (I think it should
          be replaced completely at some point, as I've mentioned, but I don't
          expect to have that ready soon, since I need to figure out how to retrofit
          that without being overly intrusive. Also, since it's a nontrivial
          block of code, I'd much rather wait until the current stuff is settled,
          or the diff is going to get unmanageable and there'll be too much to test
          properly.)

          --
          Glenn Maynard
        • Bram Moolenaar
          Message 4 of 6 , Aug 5, 2002
            Glenn Maynard wrote:

            > On Mon, Aug 05, 2002 at 09:09:38PM +0200, Bram Moolenaar wrote:
            > > > Editing files with Unicode in the filename that don't fit in the ANSI
            > > > codepage doesn't work. Fixed, except for the browser, and except for
            > > > renaming (since I don't really want to go near win32's mch_rename, but
            > > > it does need fixing.)
            > >
            > > I thought it did work for some DBCS encodings. I did include patches
            > > for this in the past.
            >
            > If encoding is set to the current codepage, it'll work: the paths are
            > being sent directly to the system routines, unedited, and that's what
            > the *A (ANSI) versions expect--the ANSI codepage.
            >
            > If encoding is set to anything else (including Unicode), it'll only work
            > for ASCII, and will probably do something nonsensical for anything else.
            >
            > If encoding is set to the current codepage, it's impossible to represent
            > filenames that don't fit in that codepage, too. (I can't edit files
            > with Japanese in the filename, since my codepage is US.)

            I think it so far only worked for text in the system codepage. When
            setting 'encoding' to something else I would guess we don't convert,
            thus you end up with nonsense. Converting the title to Unicode should
            work (if the wide version of the function is available, might not be
            true on Win 9x).

            > > This has a big drawback: for DBCS codes finding the start of a character
            > > is complicated and slow. Don't want to use the same code for single
            > > byte encodings. There are quite a few other places where DBCS is
            > > handled much slower.
            > >
            > > Isn't it easier to ignore enc_dbcs where the code needs to be used for
            > > both encodings?
            >
            > Well, I need to be able to know the codepage if encoding is set to one.
            > This is easy if encoding=cp932, for example, but it's less easy if it's
            > "2byte-cp932" or something like that.

            Ah, you are running into the problem that enc_dbcs is both used as a
            flag that DBCS encoding is being used and the number of the codepage
            used for 'encoding'. We could separate the two to avoid confusion.
            Introduce enc_codepage perhaps?

            > Perhaps there should be a single function, win_get_penc_codepage(),
            > which does all of that parsing and returns the codepage (or -1 if it's
            > not a codepage)?

            Since 'encoding' doesn't change very often this could be done once and
            stored in a global variable, just like enc_utf8 and enc_dbcs.
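
A sketch of what separating the flag from the codepage number, and computing both once, might look like. enc_utf8 and enc_dbcs are named in this thread and enc_codepage is the variable being proposed, but the semantics given to them here, the small lookup table, and the functions encoding_changed() and needs_unicode_render() are all assumptions made for illustration.

```c
#include <string.h>

/* Globals updated only when 'encoding' changes. */
int enc_utf8 = 0;        /* 'encoding' is utf-8 */
int enc_dbcs = 0;        /* flag only: 'encoding' is double-byte */
int enc_codepage = -1;   /* codepage number, or -1 if not a codepage */

/* Small illustrative table; a real list would be much longer. */
static const struct {
    const char *name;
    int codepage;
    int is_dbcs;
} known_encs[] = {
    { "cp932",  932,  1 },   /* Japanese, double-byte */
    { "cp949",  949,  1 },   /* Korean, double-byte */
    { "cp1252", 1252, 0 },   /* Western European, single-byte */
};

/* Call once whenever 'encoding' is set; other code then reads the
 * globals instead of re-parsing the option string. */
void encoding_changed(const char *enc)
{
    size_t i;

    enc_utf8 = (strcmp(enc, "utf-8") == 0);
    enc_dbcs = 0;
    enc_codepage = -1;
    for (i = 0; i < sizeof known_encs / sizeof known_encs[0]; i++) {
        if (strcmp(enc, known_encs[i].name) == 0) {
            enc_codepage = known_encs[i].codepage;
            enc_dbcs = known_encs[i].is_dbcs;
            break;
        }
    }
}

/* Render through a Unicode conversion whenever 'encoding' is not the
 * system codepage (on Windows, system_acp would come from GetACP()). */
int needs_unicode_render(int system_acp)
{
    return enc_codepage != system_acp;
}
```

With this split, a single-byte codepage such as cp1252 gets a valid enc_codepage while enc_dbcs stays 0, so the slower double-byte paths are not taken.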

            > Also, the is_funky_dbcs code in the win32 renderer should use this, too,
            > since it needs to do the same thing. (Render with Unicode conversion if
            > win_get_penc_codepage() != GetACP(); then is_funky_dbcs can probably go
            > away, too, since nothing else uses it.)

            If GetACP() is really fast, then is_funky_dbcs becomes obsolete.
            Otherwise, I thought you were planning to rename it anyway.

            > > > I'll probably revert removing the broken Korean stuff and just comment out
            > > > the call for now; I doubt it's needed, but it's not important.
            > >
            > > Still didn't find someone who can tell when the code is really needed?
            >
            > Can you contact the person named in the code? I can't find him in the
            > archives at all. I still suspect it's no longer needed, due to the
            > newer IME fixes, and the Korean IME does work for me, but I don't know
            > about e.g. older Korean IMEs from 9x. All that code does is poll the IME
            > when the cursor blinks, and print whatever's in there on the cursor;
            > since the IME displays the character automatically, there's no need for
            > this. (But before the new IME code, this may not have worked.)

            I last received a message from Sung-Hoon Baek in 1998...
            Hopefully another Korean can help us here! Namsh?

            > I don't know about the weird fake-backslash code. I can see why it was
            > wanted: MS Korean fonts actually do apparently have a Won sign on \, which
            > I'd imagine Korean users might not want. If you want, I can try to make
            > this code work for now, and add an option for this. (I think it should
            > be replaced completely at some point, as I've mentioned, but I don't
            > expect to have that ready soon, since I need to figure out how to retrofit
            > that without being overly intrusive. Also, since it's a nontrivial
            > block of code, I'd much rather wait until the current stuff is settled,
            > or the diff is going to get unmanageable and there'll be too much to test
            > properly.)

            Even though your reasons sound sensible, I'm a bit careful about
            throwing out code that nobody complained about.

            --
            "A clear conscience is usually the sign of a bad memory."
            -- Steven Wright

            /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
            /// Creator of Vim -- http://vim.sf.net -- ftp://ftp.vim.org/pub/vim \\\
            \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
            \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
          • Glenn Maynard
            Message 5 of 6 , Aug 5, 2002
              On Mon, Aug 05, 2002 at 03:50:46PM -0400, Glenn Maynard wrote:
              > > Isn't it easier to ignore enc_dbcs where the code needs to be used for
              > > both encodings?
              >
              > Perhaps there should be a single function, win_get_penc_codepage(),
              > which does all of that parsing and returns the codepage (or -1 if it's
              > not a codepage)?
              >
              > Also, the is_funky_dbcs code in the win32 renderer should use this, too,
              > since it needs to do the same thing. (Render with Unicode conversion if
              > win_get_penc_codepage() != GetACP(); then is_funky_dbcs can probably go
              > away, too, since nothing else uses it.)

              Eh. I'll just add enc_codepage.

              --
              Glenn Maynard
            • Glenn Maynard
              Message 6 of 6 , Aug 5, 2002
                On Mon, Aug 05, 2002 at 10:17:17PM +0200, Bram Moolenaar wrote:
                > I think it so far only worked for text in the system codepage. When
                > setting 'encoding' to something else I would guess we don't convert,
                > thus you end up with nonsense. Converting the title to Unicode should
                > work (if the wide version of the function is available, might not be
                > true on Win 9x).

                Right. That's what I'm doing, with a fallback for 9x.

                > Ah, you are running into the problem that enc_dbcs is both used as a
                > flag that DBCS encoding is being used and the number of the codepage
                > used for 'encoding'. We could separate the two to avoid confusion.
                > Introduce enc_codepage perhaps?

                We have the same idea. I'll do this.

                > > Also, the is_funky_dbcs code in the win32 renderer should use this, too,
                > > since it needs to do the same thing. (Render with Unicode conversion if
                > > win_get_penc_codepage() != GetACP(); then is_funky_dbcs can probably go
                > > away, too, since nothing else uses it.)
                >
                > If GetACP() is really fast, then is_funky_dbcs becomes obsolete.
                > Otherwise, I thought you were planning to rename it anyway.

                I think it's just returning a system-wide constant, but I'll run a quick
                speed check anyway.

                --
                Glenn Maynard