Re: Woe with MBCS File Names in UTF-8 Mode on Windows

  • Bram Moolenaar
    Message 1 of 13 , Jul 1 2:41 AM
      Yongwei wrote:

      > > > BTW, the strange problem seems in the three Chinese characters.
      > > > `:e 测试.txt' and `:e 试件.txt' both are OK.
      > > > However, some other characters in the file name can become corrupt
      > > > when saving the file, e.g., 炜 (e7829c in UTF-8, ecbf in
      > > > GBK) will become ç? (c3a7 c282 in UTF-8). I have no clue how it
      > > > comes.
      > >
      > > I'm afraid I also don't know. Perhaps there is some problem with
      > > conversion from Unicode to your current codepage. This uses the
      > > MS-Windows library functions, thus it's not something I can fix.
      >
      > I did a trace into Vim, and I found that it was because the `9c' of
      > e7829c (炜) had been lost before mch_open is called. Could
      > this give you a clue? Or give me a guidance where I should
      > investigate further?

      I would guess that somewhere in the code the DBCS codepage is used to
      locate the character, instead of using it as UTF-8. Since I don't have
      a DBCS system, I can't try this.

      If you are able to see what happens in a debugger then you should be
      able to follow the route from typing the command to the mch_open() call.

      --
      Some of the well-known MS-Windows errors:
      ETIME Wrong time, wait a little while
      ECRASH Try again...
      EDETECT Unable to detect errors
      EOVER You lost! Play another game?
      ENOCLUE Eh, what did you want?

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ Project leader for A-A-P -- http://www.A-A-P.org ///
      \\\ Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html ///
    • adah@netstd.com
      Message 2 of 13 , Jul 1 2:59 AM
        > > I did a trace into Vim, and I found that it was because the `9c' of
        > > e7829c (炜) had been lost before mch_open is called. Could this
        > > give you a clue? Or give me a guidance where I should investigate
        > > further?
        >
        > I would guess that somewhere in the code the DBCS codepage is used to
        > locate the character, instead of using it as UTF-8. Since I don't
        > have a DBCS system, I can't try this.
        >
        > If you are able to see what happens in a debugger then you should be
        > able to follow the route from typing the command to the mch_open()
        > call.

        Since I was tracing mch_open out (not from outside in), I soon lost my
        way. And I was not familiar with the Vim code organization. That is
        the reason why I asked for guidance. I need a starting point to trace
        (where `:w file.txt' is really executed).

        And it is not difficult to change one's system into a DBCS one, as long
        as one has a Windows 2000/XP box with installation files/CD. Just
        install the Far East support and set the default code page in the
        Regional Setting.

        Best regards,

        Yongwei
      • Bram Moolenaar
        Message 3 of 13 , Jul 1 3:38 AM
          Yongwei wrote:

          > > > I did a trace into Vim, and I found that it was because the `9c' of
          > > > e7829c (炜) had been lost before mch_open is called. Could this
          > > > give you a clue? Or give me a guidance where I should investigate
          > > > further?
          > >
          > > I would guess that somewhere in the code the DBCS codepage is used to
          > > locate the character, instead of using it as UTF-8. Since I don't
          > > have a DBCS system, I can't try this.
          > >
          > > If you are able to see what happens in a debugger then you should be
          > > able to follow the route from typing the command to the mch_open()
          > > call.
          >
          > Since I was tracing mch_open out (not from outside in), I soon lost my
          > way. And I was not familiar with the Vim code organization. That is
          > the reason why I asked for guidance. I need a starting point to trace
          > (where `:w file.txt' is really executed).

          You can step out of mch_open() to see what happened in the calling
          function.

          If you need to step through the code that leads to opening the file you
          might want to put a breakpoint in open_buffer(). Check that
          curbuf->b_ffname is right. The file reading is done in readfile().

          --
          hundred-and-one symptoms of being an internet addict:
          202. You're amazed to find out Spam is a food.

        • adah@netstd.com
          Message 4 of 13 , Jul 1 8:53 PM
            Bram wrote:
            >
            > Yongwei wrote:
            >
            > > > > I did a trace into Vim, and I found that it was because the `9c'
            > > > > of e7829c (炜) had been lost before mch_open is called. Could
            > > > > this give you a clue? Or give me a guidance where I should
            > > > > investigate further?
            > > >
            > > > I would guess that somewhere in the code the DBCS codepage is used
            > > > to locate the character, instead of using it as UTF-8. Since I
            > > > don't have a DBCS system, I can't try this.
            > > >
            > > > If you are able to see what happens in a debugger then you should
            > > > be able to follow the route from typing the command to the
            > > > mch_open() call.
            > >
            > > Since I was tracing mch_open out (not from outside in), I soon lost
            > > my way. And I was not familiar with the Vim code organization.
            > > That is the reason why I asked for guidance. I need a starting
            > > point to trace (where `:w file.txt' is really executed).
            >
            > You can step out of mch_open() to see what happened in the calling
            > function.
            >
            > If you need to step through the code that leads to opening the file
            > you might want to put a breakpoint in open_buffer(). Check that
            > curbuf->b_ffname is right. The file reading is done in readfile().

            I have finally found out the reason. The cause is the _fullpath (which
            finally calls GetFullPathNameA) in mch_FullName. It is quite normal
            that the non-Unicode Win32 API requires that file names should be
            provided in native encoding.

            Non-DBCS-system users generally will not feel the problem since valid
            UTF-8 code points are generally valid SBCS (say, Latin1) code points,
            and 炜.txt will be regarded as code points |e7 82 9c 2e 74 78 74|. On
            DBCS systems, |9c2e| is invalid and will become `?' (|3f|).
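A minimal sketch of the byte-level effect described above, assuming the usual GBK (code page 936) ranges: lead byte 0x81..0xFE, trail byte 0x40..0xFE except 0x7F. The helper name `first_bad_gbk_pair` is made up for illustration; it is not part of Vim or of the Win32 API, and the real ANSI functions do more than this scan.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed GBK (code page 936) byte ranges, for illustration only. */
static int is_gbk_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

static int is_gbk_trail(unsigned char b)
{
    return b >= 0x40 && b <= 0xFE && b != 0x7F;
}

/* Hypothetical helper: walk the buffer as a GBK string and return the
 * offset of the first lead byte that is not followed by a valid trail
 * byte, or -1 if every byte pairs up cleanly. */
int first_bad_gbk_pair(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        if (is_gbk_lead(s[i])) {
            if (i + 1 >= n || !is_gbk_trail(s[i + 1]))
                return (int)i;      /* broken double-byte character */
            i += 2;                 /* consume lead + trail */
        } else {
            i += 1;                 /* single-byte character */
        }
    }
    return -1;
}
```

Fed the UTF-8 bytes |e7 82 9c 2e 74 78 74|, the scan pairs |e7 82|, then takes |9c| as a lead byte and rejects |2e| as its trail, so it reports offset 2. Fed the GBK bytes |ec bf 2e 74 78 74|, it finds nothing wrong.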

            To solve this problem, maybe Vim needs to provide its own version of
            fullpath? Bram, what is your opinion?

            Best regards,

            Yongwei
          • Bram Moolenaar
            Message 5 of 13 , Jul 2 4:15 AM
              Yongwei wrote:

              > I have finally found out the reason. The cause is the _fullpath (which
              > finally calls GetFullPathNameA) in mch_FullName. It is quite normal
              > that the non-Unicode Win32 API requires that file names should be
              > provided in native encoding.
              >
              > Non-DBCS-system users generally will not feel the problem since valid
              > UTF-8 code points are generally valid SBCS (say, Latin1) code points,
              > and 炜.txt will be regarded as code points |e7 82 9c 2e 74 78 74|. On
              > DBCS systems, |9c2e| is invalid and will become `?' (|3f|).
              >
              > To solve this problem, maybe Vim needs to provide its own version of
              > fullpath? Bram, what is your opinion?

              I'm glad you were able to isolate the problem.

              Vim 7 already included a fix for this. This has been tried out for a
              while now, thus I think it's safe to include in Vim 6.3. Please try out
              this patch. If it works OK for you then I'll release it.

              *** os_mswin.c~ Sun Dec 5 16:39:37 2004
              --- os_mswin.c Sat Jul 2 13:07:35 2005
              ***************
              *** 367,385 ****
                        nResult = mch_dirname(buf, len);
                    else
                #endif
              -         if (_fullpath(buf, fname, len - 1) == NULL)
                    {
              !         STRNCPY(buf, fname, len); /* failed, use the relative path name */
              !         buf[len - 1] = NUL;
              ! #ifndef USE_FNAME_CASE
              !         slash_adjust(buf);
                #endif
                    }
              -     else
              -         nResult = OK;

                #ifdef USE_FNAME_CASE
                    fname_case(buf, len);
                #endif

                    return nResult;
              --- 367,421 ----
                        nResult = mch_dirname(buf, len);
                    else
                #endif
                    {
              ! #ifdef FEAT_MBYTE
              !         if (enc_codepage >= 0 && (int)GetACP() != enc_codepage
              ! # ifdef __BORLANDC__
              !                 /* Wide functions of Borland C 5.5 do not work on Windows 98. */
              !                 && g_PlatformId == VER_PLATFORM_WIN32_NT
              ! # endif
              !            )
              !         {
              !             WCHAR *wname;
              !             WCHAR wbuf[MAX_PATH];
              !             char_u *cname = NULL;
              !
              !             /* Use the wide function:
              !              * - convert the fname from 'encoding' to UCS2.
              !              * - invoke _wfullpath()
              !              * - convert the result from UCS2 to 'encoding'.
              !              */
              !             wname = enc_to_ucs2(fname, NULL);
              !             if (wname != NULL && _wfullpath(wbuf, wname, MAX_PATH - 1) != NULL)
              !             {
              !                 cname = ucs2_to_enc((short_u *)wbuf, NULL);
              !                 if (cname != NULL)
              !                 {
              !                     STRNCPY(buf, cname, len);
              !                     buf[len - 1] = NUL;
              !                     nResult = OK;
              !                 }
              !             }
              !             vim_free(wname);
              !             vim_free(cname);
              !         }
              !         if (nResult == FAIL) /* fall back to non-wide function */
                #endif
              +         {
              +             if (_fullpath(buf, fname, len - 1) == NULL)
              +             {
              +                 STRNCPY(buf, fname, len); /* failed, use relative path name */
              +                 buf[len - 1] = NUL;
              +             }
              +             else
              +                 nResult = OK;
              +         }
                    }

                #ifdef USE_FNAME_CASE
                    fname_case(buf, len);
              + #else
              +     slash_adjust(buf);
                #endif

                    return nResult;
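The first step named in the patch comment, converting the name from 'encoding' to UCS2, can be illustrated with a small stand-alone decoder. This is only a sketch under stated assumptions: `utf8_to_ucs2` is a hypothetical name, it is not Vim's enc_to_ucs2(), it handles only 1- to 3-byte UTF-8 sequences (the BMP), and it does not check for overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the conversion step; NOT Vim's enc_to_ucs2().
 * Decodes 1- to 3-byte UTF-8 sequences into 16-bit units.
 * Returns the number of units written, or -1 on malformed or truncated
 * input, on a 4-byte sequence, or when `out` is too small. */
int utf8_to_ucs2(const unsigned char *in, size_t n,
                 unsigned short *out, size_t cap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < n) {
        unsigned char b = in[i];
        unsigned int cp;
        size_t extra, k;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else return -1;             /* stray continuation or 4-byte lead */

        if (extra > n - i - 1)
            return -1;              /* sequence cut short */
        for (k = 1; k <= extra; k++) {
            if ((in[i + k] & 0xC0) != 0x80)
                return -1;          /* not a continuation byte */
            cp = (cp << 6) | (in[i + k] & 0x3F);
        }
        if (o >= cap)
            return -1;              /* output buffer full */
        out[o++] = (unsigned short)cp;
        i += extra + 1;
    }
    return (int)o;
}
```

With the bytes |e7 82 9c 2e 74 78 74| this yields five units starting with 0x709C (炜) followed by 0x2E, so a wide function such as _wfullpath() sees one character where the ANSI function saw a broken double-byte pair. Passing only |e7 82|, the name with the `9c' lost, is rejected as truncated.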

              --
              hundred-and-one symptoms of being an internet addict:
              210. When you get a divorce, you don't care about who gets the children,
              but discuss endlessly who can use the email address.

            • adah@netstd.com
              Message 6 of 13 , Jul 3 10:12 PM
                Bram wrote:
                >
                > Yongwei wrote:
                >
                > > I have finally found out the reason. The cause is the _fullpath
                > > (which finally calls GetFullPathNameA) in mch_FullName. It is quite
                > > normal that the non-Unicode Win32 API requires that file names
                > > should be provided in native encoding.
                > >
                > > Non-DBCS-system users generally will not feel the problem since
                > > valid UTF-8 code points are generally valid SBCS (say, Latin1) code
                > > points, and 炜.txt will be regarded as code points |e7 82 9c 2e 74
                > > 78 74|. On DBCS systems, |9c2e| is invalid and will become `?'
                > > (|3f|).
                > >
                > > To solve this problem, maybe Vim needs to provide its own version of
                > > fullpath? Bram, what is your opinion?
                >
                > I'm glad you were able to isolate the problem.
                >
                > Vim 7 already included a fix for this. This has been tried out for a
                > while now, thus I think it's safe to include in Vim 6.3. Please try
                > out this patch. If it works OK for you then I'll release it.

                Yes, your patch works like a charm. Thanks, Bram!

                Best regards,

                Yongwei