
Re: Failed to drag&drop-open a file with wide-chars in its filename

  • John (Eljay) Love-Jensen
    Message 1 of 12, Jun 23, 2009
      Hi Björn,

      > As far as I can tell (from searching around) HFS+ always uses
      > normalization form D (NFD) for filenames.

      HFS+ uses a variant of NFD for filenames. (The HFS+ variant predates
      standardization of NFD.) This requirement is enforced by the OS.

      http://developer.apple.com/technotes/tn/tn1150.html
      http://developer.apple.com/technotes/tn/tn1150table.html
      http://developer.apple.com/qa/qa2001/qa1235.html
      http://www.unicode.org/reports/tr15/
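
As a rough illustration of why the normalization form matters for filenames, here is a small Python sketch using the standard `unicodedata` module. Python and the sample name `Björn.txt` are choices made for this example, not something from the thread, and the code applies standard NFD, which is close to but not identical with the HFS+ variant described in TN1150.

```python
import unicodedata

name = "Bj\u00f6rn.txt"  # hypothetical sample filename with a precomposed "ö"

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)  # standard NFD, not the HFS+ variant

# "ö" is one code point in NFC and two (o + combining diaeresis) in NFD.
print(len(nfc), len(nfd))
print(nfc.encode("utf-8"))
print(nfd.encode("utf-8"))

# The two strings differ code point by code point, so a program that
# compares or displays filenames has to normalize both sides first.
print(nfc == nfd)
print(unicodedata.normalize("NFC", nfd) == nfc)
```

A program that receives the NFD bytes and draws each code point separately will show the base letter and the combining mark as separate cells, which matches the kind of rendering problem described in this thread.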

      Windows uses NFC for filenames. I'm not sure if the Linux world settled on
      NFC or NFK.

      Amiga OS (at least the one I used) is ECMA 94 Latin 1 based (precursor to
      ISO 8859-1).

      > So as a workaround for the issue the OP had I now normalize filenames
      > to compatibility form C (NFKC) before passing the filename on to Vim
      > and this takes care of the OP's problem.

      NFC or NFKC? Those are different normalizations.

      Windows NTFS file system uses NFC. But it isn't enforced by the OS, yet.

      > However, as I see it this really is a legitimate issue in Vim itself
      > in that it does not handle NFD properly (the example above should
      > always render as one glyph, not three as it does now if NFD is used).
      > Either Vim should ensure that all buffers are normalized to composed
      > form NFC/NFKC or it needs to be made "NFD aware".

      I agree with your assessment.

      > Does anybody on the vim_multibyte list (this mail goes to vim_mac as
      > well) have any comments on this?

      The relevant Mac OS X APIs are:

      CFURLRef url =
          CFURLCreateWithFileSystemPath(
              kCFAllocatorDefault,
              cfstringFullPath,
              kCFURLPOSIXPathStyle,
              false); // isDirectory

      char bufferUTF8[32768*4]; // Worst case scenario.
      // As per Apple documentation, paths can be "up to 30,000 UTF-16
      // encoding units long", with each component being up to 255 UTF-16
      // encoding units long. Too bad there isn't an API to specify the
      // exact buffer size /a priori/.

      Boolean success =
          CFURLGetFileSystemRepresentation(
              url,
              true, // resolveAgainstBase
              (UInt8 *)&bufferUTF8[0],
              sizeof bufferUTF8);

      Sincerely,
      --Eljay


      --~--~---------~--~----~------------~-------~--~----~
      You received this message from the "vim_multibyte" maillist.
      For more information, visit http://www.vim.org/maillist.php
      -~----------~----~----~----~------~----~------~--~---
    • John (Eljay) Love-Jensen
      Message 2 of 12, Jun 23, 2009
        > Windows uses NFC for filenames. I'm not sure if the Linux world settled on
        > NFC or NFK.

        I meant: ... NFC or NFD.

        Fat fingers.

        --Eljay


      • Andrew Dunbar
        Message 3 of 12, Jun 23, 2009
          2009/6/23 John (Eljay) Love-Jensen <eljay@...>:
          >
          > Hi Björn,
          >
          >> As far as I can tell (from searching around) HFS+ always uses
          >> normalization form D (NFD) for filenames.
          >
          > HFS+ uses a variant of NFD for filenames.  (The HFS+ variant predates
          > standardization of NFD.)  This requirement is enforced by the OS.
          >
          > http://developer.apple.com/technotes/tn/tn1150.html
          > http://developer.apple.com/technotes/tn/tn1150table.html
          > http://developer.apple.com/qa/qa2001/qa1235.html
          > http://www.unicode.org/reports/tr15/
          >
          > Windows uses NFC for filenames.  I'm not sure if the Linux world settled on
          > NFC or NFK.

          When I worked on AbiWord a few years ago Linux left filename encoding
          up to the filesystem and the user. This may have changed since...

          Linux supports many filesystems, including Windows and Mac filesystems.
          For filesystems which mandate a specific encoding, Linux should follow
          those rules. For older filesystems the encoding would generally be the
          encoding of the OS, but Linux, like Unix, is a multiuser OS and may have
          various users using various languages in various encodings. Each user
          gets to decide their language and encoding through environment
          variables such as LANG, LC_ALL, LC_COLLATE etc. These vary by vintage
          of the OS and may well vary for other Unixes too, such as FreeBSD.

          I think Linux generally uses extN filesystems by default. When I was
          last working with it that was ext2, but ext3 has now been in use for
          some time and ext4 is the current iteration, which may or may not be in
          general release. The ext3 or ext4 filesystems may mandate an encoding
          that ext2 did not.

          The general solution for the Unix/Linux world may be to honour the
          user's locale settings and assume that the filesystem software will
          convert to any specifically mandated encoding it requires when you
          call the standard APIs such as open().
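
To make the "honour the user's locale settings" idea concrete, here is a minimal Python sketch of how the character-encoding locale is usually derived from the environment. The function names are made up for this example. The precedence shown (LC_ALL, then LC_CTYPE, then LANG) is the POSIX rule; what codeset a locale name without an explicit codeset implies is system dependent, so the sketch returns None in that case rather than guessing.

```python
import os

def effective_ctype_locale(env=None):
    """Return the locale name POSIX would use for character classification."""
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # the POSIX default locale

def codeset_of(locale_name):
    """Extract the codeset from e.g. 'sv_SE.UTF-8@euro'; None if not given."""
    if "." not in locale_name:
        return None
    codeset = locale_name.split(".", 1)[1]
    return codeset.split("@", 1)[0]  # drop an optional @modifier

user_env = {"LANG": "en_US.UTF-8", "LC_ALL": "sv_SE.ISO8859-1"}  # sample values
print(effective_ctype_locale(user_env))
print(codeset_of(effective_ctype_locale(user_env)))
```

In a real program the C library does this work (for example through `setlocale` and `nl_langinfo(CODESET)`), so the sketch is only meant to show which variable wins.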

          But further research is definitely recommended!

          Andrew Dunbar.





          --
          http://wiktionarydev.leuksman.com http://linguaphile.sf.net

        • Nico Weber
          Message 4 of 12, Jun 23, 2009
            >> Windows uses NFC for filenames. I'm not sure if the Linux world
            >> settled on
            >> NFC or NFK.
            >
            > When I worked on AbiWord a few years ago Linux left filename encoding
            > up to the filesystem and the user. This may have changed since...


            I'm pretty sure it hasn't. As far as I know, for Linux a filename is
            just a bunch of bytes, and you only need to know the encoding for
            lesser tasks such as file name display anyway ;-) In that case, the
            recommended way is to get the encoding from an environment variable.
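
A small Python sketch of the "bytes for the kernel, encoding only for display" split described above. The helper name and the sample filename are invented for this example, and the encoding is passed in explicitly where a real program would take it from the user's locale.

```python
def display_name(raw_name: bytes, encoding: str = "utf-8") -> str:
    """Decode a raw filename for display only.

    Bytes that are invalid in `encoding` are shown as U+FFFD instead of
    raising, because the name on disk is allowed to be any byte string.
    """
    return raw_name.decode(encoding, errors="replace")

raw = b"Bj\xf6rn.txt"  # hypothetical name created by a user with a Latin-1 locale

print(display_name(raw, "latin-1"))  # decodes cleanly
print(display_name(raw, "utf-8"))    # 0xF6 is not valid UTF-8 here

# For open(), stat() and similar calls, keep passing the original bytes
# object `raw`; the decoded string is for the user interface only.
```

The point of the design is that a wrong guess about the encoding then only produces an odd-looking label and never a failure to open the file.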

            Nico

          • björn
            Message 5 of 12, Jun 24, 2009
              Hi Eljay,

              2009/6/23 John (Eljay) Love-Jensen:
              >
              >> As far as I can tell (from searching around) HFS+ always uses
              >> normalization form D (NFD) for filenames.
              >
              > HFS+ uses a variant of NFD for filenames.  (The HFS+ variant predates
              > standardization of NFD.)  This requirement is enforced by the OS.
              >
              > http://developer.apple.com/technotes/tn/tn1150.html
              > http://developer.apple.com/technotes/tn/tn1150table.html
              > http://developer.apple.com/qa/qa2001/qa1235.html
              > http://www.unicode.org/reports/tr15/

              Thanks for clarifying that (and for the links!).

              > Windows uses NFC for filenames.  I'm not sure if the Linux world settled on
              > NFC or NFK.

              I read that Windows uses NFKC. Have you got a reference for the claim
              that NFC is used?

              >> So as a workaround for the issue the OP had I now normalize filenames
              >> to compatibility form C (NFKC) before passing the filename on to Vim
              >> and this takes care of the OP's problem.
              >
              > NFC or NFKC?  Those are different normalizations.
              >
              > Windows NTFS file system uses NFC.  But it isn't enforced by the OS, yet.

              I did mean the compatibility form NFKC since I read somewhere that
              NTFS uses NFKC, but I did not research that very carefully.


              >> However, as I see it this really is a legitimate issue in Vim itself
              >> in that it does not handle NFD properly (the example above should
              >> always render as one glyph, not three as it does now if NFD is used).
              >> Either Vim should ensure that all buffers are normalized to composed
              >> form NFC/NFKC or it needs to be made "NFD aware".
              >
              > I agree with your assessment.
              >
              >> Does anybody on the vim_multibyte list (this mail goes to vim_mac as
              >> well) have any comments on this?
              >
              > The relevant Mac OS X routine APIs are:
              >
              > CFURLRef url =
              > CFURLCreateWithFileSystemPath(
              >  kCFAllocatorDefault,
              >  cfstringFullPath,
              >  kCFURLPOSIXPathStyle,
              >  false);
              >
              > char bufferUTF8[32768*4]; // Worst case scenario.
              > // As per Apple documentation, paths can be "up to 30,000 UTF-16
              > // encoding units long", with each component being up to 255 UTF-16
              > // encoding units long.  Too bad there isn't an API to specify the
              > // exact buffer size /a priori/.
              >
              > Boolean success =
              > CFURLGetFileSystemRepresentation(
              >  url,
              >  true,
              >  &bufferUTF8[0],
              >  sizeof bufferUTF8);

              Thanks. NSString has a method called fileSystemRepresentation which
              I'm guessing does the same thing(?). I used the NSString method
              precomposedStringWithCompatibilityMapping to convert to NFKC.
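
For readers without Cocoa at hand, the workaround can be sketched in Python. The table lists the four NSString normalization methods and the Unicode form each one corresponds to; the function name, the default form (NFKC, as used in the workaround described here) and the sample path are assumptions made for this example.

```python
import unicodedata

# NSString method name -> Unicode normalization form it produces.
NSSTRING_FORMS = {
    "precomposedStringWithCanonicalMapping": "NFC",
    "decomposedStringWithCanonicalMapping": "NFD",
    "precomposedStringWithCompatibilityMapping": "NFKC",
    "decomposedStringWithCompatibilityMapping": "NFKD",
}

def normalize_dropped_path(path, form="NFKC"):
    """Normalize a path received from drag and drop before passing it on."""
    return unicodedata.normalize(form, path)

dropped = "/Users/someone/Bjo\u0308rn.txt"  # hypothetical NFD path from the OS
fixed = normalize_dropped_path(dropped)

print(len(dropped), len(fixed))  # the composed path is one code point shorter
```

Passing `form="NFC"` instead composes the same way for this path but leaves compatibility characters untouched, which is the distinction raised elsewhere in the thread.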

              Björn

            • John (Eljay) Love-Jensen
              Message 6 of 12, Jun 24, 2009
                Hi Björn,

                > I read that Windows uses NFKC. Have you got a reference for the claim
                > that NFC is used?

                Drat, I cannot find the MSDN reference. Maybe my memory has failed me.

                NFKC is lossy. NFC is non-lossy.
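
One way to read "lossy" here, sketched in Python with the standard `unicodedata` module: NFC keeps a string canonically equivalent to the original, while NFKC may replace compatibility characters with different ones. The helper name and the sample filename are made up for this illustration.

```python
import unicodedata

def stays_equivalent(form, text):
    """True if normalizing `text` to `form` keeps it canonically equivalent."""
    normalized = unicodedata.normalize(form, text)
    # Two strings are canonically equivalent when their NFD forms are equal.
    return (unicodedata.normalize("NFD", normalized)
            == unicodedata.normalize("NFD", text))

name = "\ufb01le.txt"  # begins with the single ligature character U+FB01

print(stays_equivalent("NFC", name))
print(stays_equivalent("NFKC", name))
print(unicodedata.normalize("NFKC", name))  # the ligature becomes "f" + "i"
```

After NFKC there is no way to tell whether the original name contained the ligature or the two plain letters, which is the information that gets lost.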

                Perhaps you are remembering the security information:
                http://msdn.microsoft.com/en-us/library/dd374047(VS.85).aspx#SC_Unicode

                File Names, Paths, and Namespaces information is here:
                http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx

                Note that modern UNC paths (starting with "\\?\" for paths or "\\.\"
                for volumes, such as "\\?\C:\Dir\Sub\File.ext", and up to 32,767
                UTF-16 encoding units on Vista, or UCS-2 characters on XP, using a
                16-bit encoding of Unicode) are different from older "short" UNC
                paths (DOS-era limit of 260 8-bit characters, dependent on the OS
                code page setting).

                NFC is mentioned here in an MSDN blog:
                http://blogs.msdn.com/michkap/archive/2006/12/07/1232365.aspx

                But I don't consider that canonical, since it was in a blog feedback
                comment.

                I asked for clarification on the MSDN "File Names, Paths, and Namespaces"
                page, in the comments section.

                NOTE: "short" UNC and "old" DOS-style paths have to abide by the OS
                code page setting, even when using the FooW routines and wchar_t
                (16-bit) paths.

                > Thanks. NSString has a method called fileSystemRepresentation which
                > I'm guessing does the same thing(?). I used the NSString method
                > precomposedStringWithCompatibilityMapping to convert to NFKC.

                I presume so. My Cocoa experience is not as extensive as my Carbon
                experience.

                Sincerely,
                --Eljay


              • Tony Mechelynck
                Message 7 of 12, Jun 24, 2009
                  On 24/06/09 14:00, björn wrote:
                  > Thanks. NSString has a method called fileSystemRepresentation which
                  > I'm guessing does the same thing(?). I used the NSString method
                  > precomposedStringWithCompatibilityMapping to convert to NFKC.
                  >
                  > Björn

                  Hm, NFKC and NFKD sometimes fuse slightly different glyphs into a single
                  "normalized" form. For instance, NFKC(²) = 2, though both are
                  (different) Latin1 characters (0xB2 and 0x32). IIRC, DOS would have kept
                  them distinct.
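
The practical consequence for filenames can be sketched in Python: two names that are distinct on disk may become identical after compatibility normalization. The function name and the sample names are invented for this example.

```python
import unicodedata
from collections import defaultdict

def find_collisions(names, form):
    """Group names that become identical after normalizing to `form`."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    # Only groups with more than one member are collisions.
    return [group for group in groups.values() if len(group) > 1]

names = ["x\u00b2.txt", "x2.txt", "notes.txt"]  # U+00B2 is SUPERSCRIPT TWO

print(find_collisions(names, "NFKC"))  # the first two names collide
print(find_collisions(names, "NFC"))   # no collisions
```

A program that normalizes paths to NFKC before opening them could therefore open the wrong one of two such files, which is an argument for preferring NFC where composition is all that is needed.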

                  Best regards,
                  Tony.
                  --
                  hundred-and-one symptoms of being an internet addict:
                  56. You leave the modem speaker on after connecting because you think it
                  sounds like the ocean wind...the perfect soundtrack for "surfing
                  the net".
