Re: multibyte in patterns

  • Bram Moolenaar
    Message 1 of 9 , Jan 1, 2003
      Benji Fisher wrote:

      > Thanks for the reply. This has been bothering me for a while, and
      > I do not think anyone else can help with it.
      >
      > The problem is not caused by changing 'encoding' after assigning
      > the variable. I can try
      >
      > :set encoding?
      > utf-8
      > :let foo = "\xab"
      > :echo foo =~ foo
      > 0
      >
      > I would like to have a script insert a character like the one
      > given by the digraph "<C-K><<". According to "ga" (Normal mode) this
      > character has code 0xab , so I try
      >
      > :let foo = "\xab"
      > :put=foo
      >
      > and this inserts a single character that looks like "<ab>". Then I try
      >
      > :let foo = iconv("\xab", "latin1", &enc)
      > :put=foo
      >
      > and I get the "<<" digraph. Strangely, "ga" tells me these are both
      > "Hex 00ab".
      >
      > Are the tricks with iconv() supposed to work? Is there a simpler way?

      "ga" shows the ascii value, but it doesn't handle an illegal byte
      differently, thus a single "ab" byte will show "00ab" and the two-byte
      UTF-8 sequence for the "ab" character will also show "ab". "g8" shows
      what's really there.
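
To illustrate why a lone 0xab byte counts as illegal when 'encoding' is utf-8, here is a rough C sketch of a structural check. The name utf8_structure_ok is made up for this example and is not part of Vim. It only looks at lead and continuation bytes, so it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from Vim): returns 1 when buf[0..len) has the
 * lead-byte / continuation-byte structure of UTF-8, 0 otherwise.
 * Simplified: overlong forms and surrogates are not rejected. */
int utf8_structure_ok(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    size_t k;

    assert(buf != NULL);
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;    /* number of continuation bytes expected */

        if (b < 0x80)
            need = 0;               /* plain ASCII */
        else if ((b & 0xE0) == 0xC0)
            need = 1;               /* start of a two-byte sequence */
        else if ((b & 0xF0) == 0xE0)
            need = 2;               /* start of a three-byte sequence */
        else if ((b & 0xF8) == 0xF0)
            need = 3;               /* start of a four-byte sequence */
        else
            return 0;               /* stray continuation or invalid lead */

        if (need > len - i - 1)
            return 0;               /* sequence is cut off */
        for (k = 1; k <= need; ++k)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;           /* not a continuation byte */
        i += need + 1;
    }
    return 1;
}
```

With this check the single byte 0xab is rejected, while the two-byte sequence 0xc2 0xab (the UTF-8 form of the « character) is accepted.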

      We need something to include a character by its hex value in a string.
      I first thought of changing "\x" for that, but this would make it
      impossible to create specific byte sequences. Let's add the "\u" item
      for this. Try the patch below. Your example should now become:

      :let foo = "\uab"

      *** ../vim61.267/src/eval.c Mon Dec 23 22:54:36 2002
      --- src/eval.c Wed Jan 1 12:41:16 2003
      ***************
      *** 2219,2236 ****
                    case 'r': name[i++] = CR; break;
                    case 't': name[i++] = TAB; break;

      !                       /* hex: "\x1", "\x12" */
      !             case 'X':
      !             case 'x': if (isxdigit(p[1]))
                              {
      !                           ++p;
      !                           name[i] = hex2nr(*p);
      !                           if (isxdigit(p[1]))
                                  {
                                      ++p;
      !                               name[i] = (name[i] << 4) + hex2nr(*p);
                                  }
      !                           ++i;
                              }
                              else
                                  name[i++] = *p;
      --- 2219,2251 ----
                    case 'r': name[i++] = CR; break;
                    case 't': name[i++] = TAB; break;

      !             case 'X': /* hex: "\x1", "\x12" */
      !             case 'x':
      !             case 'u': /* Unicode: "\u0023" */
      !             case 'U':
      !                       if (isxdigit(p[1]))
                              {
      !                           int n, nr;
      !                           int c = toupper(*p);
      !
      !                           if (c == 'X')
      !                               n = 2;
      !                           else
      !                               n = 4;
      !                           nr = 0;
      !                           while (--n >= 0 && isxdigit(p[1]))
                                  {
                                      ++p;
      !                               nr = (nr << 4) + hex2nr(*p);
                                  }
      ! #ifdef FEAT_MBYTE
      !                           /* For "\u" store the number according to
      !                            * 'encoding'. */
      !                           if (c != 'X')
      !                               i += (*mb_char2bytes)(nr, name + i);
      !                           else
      ! #endif
      !                               name[i++] = nr;
                              }
                              else
                                  name[i++] = *p;
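
As a rough illustration of the difference between the two branches in the patch, the following standalone C sketch stores a number either as one raw byte (the "\x" case) or as an encoded character (the "\u" case). It assumes 'encoding' is utf-8. The names utf8_encode_bmp and store_escape are hypothetical stand-ins for what the patch does through mb_char2bytes, and surrogate values are not treated specially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Vim's mb_char2bytes(): encode a code point
 * below 0x10000 (the most that four hex digits can express) as UTF-8.
 * Returns the number of bytes written to out (1 to 3). */
size_t utf8_encode_bmp(unsigned nr, unsigned char *out)
{
    assert(out != NULL);
    assert(nr < 0x10000u);

    if (nr < 0x80u) {
        out[0] = (unsigned char)nr;
        return 1;
    }
    if (nr < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (nr >> 6));
        out[1] = (unsigned char)(0x80u | (nr & 0x3Fu));
        return 2;
    }
    out[0] = (unsigned char)(0xE0u | (nr >> 12));
    out[1] = (unsigned char)(0x80u | ((nr >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (nr & 0x3Fu));
    return 3;
}

/* Illustrative mirror of the patched branch: "\x" keeps the value as a
 * single raw byte, "\u" stores it as a character in the (assumed)
 * utf-8 encoding. Returns the number of bytes written. */
size_t store_escape(char kind, unsigned nr, unsigned char *out)
{
    assert(out != NULL);

    if (kind == 'x' || kind == 'X') {
        out[0] = (unsigned char)nr;  /* may be an illegal byte in utf-8 */
        return 1;
    }
    return utf8_encode_bmp(nr, out);
}
```

Under these assumptions store_escape('x', 0xab, buf) writes the single byte 0xab, while store_escape('u', 0xab, buf) writes the two bytes 0xc2 0xab.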

      --
      hundred-and-one symptoms of being an internet addict:
      264. You turn to the teletext page "surfing report" and are surprised that it
      is about sizes of waves and a weather forecast for seaside resorts.

      /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
      /// Creator of Vim - Vi IMproved -- http://www.vim.org \\\
      \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
      \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
    • Benji Fisher
      Message 2 of 9 , Jan 1, 2003
        Bram Moolenaar wrote:
        > Benji Fisher wrote:
        >
        >
        >> Thanks for the reply. This has been bothering me for a while, and
        >>I do not think anyone else can help with it.
        >>
        >> The problem is not caused by changing 'encoding' after assigning
        >>the variable. I can try
        >>
        >> :set encoding?
        >> utf-8
        >> :let foo = "\xab"
        >> :echo foo =~ foo
        >> 0
        >>
        >> I would like to have a script insert a character like the one
        >>given by the digraph "<C-K><<". According to "ga" (Normal mode) this
        >>character has code 0xab , so I try
        >>
        >> :let foo = "\xab"
        >> :put=foo
        >>
        >>and this inserts a single character that looks like "<ab>". Then I try
        >>
        >> :let foo = iconv("\xab", "latin1", &enc)
        >> :put=foo
        >>
        >>and I get the "<<" digraph. Strangely, "ga" tells me these are both
        >>"Hex 00ab".
        >>
        >> Are the tricks with iconv() supposed to work? Is there a simpler way?
        >
        >
        > "ga" shows the ascii value, but it doesn't handle an illegal byte
        > differently, thus a single "ab" byte will show "00ab" and the two-byte
        > UTF-8 sequence for the "ab" character will also show "ab". "g8" shows
        > what's really there.

        Thanks. That works.

        > We need something to include a character by its hex value in a string.
        > I first thought of changing "\x" for that, but this would make it
        > impossible to create specific byte sequences. Let's add the "\u" item
        > for this. Try the patch below. Your example should now become:
        >
        > :let foo = "\uab"
        [patch snipped]

        I tried this:

        :let foo = "\uab"
        :let bar = "\xab"
        :echo foo bar
        « <ab>
        :echo foo == bar
        0
        :echo foo =~ "^" . bar . "$"
        1
        :echo foo =~ "^" . foo . "$"
        1
        echo foo == iconv(bar, "latin1", &enc)
        1

        so it looks pretty good to me. The second =~ test is a little strange,
        but should probably work this way for backward compatibility.

        On the question of changing "\x" or adding "\u":
        * Since vim is a *text* editor, I am not convinced that it should be
        able to enter invalid bytes into my document. (I admit that
        :put=\"\xe4\" does not count as entering a character *easily*.) Perhaps
        it would be better to make "\x" act like the new "\u" after all.
        * By habit and because of legacy scripts, people will continue to use
        "\x". I assume that the new "\u" will be recommended for most purposes
        (and the docs will mention this). It will take a while for people to
        adjust. Again, this argues for using "\x" to insert valid bytes, and
        adding a new construct for arbitrary bytes.

        Final question: I want my script to be able to insert "«" without
        forcing users to adopt the latest patched vim. (I am thinking of the
        LaTeX suite.) Instead of

        :let foo = "\uab"

        with this patch, should

        :let foo = iconv("\xab", "latin1", &enc)

        have the same effect? It seems to work, as far as I can tell.

        --Benji Fisher
      • Antoine J. Mechelynck
        Message 3 of 9 , Jan 1, 2003
          Benji Fisher <benji@...> wrote:
          [...]
          > On the question of changing "\x" or adding "\u":
          > * Since vim is a *text* editor, I am not convinced that it should be
          > able to enter invalid bytes into my document. (I admit that
          > :put=\"\xe4\" does not count as entering a character *easily*.) Perhaps
          > it would be better to make "\x" act like the new "\u" after all.
          > * By habit and because of legacy scripts, people will continue to use
          > "\x". I assume that the new "\u" will be recommended for most purposes
          > (and the docs will mention this). It will take a while for people to
          > adjust. Again, this argues for using "\x" to insert valid bytes, and
          > adding a new construct for arbitrary bytes.

          Personally I am partisan of leaving the existing \x unchanged on
          compatibility grounds. Giving a new meaning (insert Unicode codepoint) to a
          hitherto undefined sequence should IMHO create fewer problems.

          >
          > Final question: I want my script to be able to insert "«" without
          > forcing users to adopt the latest patched vim. (I am thinking of the
          > LaTeX suite.) Instead of
          >
          > :let foo = "\uab"
          >
          > with this patch, should
          >
          > :let foo = iconv("\xab", "latin1", &enc)
          >
          > have the same effect? It seems to work, as far as I can tell.

          have you tried it with encodings for which there is no equivalent for that
          latin-1 character? (Iconv fails: what happens then?)

          >
          > --Benji Fisher

          Best wishes -- and a happy New Year
          Tony.
        • Bram Moolenaar
          Message 4 of 9 , Jan 1, 2003
            Benji Fisher wrote:

            > so it looks pretty good to me. The second =~ test is a little strange,
            > but should probably work this way for backward compatibility.

            Thanks for testing. It's a matter of taste whether foo =~ bar should
            result in TRUE or FALSE. Let's just leave it as it is until someone has
            a good reason why it should be different.

            > On the question of changing "\x" or adding "\u":
            > * Since vim is a *text* editor, I am not convinced that it should be
            > able to enter invalid bytes into my document. (I admit that
            > :put=\"\xe4\" does not count as entering a character *easily*.) Perhaps
            > it would be better to make "\x" act like the new "\u" after all.

            There are always exceptions, e.g. when 'encoding' is not properly set or
            when intentionally creating illegal bytes. I don't think we have a good
            reason to forbid inserting any byte value.

            > * By habit and because of legacy scripts, people will continue to use
            > "\x". I assume that the new "\u" will be recommended for most purposes
            > (and the docs will mention this). It will take a while for people to
            > adjust. Again, this argues for using "\x" to insert valid bytes, and
            > adding a new construct for arbitrary bytes.

            Existing scripts that use "\xab" to insert valid UTF-8 bytes should keep
            on working, that's another reason why changing the meaning of "\xab" is
            a bad idea.

            > Final question: I want my script to be able to insert "«" without
            > forcing users to adopt the latest patched vim. (I am thinking of the
            > LaTeX suite.) Instead of
            >
            > :let foo = "\uab"
            >
            > with this patch, should
            >
            > :let foo = iconv("\xab", "latin1", &enc)
            >
            > have the same effect? It seems to work, as far as I can tell.

            If iconv() is supported it should work, so long as 'encoding'
            has a character to represent the latin1 "\xab" character (not all
            8-bit encodings have it).
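
For reference, the same conversion can be sketched in C with the POSIX iconv(3) interface, which as far as I know is the library that Vim's iconv() relies on when built with iconv support. convert_bytes is a hypothetical helper name, and the encoding names accepted depend on the iconv implementation installed. The sketch treats a character that cannot be represented in the target encoding as a failure.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in[0..inlen) from encoding `from` to
 * encoding `to` using iconv(3). Returns the number of bytes written to
 * out, or (size_t)-1 when the conversion is unsupported, fails, or had
 * to substitute a character the target encoding lacks. */
size_t convert_bytes(const char *from, const char *to,
                     const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc;

    assert(from != NULL && to != NULL && in != NULL && out != NULL);

    cd = iconv_open(to, from);  /* note the order: target first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* this pair of encodings is unsupported */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc == 0)                /* flush any pending shift sequence */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    /* rc is (size_t)-1 on error and a positive count when characters
     * were converted non-reversibly; both are reported as failure. */
    if (rc != 0 || inleft != 0)
        return (size_t)-1;
    return outcap - outleft;
}
```

Converting the latin1 byte 0xab to UTF-8 this way should yield the two bytes 0xc2 0xab. Because a failure is reported instead of silently substituting a byte, the caller gets a chance to fall back to something else.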

            --
            hundred-and-one symptoms of being an internet addict:
            269. You receive an e-mail from the wife of a deceased president, offering
            to send you twenty million dollar, and you are not even surprised.

            /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
            /// Creator of Vim - Vi IMproved -- http://www.vim.org \\\
            \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
            \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
          • Benji Fisher
            Message 5 of 9 , Jan 1, 2003
              Antoine J. Mechelynck wrote:
              > Benji Fisher <benji@...> wrote:
              >> Final question: I want my script to be able to insert "«" without
              >>forcing users to adopt the latest patched vim. (I am thinking of the
              >>LaTeX suite.) Instead of
              >>
              >>
              >> :let foo = "\uab"
              >>
              >>with this patch, should
              >>
              >>
              >> :let foo = iconv("\xab", "latin1", &enc)
              >>
              >>have the same effect? It seems to work, as far as I can tell.
              >
              >
              > have you tried it with encodings for which there is no equivalent for that
              > latin-1 character? (Iconv fails: what happens then?)

              No, I have only tried it with utf-8 and latin1. What other
              encodings should I try?

              > Best wishes -- and a happy New Year
              > Tony.

              Thanks!

              --Benji Fisher
            • Antoine J. Mechelynck
              Message 6 of 9 , Jan 1, 2003
                Benji Fisher <benji@...> wrote:
                > Antoine J. Mechelynck wrote:
                [...]
                > > have you tried it with encodings for which there is no equivalent for
                > > that latin-1 character? (Iconv fails: what happens then?)
                >
                > No, I have only tried it with utf-8 and latin1. What other
                > encodings should I try?

                As many as possible, of course; but this is not really an answer. Maybe you
                could start, if you have them, with Central-European and Turkish encodings,
                then if it works OK, with more esoteric ones like Greek, Cyrillic, Big5,
                sjis, euc-kr,... and wouldn't digraphs << and >> need to be switched around
                for right-to-left languages like Hebrew, Farsi and Arabic? -- As you see,
                I'm thinking of what the plugin would need to be as general as possible, for
                as many users as possible. Also, as could be inferred from Bram's post of a
                few minutes ago, maybe there ought to be a fallback if iconv() fails for any
                reason, and in particular when !has("iconv")...

                Tony.

                >
                > > Best wishes -- and a happy New Year
                > > Tony.
                >
                > Thanks!
                >
                > --Benji Fisher