multibyte in patterns

  • Benji Fisher
    Message 1 of 9 , Dec 13, 2002
      More multi-byte woes. If 'encoding' is set to utf-8, then a MB
      character does not match itself!

      :let foo = "\xab"
      :set enc?
      encoding=latin1
      :echo foo =~ foo
      1
      :set enc=utf8
      :set enc?
      encoding=utf-8
      :echo foo =~ foo
      0

      Is this expected, or is it a bug? Is there a work-around? Am I
      supposed to do this?

      :echo iconv(foo, 'latin1', &enc) =~ foo
      1

      --Benji Fisher
    • Bram Moolenaar
      Message 2 of 9 , Dec 31, 2002
        Benji Fisher wrote:

        > More multi-byte woes. If 'encoding' is set to utf-8, then a MB
        > character does not match itself!
        >
        > :let foo = "\xab"
        > :set enc?
        > encoding=latin1
        > :echo foo =~ foo
        > 1
        > :set enc=utf8
        > :set enc?
        > encoding=utf-8
        > :echo foo =~ foo
        > 0
        >
        > Is this expected, or is it a bug? Is there a work-around? Am I
        > supposed to do this?
        >
        > :echo iconv(foo, 'latin1', &enc) =~ foo
        > 1

        Changing 'encoding' makes multi-byte characters in registers and
        variables invalid, although you would still expect "foo =~ foo" to
        work. It doesn't because the 0xab byte on its own is an invalid
        character in UTF-8, so it doesn't match anything.
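
A minimal sketch of why the lone byte is invalid, assuming a reasonably recent Vim (and() does not exist in Vim 6) and using illustrative variable names. In UTF-8 every byte of the form 10xxxxxx is a continuation byte, so 0xab cannot stand alone; converting it from latin1 produces a two-byte sequence instead.

```vim
" 0xab is 10101011 in binary. Bytes of the form 10xxxxxx are UTF-8
" continuation bytes and cannot start a character.
let is_continuation = and(0xab, 0xc0) == 0x80

" A raw byte versus the same character converted from latin1 to UTF-8.
let raw  = "\xab"
let conv = iconv(raw, 'latin1', 'utf-8')

" strlen() counts bytes, so this is expected to print: 1 1 2
echo is_continuation strlen(raw) strlen(conv)
```

The converted string is the one that can be matched as a character when 'encoding' is utf-8.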

        --
        hundred-and-one symptoms of being an internet addict:
        251. You've never seen your closest friends who usually live WAY too far away.

        /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
        /// Creator of Vim - Vi IMproved -- http://www.vim.org \\\
        \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
        \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
      • Benji Fisher
        Message 3 of 9 , Dec 31, 2002
          Bram Moolenaar wrote:
          > Benji Fisher wrote:
          >
          >
          >> More multi-byte woes. If 'encoding' is set to utf-8, then a MB
          >>character does not match itself!
          >>
          >>:let foo = "\xab"
          >>:set enc?
          >> encoding=latin1
          >>:echo foo =~ foo
          >>1
          >>:set enc=utf8
          >>:set enc?
          >> encoding=utf-8
          >>:echo foo =~ foo
          >>0
          >>
          >>Is this expected, or is it a bug? Is there a work-around? Am I
          >>supposed to do this?
          >>
          >>:echo iconv(foo, 'latin1', &enc) =~ foo
          >>1
          >
          > Changing 'encoding' makes multi-byte characters in registers and
          > variables invalid. Although you would still expect "foo =~ foo" to
          > work. The reason it doesn't is that the 0xab byte is an invalid
          > character, it doesn't match anything.

          Thanks for the reply. This has been bothering me for a while, and
          I do not think anyone else can help with it.

          The problem is not caused by changing 'encoding' after assigning
          the variable. I can try

          :set encoding?
          utf-8
          :let foo = "\xab"
          :echo foo =~ foo
          0

          I would like to have a script insert a character like the one
          given by the digraph "<C-K><<". According to "ga" (Normal mode) this
          character has code 0xab , so I try

          :let foo = "\xab"
          :put=foo

          and this inserts a single character that looks like "<ab>". Then I try

          :let foo = iconv("\xab", "latin1", &enc)
          :put=foo

          and I get the "<<" digraph. Strangely, "ga" tells me these are both
          "Hex 00ab".

          Are the tricks with iconv() supposed to work? Is there a simpler way?

          --Benji Fisher
        • Bram Moolenaar
          Message 4 of 9 , Jan 1, 2003
            Benji Fisher wrote:

            > Thanks for the reply. This has been bothering me for a while, and
            > I do not think anyone else can help with it.
            >
            > The problem is not caused by changing 'encoding' after assigning
            > the variable. I can try
            >
            > :set encoding?
            > utf-8
            > :let foo = "\xab"
            > :echo foo =~ foo
            > 0
            >
            > I would like to have a script insert a character like the one
            > given by the digraph "<C-K><<". According to "ga" (Normal mode) this
            > character has code 0xab , so I try
            >
            > :let foo = "\xab"
            > :put=foo
            >
            > and this inserts a single character that looks like "<ab>". Then I try
            >
            > :let foo = iconv("\xab", "latin1", &enc)
            > :put=foo
            >
            > and I get the "<<" digraph. Strangely, "ga" tells me these are both
            > "Hex 00ab".
            >
            > Are the tricks with iconv() supposed to work? Is there a simpler way?

            "ga" shows the ascii value, but it doesn't handle an illegal byte
            differently: a single 0xab byte shows as "00ab", and the two-byte
            UTF-8 sequence for the U+00AB character also shows "00ab". "g8" shows
            the bytes that are really there.
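
The difference between the two views can be sketched in Vim script, assuming Vim 7 or later (lists and printf()). BytesOf() is a hypothetical helper, not a built-in. char2nr() reports a character value much as "ga" does, while dumping the bytes is closer to what "g8" shows.

```vim
" Hypothetical helper: hex dump of the bytes in a string.
function! BytesOf(s)
  let out = []
  let i = 0
  while i < strlen(a:s)
    " a:s[i] is the byte at index i; char2nr() of a single byte gives its value.
    call add(out, printf('%02x', char2nr(a:s[i])))
    let i += 1
  endwhile
  return join(out, ' ')
endfunction

let raw  = "\xab"
let conv = iconv(raw, 'latin1', 'utf-8')

" With 'encoding' set to utf-8, both character values should be 171 (0xab).
echo char2nr(raw) char2nr(conv)
" The byte dumps differ: one byte for raw, two bytes for conv.
echo BytesOf(raw)
echo BytesOf(conv)
```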

            We need something to include a character by its hex value in a string.
            I first thought of changing "\x" for that, but this would make it
            impossible to create specific byte sequences. Let's add the "\u" item
            for this. Try the patch below. Your example should now become:

            :let foo = "\uab"

            *** ../vim61.267/src/eval.c Mon Dec 23 22:54:36 2002
            --- src/eval.c Wed Jan 1 12:41:16 2003
            ***************
            *** 2219,2236 ****
            case 'r': name[i++] = CR; break;
            case 't': name[i++] = TAB; break;

            ! /* hex: "\x1", "\x12" */
            ! case 'X':
            ! case 'x': if (isxdigit(p[1]))
            {
            ! ++p;
            ! name[i] = hex2nr(*p);
            ! if (isxdigit(p[1]))
            {
            ++p;
            ! name[i] = (name[i] << 4) + hex2nr(*p);
            }
            ! ++i;
            }
            else
            name[i++] = *p;
            --- 2219,2251 ----
            case 'r': name[i++] = CR; break;
            case 't': name[i++] = TAB; break;

            ! case 'X': /* hex: "\x1", "\x12" */
            ! case 'x':
            ! case 'u': /* Unicode: "\u0023" */
            ! case 'U':
            ! if (isxdigit(p[1]))
            {
            ! int n, nr;
            ! int c = toupper(*p);
            !
            ! if (c == 'X')
            ! n = 2;
            ! else
            ! n = 4;
            ! nr = 0;
            ! while (--n >= 0 && isxdigit(p[1]))
            {
            ++p;
            ! nr = (nr << 4) + hex2nr(*p);
            }
            ! #ifdef FEAT_MBYTE
            ! /* For "\u" store the number according to
            ! * 'encoding'. */
            ! if (c != 'X')
            ! i += (*mb_char2bytes)(nr, name + i);
            ! else
            ! #endif
            ! name[i++] = nr;
            }
            else
            name[i++] = *p;

            --
            hundred-and-one symptoms of being an internet addict:
            264. You turn to the teletext page "surfing report" and are surprised that it
            is about sizes of waves and a weather forecast for seaside resorts.

          • Benji Fisher
            Message 5 of 9 , Jan 1, 2003
              Bram Moolenaar wrote:
              > Benji Fisher wrote:
              >
              >
              >> Thanks for the reply. This has been bothering me for a while, and
              >>I do not think anyone else can help with it.
              >>
              >> The problem is not caused by changing 'encoding' after assigning
              >>the variable. I can try
              >>
              >> :set encoding?
              >> utf-8
              >> :let foo = "\xab"
              >> :echo foo =~ foo
              >> 0
              >>
              >> I would like to have a script insert a character like the one
              >>given by the digraph "<C-K><<". According to "ga" (Normal mode) this
              >>character has code 0xab , so I try
              >>
              >> :let foo = "\xab"
              >> :put=foo
              >>
              >>and this inserts a single character that looks like "<ab>". Then I try
              >>
              >> :let foo = iconv("\xab", "latin1", &enc)
              >> :put=foo
              >>
              >>and I get the "<<" digraph. Strangely, "ga" tells me these are both
              >>"Hex 00ab".
              >>
              >> Are the tricks with iconv() supposed to work? Is there a simpler way?
              >
              >
              > "ga" shows the ascii value, but it doesn't handle an illegal byte
              > differently, thus a single "ab" byte will show "00ab" and the two-byte
              > UTF-8 sequence for the "ab" character will also show "ab". "g8" shows
              > what's really there.

              Thanks. That works.

              > We need something to include a character by its hex value in a string.
              > I first thought of changing "\x" for that, but this would make it
              > impossible to create specific byte sequences. Let's add the "\u" item
              > for this. Try the patch below. Your example should now become:
              >
              > :let foo = "\uab"
              [patch snipped]

              I tried this:

              :let foo = "\uab"
              :let bar = "\xab"
              :echo foo bar
              « <ab>
              :echo foo == bar
              0
              :echo foo =~ "^" . bar . "$"
              1
              :echo foo =~ "^" . foo . "$"
              1
              :echo foo == iconv(bar, "latin1", &enc)
              1

              so it looks pretty good to me. The second =~ test is a little strange,
              but should probably work this way for backward compatibility.

              On the question of changing "\x" or adding "\u":
              * Since vim is a *text* editor, I am not convinced that it should be
              able to enter invalid bytes into my document. (I admit that
              :put=\"\xe4\" does not count as entering a character *easily*.) Perhaps
              it would be better to make "\x" act like the new "\u" after all.
              * By habit and because of legacy scripts, people will continue to use
              "\x". I assume that the new "\u" will be recommended for most purposes
              (and the docs will mention this). It will take a while for people to
              adjust. Again, this argues for using "\x" to insert valid bytes, and
              adding a new construct for arbitrary bytes.

              Final question: I want my script to be able to insert "«" without
              forcing users to adopt the latest patched vim. (I am thinking of the
              LaTeX suite.) Instead of

              :let foo = "\uab"

              with this patch, should

              :let foo = iconv("\xab", "latin1", &enc)

              have the same effect? It seems to work, as far as I can tell.

              --Benji Fisher
            • Antoine J. Mechelynck
              Message 6 of 9 , Jan 1, 2003
                Benji Fisher <benji@...> wrote:
                [...]
                > On the question of changing "\x" or adding "\u":
                > * Since vim is a *text* editor, I am not convinced that it should be
                > able to enter invalid bytes into my document. (I admit that
                > :put=\"xe4\" does not count as entering a character *easily*.) Perhaps
                > it would be better to make "\x" act like the new "\u" after all.
                > * By habit and because of legacy scripts, people will continue to use
                > "\x". I assume that the new "\u" will be recommended for most purposes
                > (and the docs will mention this). It will take a while for people to
                > adjust. Again, this argues for using "\x" to insert valid bytes, and
                > adding a new construct for arbitrary bytes.

                Personally, I am in favour of leaving the existing \x unchanged on
                compatibility grounds. Giving a new meaning (insert Unicode codepoint) to a
                hitherto undefined sequence should IMHO create fewer problems.

                >
                > Final question: I want my script to be able to insert "«" without
                > forcing users to adopt the latest patched vim. (I am thinking of the
                > LaTeX suite.) Instead of
                >
                > :let foo = "\uab"
                >
                > with this patch, should
                >
                > :let foo = iconv("\xab", "latin1", &enc)
                >
                > have the same effect? It seems to work, as far as I can tell.

                Have you tried it with encodings that have no equivalent for that
                latin-1 character? (If iconv fails, what happens then?)

                >
                > --Benji Fisher

                Best wishes -- and a happy New Year
                Tony.
              • Bram Moolenaar
                Message 7 of 9 , Jan 1, 2003
                  Benji Fisher wrote:

                  > so it looks pretty good to me. The second =~ test is a little strange,
                  > but should probably work this way for backward compatibility.

                  Thanks for testing. It's a matter of taste whether foo =~ bar should
                  result in TRUE or FALSE. Let's just leave it as it is until someone has
                  a good reason why it should be different.

                  > On the question of changing "\x" or adding "\u":
                  > * Since vim is a *text* editor, I am not convinced that it should be
                  > able to enter invalid bytes into my document. (I admit that
                  > :put=\"xe4\" does not count as entering a character *easily*.) Perhaps
                  > it would be better to make "\x" act like the new "\u" after all.

                  There are always exceptions, e.g. when 'encoding' is not properly set or
                  when intentionally creating illegal bytes. I don't think we have a good
                  reason to forbid inserting any byte value.

                  > * By habit and because of legacy scripts, people will continue to use
                  > "\x". I assume that the new "\u" will be recommended for most purposes
                  > (and the docs will mention this). It will take a while for people to
                  > adjust. Again, this argues for using "\x" to insert valid bytes, and
                  > adding a new construct for arbitrary bytes.

                  Existing scripts that use "\xab" to insert valid UTF-8 bytes should keep
                  on working; that's another reason why changing the meaning of "\xab" is
                  a bad idea.

                  > Final question: I want my script to be able to insert "«" without
                  > forcing users to adopt the latest patched vim. (I am thinking of the
                  > LaTeX suite.) Instead of
                  >
                  > :let foo = "\uab"
                  >
                  > with this patch, should
                  >
                  > :let foo = iconv("\xab", "latin1", &enc)
                  >
                  > have the same effect? It seems to work, as far as I can tell.

                  If iconv() is supported it should work, so long as 'encoding' has a
                  character that represents the latin1 "\xab" character (not all
                  8-bit encodings have it).

                  --
                  hundred-and-one symptoms of being an internet addict:
                  269. You receive an e-mail from the wife of a deceased president, offering
                  to send you twenty million dollars, and you are not even surprised.

                • Benji Fisher
                  Message 8 of 9 , Jan 1, 2003
                    Antoine J. Mechelynck wrote:
                    > Benji Fisher <benji@...> wrote:
                    >> Final question: I want my script to be able to insert "«" without
                    >>forcing users to adopt the latest patched vim. (I am thinking of the
                    >>LaTeX suite.) Instead of
                    >>
                    >>
                    >> :let foo = "\uab"
                    >>
                    >>with this patch, should
                    >>
                    >>
                    >> :let foo = iconv("\xab", "latin1", &enc)
                    >>
                    >>have the same effect? It seems to work, as far as I can tell.
                    >
                    >
                    > have you tried it with encodings for which there is no equivalent for that
                    > latin-1 character? (Iconv fails: what happens then?)

                    No, I have only tried it with utf-8 and latin1. What other
                    encodings should I try?

                    > Best wishes -- and a happy New Year
                    > Tony.

                    Thanks!

                    --Benji Fisher
                  • Antoine J. Mechelynck
                    Message 9 of 9 , Jan 1, 2003
                      Benji Fisher <benji@...> wrote:
                      > Antoine J. Mechelynck wrote:
                      [...]
                      > > have you tried it with encodings for which there is no equivalent for
                      > > that latin-1 character? (Iconv fails: what happens then?)
                      >
                      > No, I have only tried it with utf-8 and latin1. What other
                      > encodings should I try?

                      As many as possible, of course; but this is not really an answer. Maybe you
                      could start, if you have them, with Central-European and Turkish encodings,
                      then, if it works OK, with more esoteric ones like Greek, Cyrillic, Big5,
                      sjis, euc-kr,... And wouldn't digraphs << and >> need to be switched around
                      for right-to-left languages like Hebrew, Farsi and Arabic? As you see,
                      I'm thinking of what the plugin would need in order to be as general as
                      possible, for as many users as possible. Also, as could be inferred from
                      Bram's post of a few minutes ago, maybe there ought to be a fallback if
                      iconv() fails for any reason, and in particular when !has("iconv").
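
One possible shape for such a fallback, as a sketch only. LeftGuillemet() is a hypothetical name, and the ASCII stand-in "<<" is an assumption about what the plugin would want. It relies on the documented iconv() behaviour of returning an empty string when conversion fails completely and "?" for a character that cannot be converted. Right-to-left handling is not attempted.

```vim
" Hypothetical helper: the left guillemet in the current 'encoding',
" or the ASCII stand-in "<<" when it cannot be produced.
function! LeftGuillemet()
  if &encoding ==# 'latin1'
    " The raw byte already is the character.
    return "\xab"
  elseif &encoding ==# 'utf-8' || has('iconv')
    " latin1 <-> utf-8 is built in; other targets need the +iconv feature.
    let conv = iconv("\xab", 'latin1', &encoding)
    if conv !=# '' && conv !=# '?'
      return conv
    endif
  endif
  return '<<'
endfunction
```

A script could then insert the result with something like :put =LeftGuillemet() instead of hard-coding "\xab".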

                      Tony.

                      >
                      > > Best wishes -- and a happy New Year
                      > > Tony.
                      >
                      > Thanks!
                      >
                      > --Benji Fisher