Re: multibyte in patterns

  • Benji Fisher
    Message 1 of 9 , Jan 1, 2003
      Bram Moolenaar wrote:
      > Benji Fisher wrote:
      >
      >
      >> Thanks for the reply. This has been bothering me for a while, and
      >>I do not think anyone else can help with it.
      >>
      >> The problem is not caused by changing 'encoding' after assigning
      >>the variable. I can try
      >>
      >> :set encoding?
      >> utf-8
      >> :let foo = "\xab"
      >> :echo foo =~ foo
      >> 0
      >>
      >> I would like to have a script insert a character like the one
      >>given by the digraph "<C-K><<". According to "ga" (Normal mode) this
      >>character has code 0xab , so I try
      >>
      >> :let foo = "\xab"
      >> :put=foo
      >>
      >>and this inserts a single character that looks like "<ab>". Then I try
      >>
      >> :let foo = iconv("\xab", "latin1", &enc)
      >> :put=foo
      >>
      >>and I get the "<<" digraph. Strangely, "ga" tells me these are both
      >>"Hex 00ab".
      >>
      >> Are the tricks with iconv() supposed to work? Is there a simpler way?
      >
      >
      > "ga" shows the ascii value, but it doesn't handle an illegal byte
      > differently, thus a single "ab" byte will show "00ab" and the two-byte
      > UTF-8 sequence for the "ab" character will also show "ab". "g8" shows
      > what's really there.

      Thanks. That works.

      > We need something to include a character by its hex value in a string.
      > I first thought of changing "\x" for that, but this would make it
      > impossible to create specific byte sequences. Let's add the "\u" item
      > for this. Try the patch below. Your example should now become:
      >
      > :let foo = "\uab"
      [patch snipped]

      I tried this:

      :let foo = "\uab"
      :let bar = "\xab"
      :echo foo bar
      « <ab>
      :echo foo == bar
      0
      :echo foo =~ "^" . bar . "$"
      1
      :echo foo =~ "^" . foo . "$"
      1
      :echo foo == iconv(bar, "latin1", &enc)
      1

      so it looks pretty good to me. The second =~ test is a little strange,
      but should probably work this way for backward compatibility.

      On the question of changing "\x" or adding "\u":
      * Since vim is a *text* editor, I am not convinced that it should be
      able to enter invalid bytes into my document. (I admit that
      :put=\"\xe4\" does not count as entering a character *easily*.) Perhaps
      it would be better to make "\x" act like the new "\u" after all.
      * By habit and because of legacy scripts, people will continue to use
      "\x". I assume that the new "\u" will be recommended for most purposes
      (and the docs will mention this). It will take a while for people to
      adjust. Again, this argues for using "\x" to insert valid bytes, and
      adding a new construct for arbitrary bytes.

      Final question: I want my script to be able to insert "«" without
      forcing users to adopt the latest patched vim. (I am thinking of the
      LaTeX suite.) Instead of

      :let foo = "\uab"

      with this patch, should

      :let foo = iconv("\xab", "latin1", &enc)

      have the same effect? It seems to work, as far as I can tell.

      --Benji Fisher
    • Antoine J. Mechelynck
      Message 2 of 9 , Jan 1, 2003
        Benji Fisher <benji@...> wrote:
        [...]
        > On the question of changing "\x" or adding "\u":
        > * Since vim is a *text* editor, I am not convinced that it should be
        > able to enter invalid bytes into my document. (I admit that
        > :put=\"\xe4\" does not count as entering a character *easily*.) Perhaps
        > it would be better to make "\x" act like the new "\u" after all.
        > * By habit and because of legacy scripts, people will continue to use
        > "\x". I assume that the new "\u" will be recommended for most purposes
        > (and the docs will mention this). It will take a while for people to
        > adjust. Again, this argues for using "\x" to insert valid bytes, and
        > adding a new construct for arbitrary bytes.

        Personally I am in favor of leaving the existing \x unchanged on
        compatibility grounds. Giving a new meaning (insert Unicode codepoint) to a
        hitherto undefined sequence should IMHO create fewer problems.

        >
        > Final question: I want my script to be able to insert "«" without
        > forcing users to adopt the latest patched vim. (I am thinking of the
        > LaTeX suite.) Instead of
        >
        > :let foo = "\uab"
        >
        > with this patch, should
        >
        > :let foo = iconv("\xab", "latin1", &enc)
        >
        > have the same effect? It seems to work, as far as I can tell.

        Have you tried it with encodings for which there is no equivalent for that
        latin-1 character? (Iconv fails: what happens then?)

        >
        > --Benji Fisher

        Best wishes -- and a happy New Year
        Tony.
      • Bram Moolenaar
        Message 3 of 9 , Jan 1, 2003
          Benji Fisher wrote:

          > so it looks pretty good to me. The second =~ test is a little strange,
          > but should probably work this way for backward compatibility.

          Thanks for testing. It's a matter of taste whether foo =~ bar should
          result in TRUE or FALSE. Let's just leave it as it is until someone has
          a good reason why it should be different.

          > On the question of changing "\x" or adding "\u":
          > * Since vim is a *text* editor, I am not convinced that it should be
          > able to enter invalid bytes into my document. (I admit that
          > :put=\"\xe4\" does not count as entering a character *easily*.) Perhaps
          > it would be better to make "\x" act like the new "\u" after all.

          There are always exceptions, e.g. when 'encoding' is not properly set or
          when intentionally creating illegal bytes. I don't think we have a good
          reason to forbid inserting any byte value.

          > * By habit and because of legacy scripts, people will continue to use
          > "\x". I assume that the new "\u" will be recommended for most purposes
          > (and the docs will mention this). It will take a while for people to
          > adjust. Again, this argues for using "\x" to insert valid bytes, and
          > adding a new construct for arbitrary bytes.

          Existing scripts that use "\xab" to insert valid UTF-8 bytes should keep
          on working, that's another reason why changing the meaning of "\xab" is
          a bad idea.

          > Final question: I want my script to be able to insert "«" without
          > forcing users to adopt the latest patched vim. (I am thinking of the
          > LaTeX suite.) Instead of
          >
          > :let foo = "\uab"
          >
          > with this patch, should
          >
          > :let foo = iconv("\xab", "latin1", &enc)
          >
          > have the same effect? It seems to work, as far as I can tell.

          If iconv() is supported it should work, so long as 'encoding' has a
          character to represent the latin1 "\xab" character (not all 8-bit
          encodings have it).
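
          For illustration, a minimal sketch of what the conversion does at the
          byte level. The explicit "utf-8" target is an assumption made for the
          demonstration; a script would pass &enc as in the example above.
          strlen() counts bytes, so it tells the lone latin1 byte apart from the
          two-byte UTF-8 sequence mentioned earlier in the thread:

          ```vim
          " A lone 0xab byte: the latin1 encoding of the left guillemet.
          let raw = "\xab"
          " Convert latin1 -> UTF-8; this pair is handled even without +iconv.
          let conv = iconv(raw, "latin1", "utf-8")
          " One byte before conversion: prints 1
          echo strlen(raw)
          " Two bytes after conversion: prints 2
          echo strlen(conv)
          " The result is the byte sequence 0xc2 0xab: prints 1
          echo conv ==# "\xc2\xab"
          ```

          The comments sit on their own lines because :echo would read a
          trailing double quote as the start of a string.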

          --
          hundred-and-one symptoms of being an internet addict:
          269. You receive an e-mail from the wife of a deceased president, offering
          to send you twenty million dollar, and you are not even surprised.

          /// Bram Moolenaar -- Bram@... -- http://www.moolenaar.net \\\
          /// Creator of Vim - Vi IMproved -- http://www.vim.org \\\
          \\\ Project leader for A-A-P -- http://www.a-a-p.org ///
          \\\ Lord Of The Rings helps Uganda - http://iccf-holland.org/lotr.html ///
        • Benji Fisher
          Message 4 of 9 , Jan 1, 2003
            Antoine J. Mechelynck wrote:
            > Benji Fisher <benji@...> wrote:
            >> Final question: I want my script to be able to insert "«" without
            >>forcing users to adopt the latest patched vim. (I am thinking of the
            >>LaTeX suite.) Instead of
            >>
            >>
            >> :let foo = "\uab"
            >>
            >>with this patch, should
            >>
            >>
            >> :let foo = iconv("\xab", "latin1", &enc)
            >>
            >>have the same effect? It seems to work, as far as I can tell.
            >
            >
            > Have you tried it with encodings for which there is no equivalent for that
            > latin-1 character? (Iconv fails: what happens then?)

            No, I have only tried it with utf-8 and latin1. What other
            encodings should I try?

            > Best wishes -- and a happy New Year
            > Tony.

            Thanks!

            --Benji Fisher
          • Antoine J. Mechelynck
            Message 5 of 9 , Jan 1, 2003
              Benji Fisher <benji@...> wrote:
              > Antoine J. Mechelynck wrote:
              [...]
              > > Have you tried it with encodings for which there is no equivalent for
              > > that latin-1 character? (Iconv fails: what happens then?)
              >
              > No, I have only tried it with utf-8 and latin1. What other
              > encodings should I try?

              As many as possible, of course; but this is not really an answer. Maybe you
              could start, if you have them, with Central-European and Turkish encodings,
              then if it works OK, with more esoteric ones like Greek, Cyrillic, Big5,
              sjis, euc-kr,... and wouldn't digraphs << and >> need to be switched around
              for right-to-left languages like Hebrew, Farsi and Arabic? -- As you see,
              I'm thinking of what the plugin would need in order to be as general as
              possible, for as many users as possible. Also, as could be inferred from
              Bram's post of a few minutes ago, maybe there ought to be a fallback if
              iconv() fails for any reason, and in particular if !has("iconv")...
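
              One possible shape for such a fallback, as a rough sketch only. The
              function name LeftGuillemet() and the ASCII stand-in "<<" are made up
              for illustration. The checks assume the documented behavior of iconv():
              an empty string when the conversion fails completely, and "?" in place
              of a character that cannot be converted.

              ```vim
              " Hypothetical helper (not from the thread): return the left guillemet
              " in the current 'encoding', or the ASCII stand-in "<<" if unavailable.
              function! LeftGuillemet()
                " latin1 already stores the character as the single byte 0xab.
                if &encoding ==? "latin1"
                  return "\xab"
                endif
                " latin1 <-> UTF-8 is built in; other targets need the +iconv feature.
                if has("iconv") || &encoding ==? "utf-8"
                  let converted = iconv("\xab", "latin1", &encoding)
                  " iconv() gives "" on total failure, "?" for an unconvertible char.
                  if converted != "" && converted != "?"
                    return converted
                  endif
                endif
                return "<<"
              endfunction
              ```

              A script could then insert the character with :put =LeftGuillemet().
              Whether "<<" is an acceptable substitute, and whether the marks should
              be swapped for right-to-left languages, is left open here.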

              Tony.

              >
              > > Best wishes -- and a happy New Year
              > > Tony.
              >
              > Thanks!
              >
              > --Benji Fisher