
Re: Dealing with empty strings in regexp.

  • LCD 47
    Message 1 of 10 , Jun 18, 2013
      On 18 June 2013, Paul Isambert <zappathustra@...> wrote:
      > Hello all,
      >
      > The following issue has been recently discussed on the Lua mailing list:
      > http://lua-users.org/lists/lua-l/2013-04/msg00812.html
      >
      > (It has also been independently raised on the LuaTeX list:
      > http://tug.org/pipermail/luatex/2013-June/004418.html)
      >
      > If I understand correctly, any string can be represented with
      > interspersed empty substrings. E.g. “abc” is really “ϵaϵbϵcϵ”, where
      > “ϵ” is the empty string. Now, there seem to be two ways to deal with
      > those empty strings in regexps, especially regarding the “*” operator:

      You're making up a metaphysics of empty substrings. I humbly submit
      that there is no such thing in the programming languages you mention
      (don't know about Lua though).

      > - The Perl way: “X*” matches as many “X” as possible, and does not
      > include the following empty string.

      $ echo -n abc | perl -pe 's/[ac]*/($&)/g'
      (a)()b(c)()

      The key to understanding this is to keep in mind that:

      (1) "*" is greedy; and
      (2) "/g" is documented in perlre, together with "/c", as "Global
      matching, and keep the Current position after failed matching."

      Try something like this if you want the gory details:

      $ echo -n abc | perl -Mre=debug -ne 's/[ac]*/($&)/g'
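
The two points above (greedy "*" plus a current-position pointer) can be
illustrated without Perl. The following is only a sketch in Python, whose
re.finditer scans left to right in a comparable way; the helper name
match_spans is made up for this illustration and is not part of any library.

```python
import re

def match_spans(pattern, text):
    """Return (start, end, matched_text) for each match, scanning left to right."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

spans = match_spans(r"[ac]*", "abc")
for start, end, matched in spans:
    print(start, end, repr(matched))
# 0 1 'a'   greedy match of "a"
# 1 1 ''    empty match in front of "b", right where the previous match ended
# 2 3 'c'   greedy match of "c"
# 3 3 ''    empty match at the end of the string
```

Wrapping each of these four matches in parentheses and copying the unmatched
"b" through gives the (a)()b(c)() shown above.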

      > - The Python (or sed) way: “X*” matches as many “X” as possible, and
      > includes the following empty string.
      >
      > Starting empty strings are always included. So, the Perl way gives (I
      > use Ruby, since I can’t speak Perl):
      >
      > puts 'abc'.gsub(/[ac]*/, '(\0)')
      > # returns “(a)()b(c)()”, really “(ϵa)(ϵ)b(ϵc)(ϵ)”

      Same thing with Ruby: there's a current position pointer, keeping
      track of the current match.

      > And the Python way:
      >
      > import re
      > print re.sub(re.compile('([ac]*)'), '(\\1)', 'abc')
      > # returns “(a)b(c)”, really “(ϵaϵ)b(ϵcϵ)”

      With Python, re.sub() "return[s] the string obtained by replacing
      the leftmost non-overlapping occurrences of pattern in string by the
      replacement repl". It's the same thing, except for an optimisation:
      "empty matches are included in the result unless they touch the
      beginning of another match".
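
The rule quoted above can be sketched as a filter over the raw match list:
drop every empty match that starts exactly where the previous match ended.
This is an illustration under that assumption, not the code of Python's re
module, and the function name sub_skipping_adjacent_empty is hypothetical.
It is written by hand because, from Python 3.7 on, re.sub itself no longer
skips such empty matches, so calling re.sub directly would not reproduce the
2013 output.

```python
import re

def sub_skipping_adjacent_empty(pattern, wrap, text):
    """Replace matches with wrap(match_text), dropping any empty match
    that touches the end of the previous match."""
    out = []
    copied_up_to = 0   # everything before this offset is already in `out`
    prev_end = None    # end offset of the last match that was kept
    for m in re.finditer(pattern, text):
        if m.start() == m.end() and m.start() == prev_end:
            continue   # empty match glued to the previous match: skip it
        out.append(text[copied_up_to:m.start()])  # unmatched text in between
        out.append(wrap(m.group()))
        copied_up_to = prev_end = m.end()
    out.append(text[copied_up_to:])               # unmatched tail
    return "".join(out)

result = sub_skipping_adjacent_empty(r"[ac]*", lambda s: "(" + s + ")", "abc")
print(result)  # (a)b(c)
```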

      > (Note that adding “$” to the patterns doesn’t change anything.)
      >
      > Now, VimL works in the Perl way, except that “*” includes the empty
      > string if it is the last one in the string:
      >
      > echo substitute('abc', '[ac]*', '(\0)', 'g')
      > " returns “(a)()b(c)”, really “(ϵa)(ϵ)b(ϵcϵ)”

      Again the same thing, except the optimisation above is applied only
      at the end of the string.

      > As far as I’m concerned, I find the Perl way quite counter-intuitive,
      > but what I’m interested in here is whether VimL is consistent or not.
      > I.e., shouldn’t it work clearly one way or the other?

      You came up with the concept of "ϵ", you fix its limitations. :)

      My conclusion to the above comparison is that Vim should apply the
      same optimisation in full, that is, kill the empty matches that touch
      the beginning of another match. As far as I can tell, that would be
      safe for both the old and the new regexp engines.
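
To compare the three behaviours side by side, here is a minimal sketch of a
global substitution loop with an explicit position pointer. It only models
the outputs quoted in this thread; it is not how Perl, Python or Vim are
actually implemented. The function gsub and its drop parameter (with the
values "never", "always" and "at_end") are names invented for this example.

```python
import re

def gsub(pattern, text, wrap, drop="never"):
    """Global substitution with a position pointer.

    drop controls empty matches that touch the end of the previous match:
      "never"  -> keep them all (the Perl/Ruby output above)
      "always" -> drop them all (the 2013 Python output above)
      "at_end" -> drop only at the end of the string (the Vim output above)
    """
    rx = re.compile(pattern)
    out = []
    pos = 0
    prev_end = -1  # offset where the previous non-empty match ended
    while pos <= len(text):
        m = rx.match(text, pos)
        if m and m.end() > pos:
            # Non-empty match: emit the replacement and jump past it.
            out.append(wrap(m.group()))
            pos = prev_end = m.end()
            continue
        if m:
            # Empty match at pos: keep or drop it according to the policy.
            touches_previous = (pos == prev_end)
            dropped = touches_previous and (
                drop == "always" or (drop == "at_end" and pos == len(text))
            )
            if not dropped:
                out.append(wrap(""))
        if pos < len(text):
            out.append(text[pos])  # copy the unmatched character through
        pos += 1
    return "".join(out)

def paren(s):
    return "(" + s + ")"

print(gsub(r"[ac]*", "abc", paren, "never"))   # (a)()b(c)()
print(gsub(r"[ac]*", "abc", paren, "always"))  # (a)b(c)
print(gsub(r"[ac]*", "abc", paren, "at_end"))  # (a)()b(c)
```

In these terms, the proposal above amounts to moving Vim from "at_end" to
"always".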

      /lcd

      --
      --
      You received this message from the "vim_use" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php

      ---
      You received this message because you are subscribed to the Google Groups "vim_use" group.
      To unsubscribe from this group and stop receiving emails from it, send an email to vim_use+unsubscribe@....
      For more options, visit https://groups.google.com/groups/opt_out.
    • Paul Isambert
      Message 2 of 10 , Jun 18, 2013
        LCD 47 <lcd047@...> a écrit:
        > On 18 June 2013, Paul Isambert <zappathustra@...> wrote:
        > > Hello all,
        > >
        > > The following issue has been recently discussed on the Lua mailing list:
        > > http://lua-users.org/lists/lua-l/2013-04/msg00812.html
        > >
        > > (It has also been independently raised on the LuaTeX list:
        > > http://tug.org/pipermail/luatex/2013-June/004418.html)
        > >
        > > If I understand correctly, any string can be represented with
        > > interspersed empty substrings. E.g. “abc” is really “ϵaϵbϵcϵ”, where
        > > “ϵ” is the empty string. Now, there seem to be two ways to deal with
        > > those empty strings in regexps, especially regarding the “*” operator:
        >
        > You're making up a metaphysics of empty substrings. I humbly submit
        > that there is no such thing in the programming languages you mention
        > (don't know about Lua though).

        As I’ve already said, the empty strings were just meant to capture the
        differences between languages. I did not mean to imply that those
        substrings have any kind of reality.

        > > - The Perl way: “X*” matches as many “X” as possible, and does not
        > > include the following empty string.
        >
        > $ echo -n abc | perl -pe 's/[ac]*/($&)/g'
        > (a)()b(c)()
        >
        > The key to understanding this is to keep in mind that:
        >
        > (1) "*" is greedy; and
        > (2) "/g" is documented in perlre, together with "/c", as "Global
        > matching, and keep the Current position after failed matching."
        >
        > Try something like this if you want the gory details:
        >
        > $ echo -n abc | perl -Mre=debug -ne 's/[ac]*/($&)/g'
        >
        > > - The Python (or sed) way: “X*” matches as many “X” as possible, and
        > > includes the following empty string.
        > >
        > > Starting empty strings are always included. So, the Perl way gives (I
        > > use Ruby, since I can’t speak Perl):
        > >
        > > puts 'abc'.gsub(/[ac]*/, '(\0)')
        > > # returns “(a)()b(c)()”, really “(ϵa)(ϵ)b(ϵc)(ϵ)”
        >
        > Same thing with Ruby: there's a current position pointer, keeping
        > track of the current match.
        >
        > > And the Python way:
        > >
        > > import re
        > > print re.sub(re.compile('([ac]*)'), '(\\1)', 'abc')
        > > # returns “(a)b(c)”, really “(ϵaϵ)b(ϵcϵ)”
        >
        > With Python, re.sub() "return[s] the string obtained by replacing
        > the leftmost non-overlapping occurrences of pattern in string by the
        > replacement repl". It's the same thing, except for an optimisation:
        > "empty matches are included in the result unless they touch the
        > beginning of another match".
        >
        > > (Note that adding “$” to the patterns doesn’t change anything.)
        > >
        > > Now, VimL works in the Perl way, except that “*” includes the empty
        > > string if it is the last one in the string:
        > >
        > > echo substitute('abc', '[ac]*', '(\0)', 'g')
        > > " returns “(a)()b(c)”, really “(ϵa)(ϵ)b(ϵcϵ)”
        >
        > Again the same thing, except the optimisation above is applied only
        > at the end of the string.

        Yes. My question simply was: is it consistent to optimize only at the
        end?

        > > As far as I’m concerned, I find the Perl way quite counter-intuitive,
        > > but what I’m interested in here is whether VimL is consistent or not.
        > > I.e., shouldn’t it work clearly one way or the other?
        >
        > You came up with the concept of "ϵ", you fix its limitations. :)

        The “metaphysics of empty substrings”, the “concept of ϵ”... please, I
        know I’m French, but that doesn’t mean I subscribe to French Theory! :)

        > My conclusion to the above comparison is that Vim should apply the
        > same optimisation in full, that is, kill the empty matches that touch
        > the beginning of another match. As far as I can tell, that would be
        > safe for both the old and the new regexp engines.

        I prefer it that way too. But I’d prefer no optimization rather than
        conditional optimization, as is the case now.

        Best,
        Paul

      • Erik Christiansen
        Message 3 of 10 , Jun 19, 2013
          On 18.06.13 18:05, Paul Isambert wrote:
          > Mapping “/” to “/\v” (and, slightly more difficult, “:s/” to “:s/\v”)
          > is something I’ve thought about doing many times but have never done,
          > for some reason.

          Now that's a darned good idea!

          > I wish there were a “verymagic” option by default, I would have turned
          > it on a long time ago.

          + 1 million

          Quite a few years ago, I built Vim with a proper regex library from
          another FOSS project. It provided POSIX ERE behaviour, which worked
          beautifully, except that the Vim help broke. Maybe I should have found
          the time to debug that.

          Erik

          --
          Nowlan's Theory:
          He who hesitates is not only lost, but several miles from the next
          freeway exit.
