Dealing with empty strings in regexp.

  • Paul Isambert
    Message 1 of 10, Jun 18, 2013
      Hello all,

      The following issue has been recently discussed on the Lua mailing list:
      http://lua-users.org/lists/lua-l/2013-04/msg00812.html

      (It has also been independently raised on the LuaTeX list:
      http://tug.org/pipermail/luatex/2013-June/004418.html)

      If I understand correctly, any string can be represented with
      interspersed empty substrings. E.g. “abc” is really “ϵaϵbϵcϵ”, where
      “ϵ” is the empty string. Now, there seem to be two ways to deal with
      those empty strings in regexps, especially regarding the “*” operator:

      - The Perl way: “X*” matches as many “X” as possible, and does not
      include the following empty string.
      - The Python (or sed) way: “X*” matches as many “X” as possible, and
      includes the following empty string.
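
The “ϵaϵbϵcϵ” picture can be checked with a minimal Python sketch (the variable names are illustrative, not taken from either thread): the empty pattern matches once at every boundary of the string, including both ends.

```python
import re

text = "abc"

# The empty pattern matches at every boundary: before "a",
# between each pair of characters, and after "c".
positions = [m.start() for m in re.finditer("", text)]
print(positions)

# str.count agrees: a string of length n contains n + 1 empty substrings.
print(text.count(""))
```

This only shows where an empty match is possible; which of those positions a global substitution actually uses is the point on which the languages differ.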

      Starting empty strings are always included. So, the Perl way gives (I
      use Ruby, since I can’t speak Perl):

      puts 'abc'.gsub(/[ac]*/, '(\0)')
      # returns “(a)()b(c)()”, really “(ϵa)(ϵ)b(ϵc)(ϵ)”

      And the Python way:

      import re
      print re.sub(re.compile('(a*)'), '(\\1)', 'abc')
      # returns “(a)b(c)”, really “(ϵaϵ)b(ϵcϵ)”

      (Note that adding “$” to the patterns doesn’t change anything.)

      Now, VimL works in the Perl way, except that “*” includes the empty
      string if it is the last one in the string:

      echo substitute('abc', '[ac]*', '(\0)', 'g')
      " returns “(a)()b(c)”, really “(ϵa)(ϵ)b(ϵcϵ)”

      As far as I’m concerned, I find the Perl way quite counter-intuitive,
      but what I’m interested in here is whether VimL is consistent or not.
      I.e., shouldn’t it work clearly one way or the other?

      Best,
      Paul

      --
      --
      You received this message from the "vim_use" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php

      ---
      You received this message because you are subscribed to the Google Groups "vim_use" group.
      To unsubscribe from this group and stop receiving emails from it, send an email to vim_use+unsubscribe@....
      For more options, visit https://groups.google.com/groups/opt_out.
    • Paul Isambert
      Message 2 of 10, Jun 18, 2013
        Sorry, this

        > print re.sub(re.compile('(a*)'), '(\\1)', 'abc')

        should be

        print re.sub(re.compile('([ac]*)'), '(\\1)', 'abc')

        Paul

      • John Little
        Message 3 of 10, Jun 18, 2013
          On Tuesday, June 18, 2013 11:19:43 PM UTC+12, Paul Isambert wrote:

          > I.e., shouldn’t it work clearly one way or the other?

          I don't understand this "interspersed empty substrings" way of looking at regexes; I suspect that it doesn't make sense some of the time, and is not useful, but my suspicions may obviously stem from my incomprehension.

          A pattern like [ac]* on its own matches everywhere; so vim does the substitution everywhere. Why is that not intuitive? Anyway, as I see it, vim is consistent.

          Doing substitutions with a pattern that matches the empty string is not useful, in real editing tasks it's not what is wanted. One is always trying to match *something*.

          Regards, John Little

        • Paul Isambert
          Message 4 of 10, Jun 18, 2013
            John Little <John.B.Little@...> wrote:
            > On Tuesday, June 18, 2013 11:19:43 PM UTC+12, Paul Isambert wrote:
            >
            > > I.e., shouldn’t it work clearly one way or the other?
            >
            > I don't understand this "interspersed empty substrings" way of
            > looking at regexes; I suspect that it doesn't make sense some of the
            > time, and is not useful, but my suspicions may obviously stem from my
            > incomprehension.

            The empty substrings are only a means to account for the difference
            between Perl-like and Python-like languages; it is interesting only
            inasmuch as it achieves that, and shouldn’t be extended to understand
            regexps any further. (Dirk Laurie formalizes that with open/closed
            intervals here: http://lua-users.org/lists/lua-l/2013-04/msg00869.html.)

            > A pattern like [ac]* on its own matches everywhere; so vim does the
            > substitution everywhere. Why is that not intuitive? Anyway, as I see
            > it, vim is consistent.

            The issue is what “everywhere” means. Perl-like languages include
            “just after a successful match”, hence “(a)()b(c)()”, Python-like
            ones do not, hence “(a)b(c)”. The presumed inconsistency in VimL is
            that it includes “just after a successful match”, unless we’re at the
            end of the string, hence “(a)()b(c)”.
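
The three readings of “everywhere” can be compared side by side with a small Python sketch. It is a hand-written scanning loop, not how any of these engines is implemented: the function name `gsub_sketch` and the policy labels are hypothetical, the "python" label stands for the behaviour reported in this thread, and the loop assumes that the match found at a position is the longest one available there (true for a greedy pattern such as `[ac]*`).

```python
import re


def gsub_sketch(pattern, text, policy):
    """Wrap every match of pattern in parentheses under a given policy.

    policy "perl":   keep every empty match.
    policy "python": drop an empty match that touches the previous match.
    policy "vim":    drop such an empty match only at the end of the string.
    """
    regex = re.compile(pattern)
    out = []
    pos = 0
    prev_end = -1  # end of the previous non-empty match, if any
    while pos <= len(text):
        m = regex.match(text, pos)  # try to match exactly at pos
        if m is not None:
            is_empty = m.end() == m.start()
            touches_previous = is_empty and pos == prev_end
            if policy == "perl":
                drop = False
            elif policy == "python":
                drop = touches_previous
            elif policy == "vim":
                drop = touches_previous and pos == len(text)
            else:
                raise ValueError("unknown policy: " + policy)
            if not drop:
                out.append("(" + m.group(0) + ")")
            if not is_empty:
                prev_end = m.end()
                pos = m.end()
                continue
        # Empty match or no match here: copy one character and move on.
        out.append(text[pos:pos + 1])
        pos += 1
    return "".join(out)


for name in ("perl", "python", "vim"):
    print(name, gsub_sketch(r"[ac]*", "abc", name))
```

The only difference between the three branches is the condition under which an empty match touching the previous match is dropped, which is exactly the point under discussion.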

            > Doing substitutions with a pattern that matches the empty string is
            > not useful, in real editing tasks it's not what is wanted. One is
            > always trying to match *something*.

            The “*” operator should be banned, then!

            Best,
            Paul

          • Erik Christiansen
            Message 5 of 10, Jun 18, 2013
              On 18.06.13 14:51, Paul Isambert wrote:
              > The “*” operator should be banned, then!

              Does the problem with matching empty strings arise from using "*" when
              "+" should be used instead? You are presumably aware that¹:

              * = 0 or more of the preceding atom.
              + = 1 or more of the preceding atom.

              Thus "(a|b)+" means one or more a or b characters, and cannot match the
              empty string. Use "*" instead, and you've instructed it to also match "".
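
The difference between the two quantifiers can be checked with a short Python sketch (Python's `re` syntax is used here for illustration; the names `star` and `plus` are arbitrary):

```python
import re

star = re.compile(r"(a|b)*")
plus = re.compile(r"(a|b)+")

# "*" accepts zero repetitions, so it matches the empty string.
print(star.fullmatch("") is not None)

# "+" needs at least one "a" or "b", so the empty string is rejected.
print(plus.fullmatch("") is not None)

# Both accept a non-empty run of a/b characters.
print(star.fullmatch("abba") is not None, plus.fullmatch("abba") is not None)
```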

              There are many regex dialects - enough to fill a fat O'Reilly book, and
              enough to make anyone's head hurt. One way to minimise the confusion is
              to cultivate fluency in one dialect, and eschew the others.

              Having long ago found posix BREs annoyingly full of superfluous
              backslashes, I've settled for the more concise and powerful posix EREs.
              Also, "man 7 regex" agrees that BREs are obsolete. (To get away from
              obsolete regexes in vim, prefix regexes with "\v". That is a good
              approximation of posix EREs, and so is consistent with many *nix
              utilities, so you can effortlessly switch from awk, bash, egrep,
              procmail, etc, etc, to vim with "\v".)

              Erik

              ¹ In posix EREs, and most others, though in some vim modes, "+" isn't
              "magic". Those obsolete regex modes are worth avoiding.

              --
              Leibowitz's Rule:
              When hammering a nail, you will never hit your finger if you hold the
              hammer with both hands.

            • Ben Fritz
              Message 6 of 10, Jun 18, 2013
                On Tuesday, June 18, 2013 7:51:06 AM UTC-5, Paul Isambert wrote:
                > > Doing substitutions with a pattern that matches the empty string is
                > > not useful, in real editing tasks it's not what is wanted. One is
                > > always trying to match *something*.
                >
                > The “*” operator should be banned, then!

                The * should not be used by itself. But it is very useful in combination with other stuff.

                For example, I use a tool which generates variable names automatically from graphically-created GUI widgets. I'm sometimes not sure whether there is a single _, two __, or none at all between two parts of a variable name, so I'll search for something like "firstpart_*secondpart". Other examples (not even auto-generated) are when there might be a word in between, like "firstpart\w*secondpart".
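
The two searches above can be tried out in a minimal Python sketch. The variable names in the list are made up for illustration, and Python's `re` is used in place of Vim's search; `*` and `\w` are written the same way in both for these patterns.

```python
import re

# Hypothetical auto-generated names; only the middle part differs.
names = [
    "firstpartsecondpart",
    "firstpart_secondpart",
    "firstpart__secondpart",
    "firstpart_x_secondpart",
]

# Zero or more underscores between the two parts.
underscores = re.compile(r"firstpart_*secondpart")
with_underscores = [n for n in names if underscores.search(n)]
print(with_underscores)

# Zero or more word characters (letters, digits, underscore) in between.
word_between = re.compile(r"firstpart\w*secondpart")
with_word = [n for n in names if word_between.search(n)]
print(with_word)
```

In both patterns the `*` may match nothing, but the surrounding literal text guarantees that the overall match is never empty.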

              • Paul Isambert
                Message 7 of 10, Jun 18, 2013
                  Erik Christiansen <dvalin@...> wrote:
                  > On 18.06.13 14:51, Paul Isambert wrote:
                  > > The “*” operator should be banned, then!
                  >
                  > Does the problem with matching empty strings arise from using "*" when
                  > "+" should be used instead? You are presumably aware that¹:
                  >
                  > * = 0 or more of the preceding atom.
                  > + = 1 or more of the preceding atom.
                  >
                  > Thus "(a|b)+" means one or more a or b characters, and cannot match the
                  > empty string. Use "*" instead, and you've instructed it to also match "".

                  I’ve no problem with the empty string in itself. Rather, my original
                  question was to know where there was an empty string, and whether VimL
                  follows Perl or Python in that respect. Of course I use “+” when
                  necessary.

                  > There are many regex dialects - enough to fill a fat O'Reilly book, and
                  > enough to make anyone's head hurt. One way to minimise the confusion is
                  > to cultivate fluency in one dialect, and eschew the others.
                  >
                  > Having long ago found posix BREs annoyingly full of superfluous
                  > backslashes, I've settled for the more concise and powerful posix EREs.
                  > Also, "man 7 regex" agrees that BREs are obsolete. (To get away from
                  > obsolete regexes in vim, prefix regexes with "\v". That is a good
                  > approximation of posix EREs, and so is consistent with many *nix
                  > utilities, so you can effortlessly switch from awk, bash, egrep,
                  > procmail, etc, etc, to vim with "\v".)

                  Mapping “/” to “/\v” (and, slightly more difficult, “:s/” to “:s/\v”)
                  is something I’ve thought about doing many times but have never done,
                  for some reason. I wish there were a “verymagic” option by default, I
                  would have turned it on a long time ago.

                  Best,
                  Paul

                • LCD 47
                  Message 8 of 10, Jun 18, 2013
                    On 18 June 2013, Paul Isambert <zappathustra@...> wrote:
                    > Hello all,
                    >
                    > The following issue has been recently discussed on the Lua mailing list:
                    > http://lua-users.org/lists/lua-l/2013-04/msg00812.html
                    >
                    > (It has also been independently raised on the LuaTeX list:
                    > http://tug.org/pipermail/luatex/2013-June/004418.html)
                    >
                    > If I understand correctly, any string can be represented with
                    > interspersed empty substrings. E.g. “abc” is really “ϵaϵbϵcϵ”, where
                    > “ϵ” is the empty string. Now, there seem to be two ways to deal with
                    > those empty strings in regexps, especially regarding the “*” operator:

                    You're making up a metaphysics of empty substrings. I humbly submit
                    that there is no such thing in the programming languages you mention
                    (don't know about Lua though).

                    > - The Perl way: “X*” matches as many “X” as possible, and does not
                    > include the following empty string.

                    $ echo -n abc | perl -pe 's/[ac]*/($&)/g'
                    (a)()b(c)()

                    The key to understanding this is to keep in mind that:

                    (1) "*" is greedy; and
                    (2) "/g" means global matching; perlre documents the "g" and "c"
                    flags together as "Global matching, and keep the Current position
                    after failed matching."

                    Try something like this if you want the gory details:

                    $ echo -n abc | perl -Mre=debug -ne 's/[ac]*/($&)/g'

                    > - The Python (or sed) way: “X*” matches as many “X” as possible, and
                    > includes the following empty string.
                    >
                    > Starting empty strings are always included. So, the Perl way gives (I
                    > use Ruby, since I can’t speak Perl):
                    >
                    > puts 'abc'.gsub(/[ac]*/, '(\0)')
                    > # returns “(a)()b(c)()”, really “(ϵa)(ϵ)b(ϵc)(ϵ)”

                    Same thing with Ruby: there's a current position pointer, keeping
                    track of the current match.

                    > And the Python way:
                    >
                    > import re
                    > print re.sub(re.compile('(a*)'), '(\\1)', 'abc')
                    > # returns “(a)b(c)”, really “(ϵaϵ)b(ϵcϵ)”

                    With Python, re.sub() "return[s] the string obtained by replacing
                    the leftmost non-overlapping occurrences of pattern in string by the
                    replacement repl". It's the same thing, except for an optimisation:
                    "empty matches are included in the result unless they touch the
                    beginning of another match".
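
A hedged note on versions, with a sketch that runs on any Python 3: the documentation quoted above describes the `re` module as it was at the time of this thread. The `re` documentation for Python 3.7 and later states that empty matches are also replaced when adjacent to a previous non-empty match, so the same call is expected to give the Perl-style result there. The variable names below are illustrative.

```python
import re

result = re.sub(r"([ac]*)", r"(\1)", "abc")
print(result)

# Result reported in this thread: empty matches touching a previous
# match are skipped.
thread_era = "(a)b(c)"

# Perl-style result, which Python 3.7 and later are documented to produce.
perl_style = "(a)()b(c)()"

print(result == thread_era, result == perl_style)
```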

                    > (Note that adding “$” to the patterns doesn’t change anything.)
                    >
                    > Now, VimL works in the Perl way, except that “*” includes the empty
                    > string if it is the last one in the string:
                    >
                    > echo substitute('abc', '[ac]*', '(\0)', 'g')
                    > " returns “(a)()b(c)”, really “(ϵa)(ϵ)b(ϵcϵ)”

                    Again the same thing, except the optimisation above is applied only
                    at the end of the string.

                    > As far as I’m concerned, I find the Perl way quite counter-intuitive,
                    > but what I’m interested in here is whether VimL is consistent or not.
                    > I.e., shouldn’t it work clearly one way or the other?

                    You came up with the concept of "ϵ", you fix its limitations. :)

                    My conclusion to the above comparison is that Vim should apply the
                    same optimisation in full, that is, kill the empty matches that touch
                    the beginning of another match. As far as I can tell, that would be
                    safe for both the old and the new regexp engines.

                    /lcd

                  • Paul Isambert
                    Message 9 of 10, Jun 18, 2013
                      LCD 47 <lcd047@...> wrote:
                      > On 18 June 2013, Paul Isambert <zappathustra@...> wrote:
                      > > Hello all,
                      > >
                      > > The following issue has been recently discussed on the Lua mailing list:
                      > > http://lua-users.org/lists/lua-l/2013-04/msg00812.html
                      > >
                      > > (It has also been independently raised on the LuaTeX list:
                      > > http://tug.org/pipermail/luatex/2013-June/004418.html)
                      > >
                      > > If I understand correctly, any string can be represented with
                      > > interspersed empty substrings. E.g. “abc” is really “ϵaϵbϵcϵ”, where
                      > > “ϵ” is the empty string. Now, there seem to be two ways to deal with
                      > > those empty strings in regexps, especially regarding the “*” operator:
                      >
                      > You're making up a metaphysics of empty substrings. I humbly submit
                      > that there is no such thing in the programming languages you mention
                      > (don't know about Lua though).

                      As I’ve already said, the empty strings were just meant to capture the
                      differences between languages. I did not mean to imply that those
                      substrings have any kind of reality.

                      > > - The Perl way: “X*” matches as many “X” as possible, and does not
                      > > include the following empty string.
                      >
                      > $ echo -n abc | perl -pe 's/[ac]*/($&)/g'
                      > (a)()b(c)()
                      >
                      > The key to understanding this is to keep in mind that:
                      >
                      > (1) "*" is greedy; and
                      > (2) "/g" means global matching; perlre documents the "g" and "c"
                      > flags together as "Global matching, and keep the Current position
                      > after failed matching."
                      >
                      > Try something like this if you want the gory details:
                      >
                      > $ echo -n abc | perl -Mre=debug -ne 's/[ac]*/($&)/g'
                      >
                      > > - The Python (or sed) way: “X*” matches as many “X” as possible, and
                      > > includes the following empty string.
                      > >
                      > > Starting empty strings are always included. So, the Perl way gives (I
                      > > use Ruby, since I can’t speak Perl):
                      > >
                      > > puts 'abc'.gsub(/[ac]*/, '(\0)')
                      > > # returns “(a)()b(c)()”, really “(ϵa)(ϵ)b(ϵc)(ϵ)”
                      >
                      > Same thing with Ruby: there's a current position pointer, keeping
                      > track of the current match.
                      >
                      > > And the Python way:
                      > >
                      > > import re
                      > > print re.sub(re.compile('(a*)'), '(\\1)', 'abc')
                      > > # returns “(a)b(c)”, really “(ϵaϵ)b(ϵcϵ)”
                      >
                      > With Python, re.sub() "return[s] the string obtained by replacing
                      > the leftmost non-overlapping occurrences of pattern in string by the
                      > replacement repl". It's the same thing, except for an optimisation:
                      > "empty matches are included in the result unless they touch the
                      > beginning of another match".
                      >
                      > > (Note that adding “$” to the patterns doesn’t change anything.)
                      > >
                      > > Now, VimL works in the Perl way, except that “*” includes the empty
                      > > string if it is the last one in the string:
                      > >
                      > > echo substitute('abc', '[ac]*', '(\0)', 'g')
                      > > " returns “(a)()b(c)”, really “(ϵa)(ϵ)b(ϵcϵ)”
                      >
                      > Again the same thing, except the optimisation above is applied only
                      > at the end of the string.

                      Yes. My question simply was: is it consistent to optimize only at the
                      end?

                      > > As far as I’m concerned, I find the Perl way quite counter-intuitive,
                      > > but what I’m interested in here is whether VimL is consistent or not.
                      > > I.e., shouldn’t it work clearly one way or the other?
                      >
                      > You came up with the concept of "ϵ", you fix its limitations. :)

                      The “metaphysics of empty substrings”, the “concept of ϵ”... please, I
                      know I’m French, but that doesn’t mean I subscribe to French Theory! :)

                      > My conclusion to the above comparison is that Vim should apply the
                      > same optimisation in full, that is, kill the empty matches that touch
                      > the beginning of another match. As far as I can tell, that would be
                      > safe for both the old and the new regexp engines.

                      I prefer it that way too. But I’d prefer no optimization rather than
                      conditional optimization, as is the case now.

                      Best,
                      Paul

                    • Erik Christiansen
                      Message 10 of 10, Jun 19, 2013
                        On 18.06.13 18:05, Paul Isambert wrote:
                        > Mapping “/” to “/\v” (and, slightly more difficult, “:s/” to “:s/\v”)
                        > is something I’ve thought about doing many times but have never done,
                        > for some reason.

                        Now that's a darned good idea!

                        > I wish there were a “verymagic” option by default, I would have turned
                        > it on a long time ago.

                        + 1 million

                        Quite a few years ago, I built Vim with a proper regex library from
                        another FOSS project. It provided posix ERE behaviour, which worked
                        beautifully, except that the Vim help broke. Maybe I should have found
                        the time to debug that.

                        Erik

                        --
                        Nowlan's Theory:
                        He who hesitates is not only lost, but several miles from the next
                        freeway exit.
