Loading ...
Sorry, an error occurred while loading the content.

Re: Understanding regxp implementation

Expand Messages
  • Nikolai Weibull
    ... Because that s how Henry Spencer did it. I don t know his reasoning, though. It s probably a case of trying to limit memory use and avoiding a
    Message 1 of 13 , Mar 20, 2007
    View Source
    • 0 Attachment
      On 3/20/07, Asiri Rathnayake <asiri.rathnayake@...> wrote:

      > I went through the regxp code and have a few questions...
      >
      > First, Why use this kind of a coding scheme and encode patterns rather
      > than using a state diagram ? ( Performance/Memory ? ).

      Because that's how Henry Spencer did it. I don't know his reasoning,
      though. It's probably a case of trying to limit memory use and
      avoiding a conceptually more complex data structure, but not exactly
      sure. You can find his original sources here:

      http://arglist.com/regex/

      > Secondly, is it a
      > requirement that the new implementation has to follow the same method ?
      > I mean, can't I use a state diagram ( which is easy to implement in my
      > opinion ) to simulate the NFA ?

      I thought the point was that you can use whatever method you want?
      And that you might use an already existing library, like TRE.

      In TRE 0.7.2 (I think - might have been 0.7.0), Ville added support
      for more general input. This means that you can provide an object
      that feeds the regex matcher with input. In Vim's case that would be
      feeding it the contents of successive memlines until a match is found
      or the buffer is "depleted". That should be a quite simple way of
      using TRE in Vim.

      The main problem with using TRE, however, is that it only works with
      "ASCII" or wide character input. That means that it isn't well-suited
      for Vim's internal buffer format of just keeping bytes around in
      whatever encoding they may be. To use TRE you'd have to transcode the
      buffer's bytes to wchar_t's (wide characters that are hopefully
      interpreted as Unicode characters by the standard C library on the
      system we're running on), depending on the buffers encoding.

      Perhaps working on a Vim-specific matcher for TRE would be valuable.

      I would suggest that you subscribe to the TRE mailing list. I haven't
      asked Ville what editor he prefers, but considering his indentation
      style, it's probably Emacs :-(. But perhaps he's willing to help us
      out anyway?

      http://laurikari.net/mailman/listinfo/tre-general

      nikolai
    • Asiri Rathnayake
      This mail bounced off for some reason, I m repeating it. Sorry if you ve already got this. ... Hi Bram, Nikolai, All, I think the best way to understand
      Message 2 of 13 , Mar 21, 2007
      View Source
      • 0 Attachment
        This mail bounced off for some reason, I'm repeating it. Sorry if you've
        already got this.

        -----------------

        Hi Bram, Nikolai, All,

        I think the best way to understand current implementation of regxp is to
        first go through Henry Spencer's original regxp implementation ( thanks
        nikolai ). It's very compact and easy to mess with. After that, I think
        I
        would be able to come up with a hack to include new code into vim's
        regxp
        code base ( either TRE or our own implementation ). By the way, I have
        decided to apply for the GSoc ( I beleive I can crack this with the help
        of
        all of you ). I'm really greateful for the help provided by Nikolai and
        Bram, Thank you very much!

        ps : even if I'm not selected, I would still like to contribute to this
        project... :-)

        - Asiri
      • Asiri Rathnayake
        Nikolai, As you might know, the reg_comp() method is called twice when compiling a r.e; first to determine the size of the compiled expression and then to
        Message 3 of 13 , Mar 22, 2007
        View Source
        • 0 Attachment
          Nikolai,

          As you might know, the reg_comp() method is called twice when compiling
          a r.e; first to determine the size of the compiled expression and then
          to actually compile it. I was thinking if this can be used to our
          advantage, while on the first pass, we look for occurrences of special
          characters and set a flag in regprog_T appropriately. If such thing was
          not found, we branch off the second pass into one of our own routines to
          compile the expression into our own structures (say, a state diagram).
          And we have to change other functions a bit to look for the above flag
          and call new routines appropriately. What do you think ?

          - Asiri
        • Nikolai Weibull
          ... That sounds like a good way of determining whether the old engine will be required or if a new one (with more limited functionality) should be used. One
          Message 4 of 13 , Mar 22, 2007
          View Source
          • 0 Attachment
            On 3/22/07, Asiri Rathnayake <asiri.rathnayake@...> wrote:

            > As you might know, the reg_comp() method is called twice when compiling
            > a r.e; first to determine the size of the compiled expression and then
            > to actually compile it. I was thinking if this can be used to our
            > advantage, while on the first pass, we look for occurrences of special
            > characters and set a flag in regprog_T appropriately. If such thing was
            > not found, we branch off the second pass into one of our own routines to
            > compile the expression into our own structures (say, a state diagram).
            > And we have to change other functions a bit to look for the above flag
            > and call new routines appropriately. What do you think ?

            That sounds like a good way of determining whether the old engine will
            be required or if a new one (with more "limited" functionality) should
            be used. One way of keeping this information as local as possible
            would be to keep a set of function pointers with the compiled regex
            that point to the appropriate functions to execute them on some input.

            For example, you could have something like this:

            typedef struct
            {
            int (*exec)();
            int regstart;
            char_u reganch;
            char_u *regmust;
            int regmlen;
            unsigned regflags;
            char_u reghasz;
            char_u program[1]; /* actually longer.. */
            } regprog_T;

            and change vim_regexec() to call the exec() function of the regprog_T
            in the regmatch_T that it gets passed. You'd then set exec() to point
            to either vim_old_regexec() or vim_new_regexec() (or similarly named
            functions) in vim_regcomp() depending on the type of regex we have. I
            guess it could be some flag field as well, but this makes it possible
            to add a third matcher, should we so desire...like a
            Boyer-Moore-Horspool matcher for fixed strings.

            nikolai
          • Asiri Rathnayake
            ... Yes, this is more flexible. thanks. - Asiri
            Message 5 of 13 , Mar 22, 2007
            View Source
            • 0 Attachment
              On Thu, 2007-03-22 at 09:26 +0100, Nikolai Weibull wrote:
              > On 3/22/07, Asiri Rathnayake <asiri.rathnayake@...> wrote:
              >
              > > As you might know, the reg_comp() method is called twice when compiling
              > > a r.e; first to determine the size of the compiled expression and then
              > > to actually compile it. I was thinking if this can be used to our
              > > advantage, while on the first pass, we look for occurrences of special
              > > characters and set a flag in regprog_T appropriately. If such thing was
              > > not found, we branch off the second pass into one of our own routines to
              > > compile the expression into our own structures (say, a state diagram).
              > > And we have to change other functions a bit to look for the above flag
              > > and call new routines appropriately. What do you think ?
              >
              > That sounds like a good way of determining whether the old engine will
              > be required or if a new one (with more "limited" functionality) should
              > be used. One way of keeping this information as local as possible
              > would be to keep a set of function pointers with the compiled regex
              > that point to the appropriate functions to execute them on some input.
              >
              > For example, you could have something like this:
              >
              > typedef struct
              > {
              > int (*exec)();
              > int regstart;
              > char_u reganch;
              > char_u *regmust;
              > int regmlen;
              > unsigned regflags;
              > char_u reghasz;
              > char_u program[1]; /* actually longer.. */
              > } regprog_T;
              >
              > and change vim_regexec() to call the exec() function of the regprog_T
              > in the regmatch_T that it gets passed. You'd then set exec() to point
              > to either vim_old_regexec() or vim_new_regexec() (or similarly named
              > functions) in vim_regcomp() depending on the type of regex we have. I
              > guess it could be some flag field as well, but this makes it possible
              > to add a third matcher, should we so desire...like a
              > Boyer-Moore-Horspool matcher for fixed strings.

              Yes, this is more flexible. thanks.

              - Asiri

              > nikolai
            • Bram Moolenaar
              ... Adding a third matcher won t happen soon, and is a big change. It s not really needed to prepare for that. The disadvantage of using a function pointer is
              Message 6 of 13 , Mar 22, 2007
              View Source
              • 0 Attachment
                Nikolai Weibull wrote:

                > On 3/22/07, Asiri Rathnayake <asiri.rathnayake@...> wrote:
                >
                > > As you might know, the reg_comp() method is called twice when compiling
                > > a r.e; first to determine the size of the compiled expression and then
                > > to actually compile it. I was thinking if this can be used to our
                > > advantage, while on the first pass, we look for occurrences of special
                > > characters and set a flag in regprog_T appropriately. If such thing was
                > > not found, we branch off the second pass into one of our own routines to
                > > compile the expression into our own structures (say, a state diagram).
                > > And we have to change other functions a bit to look for the above flag
                > > and call new routines appropriately. What do you think ?
                >
                > That sounds like a good way of determining whether the old engine will
                > be required or if a new one (with more "limited" functionality) should
                > be used. One way of keeping this information as local as possible
                > would be to keep a set of function pointers with the compiled regex
                > that point to the appropriate functions to execute them on some input.
                >
                > For example, you could have something like this:
                >
                > typedef struct
                > {
                > int (*exec)();
                > int regstart;
                > char_u reganch;
                > char_u *regmust;
                > int regmlen;
                > unsigned regflags;
                > char_u reghasz;
                > char_u program[1]; /* actually longer.. */
                > } regprog_T;
                >
                > and change vim_regexec() to call the exec() function of the regprog_T
                > in the regmatch_T that it gets passed. You'd then set exec() to point
                > to either vim_old_regexec() or vim_new_regexec() (or similarly named
                > functions) in vim_regcomp() depending on the type of regex we have. I
                > guess it could be some flag field as well, but this makes it possible
                > to add a third matcher, should we so desire...like a
                > Boyer-Moore-Horspool matcher for fixed strings.

                Adding a third matcher won't happen soon, and is a big change. It's not
                really needed to prepare for that.

                The disadvantage of using a function pointer is that in the place where
                it's used you only see:

                myprog->exec(args);

                You can't see which function is being called, and finding out is not
                that simple. So when you browse the code this is like a dead end.

                Using this keeps navigating much simpler:

                if (myprog->difficultregexp)
                regmatch_old(args);
                else
                regmatch_new(args);

                One reason why inheritance in object oriented programming makes our life
                more difficult is that you quite often don't know for sure which method
                is invoked.

                --
                ROBIN: The what?
                ARTHUR: The Holy Hand Grenade of Antioch. 'Tis one of the sacred relics
                Brother Maynard always carries with him.
                ALL: Yes. Of course.
                ARTHUR: (shouting) Bring up the Holy Hand Grenade!
                "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

                /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
                /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
                \\\ download, build and distribute -- http://www.A-A-P.org ///
                \\\ help me help AIDS victims -- http://ICCF-Holland.org ///
              • Asiri Rathnayake
                ... Actually this is something I loved about OOP, you don t have to worry about which method is called (assuming the correct one gets called). But given the
                Message 7 of 13 , Mar 22, 2007
                View Source
                • 0 Attachment
                  On Thu, 2007-03-22 at 22:21 +0100, Bram Moolenaar wrote:

                  >
                  > Adding a third matcher won't happen soon, and is a big change. It's not
                  > really needed to prepare for that.
                  >
                  > The disadvantage of using a function pointer is that in the place where
                  > it's used you only see:
                  >
                  > myprog->exec(args);
                  >
                  > You can't see which function is being called, and finding out is not
                  > that simple. So when you browse the code this is like a dead end.
                  >
                  > Using this keeps navigating much simpler:
                  >
                  > if (myprog->difficultregexp)
                  > regmatch_old(args);
                  > else
                  > regmatch_new(args);
                  >
                  > One reason why inheritance in object oriented programming makes our life
                  > more difficult is that you quite often don't know for sure which method
                  > is invoked.

                  Actually this is something I loved about OOP, you don't have to worry
                  about which method is called (assuming the correct one gets called). But
                  given the complexity of the existing code, adding more complications
                  would be a disaster as you say.

                  - Asiri
                Your message has been successfully submitted and would be delivered to recipients shortly.