Loading ...
Sorry, an error occurred while loading the content.

GSoC Regexp engine

Expand Messages
  • Ian Young
    Hi all, I m Ian, one of the two students working on improving the regexp engine in Vim for this year s Google Summer of Code. I haven t had a whole lot to
    Message 1 of 7 , May 31, 2007
    • 0 Attachment
      Hi all,
      I'm Ian, one of the two students working on improving the regexp
      engine in Vim for this year's Google Summer of Code. I haven't had a
      whole lot to contribute as of yet, but now that work is underway, I'll
      probably pop up here asking lots of questions some days.

      Right now we're working on getting things set up and building a
      testing suite, but I thought I would spark some discussion on a design
      decision that will be coming up after we finish this phase, which is
      whether to implement the new model ourselves, or use an alternative
      engine, like TRE: <http://laurikari.net/tre/>. I'm tempted to
      implement one ourselves, as it's an intellectually stimulating
      prospect, but that doesn't mean I won't listen to reason if TRE or
      another option is far better. I don't know much about the internals of
      TRE, but according to previous posts to this list, it utilizes three
      engines: a slow one for handling backreferences (presumably similar to
      Vim's current engine), a fast one for most cases (what we are looking
      to implement), and one for their 'fuzzy matching' feature.

      I have a couple questions to start things off. First: I couldn't see
      much need for 'fuzzy matching' in Vim, but some of you are probably
      much better acquainted with regexp use cases than I am. Would this be
      a useful feature to have available? Second: We might have to do some
      gymnastics to work with multibyte characters, as discussed here: <
      http://tech.groups.yahoo.com/group/vimdev/message/46408>. I haven't
      worked with multibyte characters before, so I'm not clear on the
      subtleties. Would this translation to wide characters before passing
      to the engine cause much of a performance hit and/or be excessively
      complicated to implement? On a side note, TRE's main page says it has
      both wide character and multibyte character support. I couldn't find a
      version history, so I'm not sure if this is a new feature that Nikolai
      isn't aware of, or if we need something more.

      I'm interested to hear what you all have to say. We don't need to make
      this decision until middle of next week at the earliest, but I thought
      I would get the discussion going now.

      Ian
    • Brian Gupta
      I have also heard good things about the PCRE (Perl Compatible Regex Library). You may want to consider it as an option. http://www.pcre.org/ -Brian
      Message 2 of 7 , May 31, 2007
      • 0 Attachment
        I have also heard good things about the PCRE (Perl Compatible Regex
        Library). You may want to consider it as an option.

        http://www.pcre.org/

        -Brian

        On 5/31/07, Ian Young <ian.greenleaf@...> wrote:
        > Hi all,
        > I'm Ian, one of the two students working on improving the regexp
        > engine in Vim for this year's Google Summer of Code. I haven't had a
        > whole lot to contribute as of yet, but now that work is underway, I'll
        > probably pop up here asking lots of questions some days.
        >
        > Right now we're working on getting things set up and building a
        > testing suite, but I thought I would spark some discussion on a design
        > decision that will be coming up after we finish this phase, which is
        > whether to implement the new model ourselves, or use an alternative
        > engine, like TRE: <http://laurikari.net/tre/>. I'm tempted to
        > implement one ourselves, as it's an intellectually stimulating
        > prospect, but that doesn't mean I won't listen to reason if TRE or
        > another option is far better. I don't know much about the internals of
        > TRE, but according to previous posts to this list, it utilizes three
        > engines: a slow one for handling backreferences (presumably similar to
        > Vim's current engine), a fast one for most cases (what we are looking
        > to implement), and one for their 'fuzzy matching' feature.
        >
        > I have a couple questions to start things off. First: I couldn't see
        > much need for 'fuzzy matching' in Vim, but some of you are probably
        > much better acquainted with regexp use cases than I am. Would this be
        > a useful feature to have available? Second: We might have to do some
        > gymnastics to work with multibyte characters, as discussed here: <
        > http://tech.groups.yahoo.com/group/vimdev/message/46408>. I haven't
        > worked with multibyte characters before, so I'm not clear on the
        > subtleties. Would this translation to wide characters before passing
        > to the engine cause much of a performance hit and/or be excessively
        > complicated to implement? On a side note, TRE's main page says it has
        > both wide character and multibyte character support. I couldn't find a
        > version history, so I'm not sure if this is a new feature that Nikolai
        > isn't aware of, or if we need something more.
        >
        > I'm interested to hear what you all have to say. We don't need to make
        > this decision until middle of next week at the earliest, but I thought
        > I would get the discussion going now.
        >
        > Ian
        >
      • Nikolai Weibull
        ... PCRE is crap. It is crap, because it uses the same, crappy, backtracking method that Vim, and most other crappy regex (note: not regular expression)
        Message 3 of 7 , May 31, 2007
        • 0 Attachment
          On 5/31/07, Brian Gupta <brian.gupta@...> wrote:
          > I have also heard good things about the PCRE (Perl Compatible Regex
          > Library). You may want to consider it as an option.

          PCRE is crap.

          It is crap, because it uses the same, crappy, backtracking method that
          Vim, and most other crappy regex (note: not regular expression)
          libraries use, which is exactly the kind of crap that this GSoC
          project is aiming to scrap.

          nikocrap
        • Nikolai Weibull
          ... It supports * Byte matching, that is, raw bytes * Wide characters, that is, whatever wchar_t is * Multi-byte characters, thas is, whatever mbrtowc supports
          Message 4 of 7 , May 31, 2007
          • 0 Attachment
            On 5/31/07, Ian Young <ian.greenleaf@...> wrote:

            > I'm Ian, one of the two students working on improving the regexp
            > engine in Vim for this year's Google Summer of Code. I haven't had a
            > whole lot to contribute as of yet, but now that work is underway, I'll
            > probably pop up here asking lots of questions some days.
            >
            > Right now we're working on getting things set up and building a
            > testing suite, but I thought I would spark some discussion on a design
            > decision that will be coming up after we finish this phase, which is
            > whether to implement the new model ourselves, or use an alternative
            > engine, like TRE: <http://laurikari.net/tre/>. I'm tempted to
            > implement one ourselves, as it's an intellectually stimulating
            > prospect, but that doesn't mean I won't listen to reason if TRE or
            > another option is far better. I don't know much about the internals of
            > TRE, but according to previous posts to this list, it utilizes three
            > engines: a slow one for handling backreferences (presumably similar to
            > Vim's current engine), a fast one for most cases (what we are looking
            > to implement), and one for their 'fuzzy matching' feature.
            >
            > I have a couple questions to start things off. First: I couldn't see
            > much need for 'fuzzy matching' in Vim, but some of you are probably
            > much better acquainted with regexp use cases than I am. Would this be
            > a useful feature to have available?

            > Second: We might have to do some
            > gymnastics to work with multibyte characters, as discussed here: <
            > http://tech.groups.yahoo.com/group/vimdev/message/46408>. I haven't
            > worked with multibyte characters before, so I'm not clear on the
            > subtleties. Would this translation to wide characters before passing
            > to the engine cause much of a performance hit and/or be excessively
            > complicated to implement? On a side note, TRE's main page says it has
            > both wide character and multibyte character support. I couldn't find a
            > version history, so I'm not sure if this is a new feature that Nikolai
            > isn't aware of, or if we need something more.

            It supports

            * Byte matching, that is, raw bytes
            * Wide characters, that is, whatever wchar_t is
            * Multi-byte characters, thas is, whatever mbrtowc supports
            * Streams that is, objects that feed TRE characters as it needs them

            It would be pretty easy to set up a stream object that would feed TRE
            characters. It would only have to keep track of where it was in the
            buffer and basically request more of the buffer as TRE needs it.

            It should be noted that there are quite a few bugs in TRE that relate
            to the interaction of quantifiers. I have discussed this privately
            with Ville, but neither of us has been able to resolve it. It has
            also been discussed here:

            http://laurikari.net/pipermail/tre-general/2007-February/thread.html

            where Chris Kuklewicz suggests a solution to the problem that seems to
            work. It is a somewhat costly solution, but it may be worth it in all
            its simplicity. Chris has written an implementation of TDFAs for
            Haskell that is quite simple and manages to both outperform all other
            regex libraries for Haskell and still pass all POSIX tests. Here's
            the announcement:

            http://www.mail-archive.com/glasgow-haskell-users@.../msg11442.html

            This will, sadly, be of no use to us, but it does show that TDFAs are
            a possibility, and that the problems TRE has with quantifiers can be
            resolved.

            Anyway, fuzzy matching, it seems like this is a feature that never
            really caught on. Agrep has long enjoyed the status of being one of
            the few commands that remain to be implemented for the GNU project
            (can't seem to find the list right now, so I can't provide a link).
            This does, however, seem to indicate that no one has cared enough
            about it to implement and distribute it with GNU. It can be a quite
            interesting thing to have, but it's perhaps not useful enough to care
            about at this stage.

            Also, you won't have time to implement this yourself. Seriously. It
            takes a lot of work to write an efficient and
            as-compatible-as-possible implementation implementation and a summer
            isn't nearly enough time to complete said work. I think that what's
            most important here is to set up a test suite and the code required to
            interface with a library, such as TRE. That way one can always hook
            in another library when it gets written.

            Finally, good to hear from you. I think we all look forward to being
            able to enjoy the fruits of your hard labor ;-).

            nikolai
          • Charles E Campbell Jr
            ... As you likely know, fuzzy matching hasn t been available in Vim. One place it has been useful is in suggesting spelling corrections; I myself used agrep
            Message 5 of 7 , May 31, 2007
            • 0 Attachment
              Ian Young wrote:

              > I have a couple questions to start things off. First: I couldn't see
              > much need for 'fuzzy matching' in Vim, but some of you are probably
              > much better acquainted with regexp use cases than I am. Would this be
              > a useful feature to have available?

              As you likely know, fuzzy matching hasn't been available in Vim. One place
              it has been useful is in suggesting spelling corrections; I myself used
              agrep
              in the engspchk.vim plugin to support fuzzy matching.

              Bram already has a spelling error suggestion feature, so I have no idea
              if the
              fuzzy regex would help with it or not.

              What I think could be more useful would be boolean logic for regexp. My
              LogiPat
              plugin provides this capability, but undoubtedly it'd be better if
              somehow it could be
              incorporated. The resulting patterns from LogiPat seem to me to be
              somewhat opaque.

              Regards,
              Chip Campbell
            • Nikolai Weibull
              ... What would be even cooler would be to use regular relations, as that would allow for far superior substitution possibilities to what ... I ve long
              Message 6 of 7 , May 31, 2007
              • 0 Attachment
                On 5/31/07, Charles E Campbell Jr <drchip@...> wrote:

                > What I think could be more useful would be boolean logic for regexp. My
                > LogiPat
                > plugin provides this capability, but undoubtedly it'd be better if
                > somehow it could be
                > incorporated. The resulting patterns from LogiPat seem to me to be
                > somewhat opaque.

                What would be even cooler would be to use regular relations, as that
                would allow for far superior substitution possibilities to what
                :substitute has to offer.

                I've long considered writing a text editor around regular relations,
                and was actually hoping to get a Ph.D. based on using regular
                relations in interactive processes, but that sadly never happened.

                nikolai
              • Nikolai Weibull
                ... (Someone asked off-list what regular relations were. If anyone else is interested, here s what I responded with.) Here are some papers on regular
                Message 7 of 7 , May 31, 2007
                • 0 Attachment
                  On 5/31/07, Nikolai Weibull <now@...> wrote:

                  > What would be even cooler would be to use regular relations, as that
                  > would allow for far superior substitution possibilities to what
                  > :substitute has to offer.

                  (Someone asked off-list what regular relations were. If anyone else
                  is interested, here's what I responded with.)

                  Here are some papers on regular relations:

                  http://citeseer.comp.nus.edu.sg/karttunen95replace.html
                  http://citeseer.comp.nus.edu.sg/karttunen96regular.html

                  Also see

                  http://www.xrce.xerox.com/competencies/content-analysis/fst/home.en.html

                  nikolai

                  P.S.
                  Please don't top-post.
                  D.S.
                Your message has been successfully submitted and would be delivered to recipients shortly.