
Re: Recursive-descent lexer question.

  • John B. Brodie
    Message 1 of 5 , Nov 9, 1999
      Mr. Richter :-

      You asked about resolving an ambiguity in the following:

      > Whitespace
      > : ( ' ' | '\t' | '\f' | Newline )+
      > { $setType(Token.SKIP); }
      > ;
      >
      > Newline
      > : ( "\r\n" | '\r' | '\n' )
      > { newline(); }
      > ;

      Consider the scanning of this string of two characters: "\r\n"

      Under the above rules, in particular because of the required 1-or-more
      repetition in the Whitespace rule, there are two derivations possible:

      Whitespace ==> Newline
      ==> "\r\n"

      or

      Whitespace ==> Whitespace Whitespace
      ==> Newline Whitespace
      ==> '\r' Whitespace
      ==> '\r' Newline
      ==> '\r' '\n'

      I think you should left-factor the two Newline alternatives that begin
      with the '\r' character, thusly:

      Newline
      : ( '\r' ( '\n' )? | '\n' )
      { newline(); }
      ;

      Note that I have not actually tried this, so it is probably bogus.
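
To illustrate what the left-factored rule does at run time, here is a minimal hand-written Java sketch. It is not ANTLR-generated code, and the names `NewlineSketch`, `matchNewline` and `countLines` are made up for this example. After seeing '\r' it optionally consumes a following '\n', so "\r\n" is only ever counted as one line terminator:

```java
public class NewlineSketch {
    /**
     * Mirrors the left-factored rule ( '\r' ( '\n' )? | '\n' ).
     * Returns the number of characters consumed at pos, or 0 if
     * there is no line terminator at that position.
     */
    public static int matchNewline(String s, int pos) {
        if (pos >= s.length()) {
            return 0;
        }
        char c = s.charAt(pos);
        if (c == '\r') {
            // The '\n' after '\r' is optional: one decision, no backtracking.
            if (pos + 1 < s.length() && s.charAt(pos + 1) == '\n') {
                return 2;
            }
            return 1;
        }
        if (c == '\n') {
            return 1;
        }
        return 0;
    }

    /** Counts line terminators, treating "\r\n" as a single one. */
    public static int countLines(String s) {
        int lines = 0;
        int pos = 0;
        while (pos < s.length()) {
            int consumed = matchNewline(s, pos);
            if (consumed > 0) {
                lines++;
                pos += consumed;
            } else {
                pos++; // not a newline character, skip it
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // One "\r\n", one bare '\r', one bare '\n'
        System.out.println(countLines("a\r\nb\rc\nd")); // prints 3
    }
}
```

Because the choice between "\r\n" and a bare '\r' is made after the '\r' has already been consumed, the two-character string "\r\n" can no longer be read as two separate newlines.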

      Hope this helps...

      -jbb
      --------------------+------------------------
      John B. Brodie | Email : jbb@...
      --------------------+------------------------
    • Doug Erickson
      Message 2 of 5 , Nov 9, 1999
        If possible, apply the "protected" modifier to your Newline rule.

        What this means is that the "next token" logic will not consider Newline
        an alternative when trying to decide what the next token is. However,
        the lexer will contain a Newline rule to match those characters that can
        be called from non-protected rules like Whitespace or Comment.

        As long as you don't care about Newline tokens up in your parser, this
        works great.
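
A toy hand-written lexer in Java may make the idea concrete. This is only a sketch of the concept, not what ANTLR actually generates, and the class and method names are invented for illustration. The `newline()` helper plays the role of the protected rule: `nextToken()` never tries it as a top-level alternative, but `whitespace()` is free to call it.

```java
public class ProtectedRuleSketch {
    private final String input;
    private int pos = 0;
    private int line = 1;

    public ProtectedRuleSketch(String input) {
        this.input = input;
    }

    public int getLine() {
        return line;
    }

    private static boolean isWhitespaceChar(char c) {
        return c == ' ' || c == '\t' || c == '\f' || c == '\r' || c == '\n';
    }

    // Stands in for the "protected" Newline rule: only reachable from other rules.
    private boolean newline() {
        if (pos >= input.length()) {
            return false;
        }
        char c = input.charAt(pos);
        if (c == '\r') {
            pos++;
            if (pos < input.length() && input.charAt(pos) == '\n') {
                pos++;
            }
        } else if (c == '\n') {
            pos++;
        } else {
            return false;
        }
        line++; // same job as the newline() action in the grammar
        return true;
    }

    // Stands in for the non-protected Whitespace rule, which is skipped.
    private boolean whitespace() {
        int start = pos;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ' || c == '\t' || c == '\f') {
                pos++;
            } else if (!newline()) {
                break;
            }
        }
        return pos > start;
    }

    /** Returns the next run of non-whitespace characters, or null at end of input. */
    public String nextToken() {
        while (true) {
            if (pos >= input.length()) {
                return null;
            }
            if (whitespace()) {
                continue; // skipped, like $setType(Token.SKIP)
            }
            // Note: newline() is deliberately NOT tried here as an alternative.
            int start = pos;
            while (pos < input.length() && !isWhitespaceChar(input.charAt(pos))) {
                pos++;
            }
            return input.substring(start, pos);
        }
    }
}
```

With only one top-level rule able to start on '\r' or '\n', there is nothing left for the token-selection logic to be ambiguous about, while line counting still happens in one shared place.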

        "Michael T. Richter" wrote:
        >
        > From: "Michael T. Richter" <mtr@...>
        >
        > I thought that part of the whole point of the new lexer was to make a lexer
        > with a language that is very similar to the parser -- in effect to "parse"
        > a set of lexemes. I seem to be missing something (probably very obvious)
        > however.
        >
        > The issue is this: I have a rule for whitespace which, of course, handles
        > newlines by counting lines. So far so good. Unfortunately the ASN.1 spec
        > also mandates comments which are opened with "--" and which may be
        > terminated by newlines or by another "--". The obvious thing for me to do,
        > or so I thought, was to do something like this:
        >
        > Whitespace
        > : ( ' ' | '\t' | '\f' | Newline )+
        > { $setType(Token.SKIP); }
        > ;
        >
        > Newline
        > : ( "\r\n" | '\r' | '\n' )
        > { newline(); }
        > ;
        >
        > In this way I can put the Newline definition in my Comment rule (as yet
        > undefined) as well.
        >
        > As seems to be my usual, my instincts are wrong. Doing a quick test
        > compile of a grammar containing the above gives me this:
        >
        > warning: lexical nondeterminism between rules Whitespace and Newline upon
        > k==1:'\n','\r'
        > k==2:<end-of-token>,'\n'
        >
        > Apparently ANTLR is using Newline as a root rule as well as something
        > called from within Whitespace.
        >
        > What very obvious thing am I missing? Does it have something to do with
        > the tokens{} section of the grammar (something I don't quite grok the point
        > of yet)?
        >
        > --
        > Michael T. Richter <mtr@...> http://www.igs.net/~mtr/
        > PGP Key: http://www.igs.net/~mtr/pgp-key.html
        > PGP Fingerprint: 40D1 33E0 F70B 6BB5 8353 4669 B4CC DD09 04ED 4FE8
        >
        >
      • Michael T. Richter
        Message 3 of 5 , Nov 9, 1999
          At 11:24 AM 11/9/99 , you wrote:
          > If possible, apply the "protected" modifier to your Newline rule.

          Thanks. That did it.

          >What this means is that the "next token" logic will not consider Newline
          >an alternative when trying to decide what the next token is. However,
          >the lexer will contain a Newline rule to match those characters that can
          >be called from non-protected rules like Whitespace or Comment.

          I'm a little unclear about the construction of the lexer. (Time to look at
          the source code, methinks.) The "nextToken" method seems to be the big
          difference between the lexer generator and the parser generator, though.

          >As long as you don't care about Newline tokens up in your parser, this
          >works great.

          I don't care about newlines, comments or whitespace up in parser-land,
          except insofar as these three items separate tokens. Your tip did the
          trick. Thanks again.

          --
          Michael T. Richter <mtr@...> http://www.igs.net/~mtr/
          PGP Key: http://www.igs.net/~mtr/pgp-key.html
          PGP Fingerprint: 40D1 33E0 F70B 6BB5 8353 4669 B4CC DD09 04ED 4FE8
        • Michael T. Richter
          Message 4 of 5 , Nov 9, 1999
            At 11:20 AM 11/9/99 , you wrote:
            >I think you should left factor the two Newline sub-rules involving the
            >'\r' character thusly:

            > Newline
            > : ( '\r' ( '\n' )? | '\n' )
            > { newline(); }
            > ;

            >Note that I have not actually tried this, so it is probably bogus.

            >Hope this helps...

            It didn't work directly "out-of-the-box", but it gave me a new line of
            attack. This new line of attack resulted in a newer, probably much more
            efficient (and certainly more readable) rule. Thanks for the tip.

            --
            Michael T. Richter <mtr@...> http://www.igs.net/~mtr/
            PGP Key: http://www.igs.net/~mtr/pgp-key.html
            PGP Fingerprint: 40D1 33E0 F70B 6BB5 8353 4669 B4CC DD09 04ED 4FE8