Loading ...
Sorry, an error occurred while loading the content.

RE: [antlr-interest] Help Lexer rules & states.

Expand Messages
  • Millaway, John
    ... Coming from a flex background, I thought the multiple lexer solution sounded too complex until I started using it. It s actually quite simple. If your
    Message 1 of 6 , Aug 8 8:53 AM
    • 0 Attachment
      > The problem with this approach is that it only works for tags
      > that want
      > to return a
      > single value. I read else where that the Lex state mechanism can be
      > achieved by
      > creating multiple Lexers and switching to the different
      > lexers as tokens
      > are being
      > recognized.
      > This seems a little extreme and complicated especially if
      > multiple unique
      > tokens share the same sub tokens. A lot of duplicate work. Am
      > I missing
      > something
      > here ??
      >

      Coming from a flex background, I thought the "multiple lexer" solution
      sounded too complex until I started using it. It's actually quite simple. If
      your multiple lexers need to share common tokens, then inherit them from a
      base class.

      EXAMPLE:
      This is an example of a parser that switches between two lexers for two
      different types of tags. The lexers both inherit from the same base class so
      they can share some rules.


      class MyParser extends Parser;
      options{ exportVocab=foo;}
      file:
      ( normalTag | specialTag )+
      ;

      normalTag:
      NORMAL_TAG_START (ATTR_NAME EQUALS ATTR_VALUE)+ NORMAL_TAG_END;

      // If we see a special tag, switch lexers.
      specialTag:
      SPECIAL_TAG_START { selector.push("SpecialTagLexer"); }
      BGCOLOR EQUALS ATTR_VALUE TAG SPECIAL_TAG_END;


      class BaseLexer extends Lexer;
      options{ importVocab=foo;}
      protected ATTR_NAME: TEXT ;
      protected ATTR_VALUE: '"' TEXT '"';
      protected EQUALS: '=';

      class NormalTagLexer extends BaseLexer;
      options{ importVocab=foo;}
      NORMAL_TAG_START: "<normal" ;
      SPECIAL_TAG_START: "<special";
      NORMAL_TAG_END: '>';


      class SpecialTagLexer extends BaseLexer;
      options{ importVocab=foo;}

      BGCOLOR: "bgcolor" EQUALS ATTR_VALUE;

      SPECIAL_TAG_END:
      '>' { selector.pop(); }
      ;
    • Sinan
      ... [....] ... ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Shouldn t this be done in the first lexer, so that k-token lookahead will not mess things up????? Sinan
      Message 2 of 6 , Aug 8 3:35 PM
      • 0 Attachment
        "Millaway, John" wrote:
        >
        > > The problem with this approach is that it only works for tags
        > > that want
        > > to return a
        > > single value. I read else where that the Lex state mechanism can be
        > > achieved by
        > > creating multiple Lexers and switching to the different
        > > lexers as tokens
        > > are being
        > > recognized.
        > > This seems a little extreme and complicated especially if
        > > multiple unique
        > > tokens share the same sub tokens. A lot of duplicate work. Am
        > > I missing
        > > something
        > > here ??
        > >
        >
        > Coming from a flex background, I thought the "multiple lexer" solution
        > sounded too complex until I started using it. It's actually quite simple. If
        > your multiple lexers need to share common tokens, then inherit them from a
        > base class.
        >
        [....]
        >
        > class MyParser extends Parser;
        > options{ exportVocab=foo;}
        > file:
        > ( normalTag | specialTag )+
        > ;
        >
        > normalTag:
        > NORMAL_TAG_START (ATTR_NAME EQUALS ATTR_VALUE)+ NORMAL_TAG_END;
        >
        > // If we see a special tag, switch lexers.
        > specialTag:
        > SPECIAL_TAG_START { selector.push("SpecialTagLexer"); }
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        Shouldn't this be done in the first lexer, so that k-token
        lookahead will not mess things up?????


        Sinan
      • Millaway, John
        ... Sure, then the multiple lexers are transparent to the parser, which is a cleaner approach. But you *could* choose to let the parser operate the selector.
        Message 3 of 6 , Aug 8 3:56 PM
        • 0 Attachment
          > > class MyParser extends Parser;
          > > options{ exportVocab=foo;}
          > > file:
          > > ( normalTag | specialTag )+
          > > ;
          > >
          > > normalTag:
          > > NORMAL_TAG_START (ATTR_NAME EQUALS ATTR_VALUE)+ NORMAL_TAG_END;
          > >
          > > // If we see a special tag, switch lexers.
          > > specialTag:
          > > SPECIAL_TAG_START { selector.push("SpecialTagLexer"); }
          > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          > Shouldn't this be done in the first lexer, so that k-token
          > lookahead will not mess things up?????
          >

          Sure, then the multiple lexers are transparent to the parser, which is a
          cleaner approach. But you *could* choose to let the parser operate the
          selector. The parser might have a better clue when to switch lexers. I was
          just giving an example from the top of my head, which proves to have been
          buggy (my head, I mean)!

          -John
        • mzukowski@bco.com
          ... In general, you *can t* choose to let the parser operate the selector for the lexer. The parser is always at least k tokens ahead of the lexer and is
          Message 4 of 6 , Aug 9 6:44 AM
          • 0 Attachment
            RE: [antlr-interest] Help Lexer rules & states.

            > -----Original Message-----
            > From: Millaway, John [mailto:john@...]
            > Sent: Tuesday, August 08, 2000 3:56 PM
            > To: antlr-interest@egroups.com
            > Subject: RE: [antlr-interest] Help Lexer rules & states.
            >
            >
            > > > class MyParser extends Parser;
            > > > options{ exportVocab=foo;}
            > > > file:
            > > >     (  normalTag |   specialTag )+
            > > >    ;
            > > >
            > > > normalTag:
            > > >     NORMAL_TAG_START (ATTR_NAME EQUALS ATTR_VALUE)+
            > NORMAL_TAG_END;
            > > >
            > > > // If we see a special tag, switch lexers.
            > > > specialTag:
            > > >     SPECIAL_TAG_START { selector.push("SpecialTagLexer"); }
            > >                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            > > Shouldn't this be done in the first lexer, so that k-token
            > > lookahead will not mess things up?????
            > >
            >
            > Sure,  then the multiple lexers are transparent to the
            > parser, which is a
            > cleaner approach. But you *could* choose to let the parser operate the
            > selector. The parser might have a better clue when to switch
            > lexers. I was
            > just giving an example from the top of my head, which proves
            > to have been
            > buggy (my head, I mean)!

            In general, you *can't* choose to let the parser operate the selector for the lexer.  The parser is always at least k tokens ahead of the lexer and is possibly already to the end of the file due to "infinite lookahead" aka syntactic predicates.  Your lexer will be switched too late, it will already be >=k tokens ahead of where you want the switch to happen.

            The general rule in antlr is that you can't have any feedback from the parser to the lexer.  It is almost guaranteed to not work.

            Monty

          • M. Ranganathan
            I think one should be wary of switching lexers in the parser and avoid it as much as possible. I find that the following rules of thumb are useful : (1) set
            Message 5 of 6 , Aug 9 10:50 AM
            • 0 Attachment
              I think one should be wary of switching lexers in the parser and avoid it as
              much as possible. I find that the following rules of thumb are useful :

              (1) set lokahead (k) == 1 The situation could get problematic when there are
              k tokens. I like to keep k == 1 and handle the difficult cases "by hand".

              (2) It appears to be fairly safe just after a token but is probably not safe
              if you are switching just after a non-terminal. The problem is that the
              lookahead mechanism uses the previous lexical analyzer to determine what
              production to pick.

              In my parser (I have over 10 lexers) I switch after a non-terminal in the
              parser and so far nothing has broken... (cross my fingers!). It seems to
              work so far.

              Another approach is to avoid using syntactic lookahead and just semantic
              lookahead (i.e. do some "by hand " looking ahead in the entry section of the
              rule and select the lexical analyzer for the rule. This could be a gotcha if
              you nest semantic and syntactic lookaheads.

              Any refinements or counter examples to the above heuristics (would be
              welcome).

              Ranga.

              ----
              M. Ranganathan
              NIST Advanced Networking Technologies Group
              100 Bureau Drive, Stop 8290
              Gaithersburg, MD 20899, U.S.A.
              mailto:mranga@...
              Tel: 301 975 3664 Fax: 301 590 0932


              -----Original Message-----
              From: Millaway, John [mailto:john@...]
              Sent: Tuesday, August 08, 2000 3:56 PM
              To: antlr-interest@egroups.com
              Subject: RE: [antlr-interest] Help Lexer rules & states.


              > > class MyParser extends Parser;
              > > options{ exportVocab=foo;}
              > > file:
              > > ( normalTag | specialTag )+
              > > ;
              > >
              > > normalTag:
              > > NORMAL_TAG_START (ATTR_NAME EQUALS ATTR_VALUE)+ NORMAL_TAG_END;
              > >
              > > // If we see a special tag, switch lexers.
              > > specialTag:
              > > SPECIAL_TAG_START { selector.push("SpecialTagLexer"); }
              > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              > Shouldn't this be done in the first lexer, so that k-token
              > lookahead will not mess things up?????
              >

              Sure, then the multiple lexers are transparent to the parser, which is a
              cleaner approach. But you *could* choose to let the parser operate the
              selector. The parser might have a better clue when to switch lexers. I was
              just giving an example from the top of my head, which proves to have been
              buggy (my head, I mean)!

              -John
            • M. Ranganathan
              RE: [antlr-interest] Help Lexer rules & states.Monty, In general, I agree that feedback from parser to lexer is a shaky proposition but it can be made to work
              Message 6 of 6 , Aug 11 11:59 AM
              • 0 Attachment
                RE: [antlr-interest] Help Lexer rules & states.
                Monty,
                 
                In general, I agree that feedback from parser to lexer is a shaky proposition but it can be made to work for restricted cases. For example, if you have lexers L1 and L2 and you have a common token T, and you are using L1 while parsing in some rule and hit the token T then you can switch to L2 safely.  i.e. assume you are using L1 for the rule:
                 
                rule:
                    nonterminal1 T { Main.selector.select("L2Lexer"); } nonterminal2
                ;
                 
                is safe provided T is in both L1 and L2
                 
                Now there is a significant advantage to doing things this way as you can select the rule in which you want to make a switch. An alternative would be to use another Lexer L3 which is essentially the same as L1 but has the switch coded into the lexer when you see T. However, this approach results in lots of lexers. I have 10 in my grammar and I have a hard time tracking things.
                 
                A gotcha is if you want to select a lexer when you have alterative nonterminals to expand.
                Here, an  approach I have used  when there are alternatives is to pick the lexer on entry to the alternative,  by doing some looking ahead "by hand".
                 
                ip_address_or_hostname
                {
                    boolean ip;
                    if ( LA(1) == DIGIT ) {
                            ip =  true;
                             Main.selector.select("IPAddressLexer");
                    } else {
                            Main.selector.select("HostNameLexer");
                            ip = false;
                    }
                }
                : { ip }? ip_address
                | hostname
                ;
                 
                OK determining whether its an ip address or hostname is more complex than that  - I have simplified, but you get the idea.... However, there is a gotcha here as well - the gotcha here is you better not enclose the rule in an enclosing syntactic lookahead or things seem to break. After much heartbreak, I decided to use syntactic lookahead sparingly.  Another gotcha is that this will break if you use k > 1. Solution ... I use k == 1 and I select alternatives exclusively using this technique when there are ambiguities. ( Pay careful attention to the follow sets and make sure that you have not enclosed this rule in an enclosing syntactic lookahead and you are safe....: )
                 
                Provided you are willing to some hackery like the above, I believe you can completely avoid syntactic lookahead in your grammar and rely instead on localized "by hand" semantic lookahead.
                 
                Bottom line -- my experience is that in general switching lexers from the parser is to be avoided but it can be done and is very useful in limited cases. 
                 
                Now some words of praise: Antlr is groovy! An appropriate tool for dealing with composite, imbedded, languages. I am using it to implement an interpreter for message headers for a new IETF signaling protocol that imbeds headers from other protocols and has context dependent reserved keywords.  I found no other parser generator tool that offers me the flexibility of antlr. Indeed all other implementations I have seen parse such headers "by hand" and I was quite determined to use formal parsing techniques as it is much easer to have an assurance of correctness.
                 
                 I thank you, Ter and the other major contributors for your efforts.
                 
                Sincerely
                 
                Ranga.

                ----
                M. Ranganathan
                NIST Advanced Networking Technologies Group
                100 Bureau Drive, Stop 8290
                Gaithersburg, MD 20899, U.S.A.
                mailto:mranga@...
                Tel: 301 975 3664 Fax: 301 590 0932

                 
                 
                -----Original Message-----
                From: mzukowski@... [mailto:mzukowski@...]
                Sent: Wednesday, August 09, 2000 6:45 AM
                To: antlr-interest@egroups.com
                Subject: RE: [antlr-interest] Help Lexer rules & states.



                > -----Original Message-----
                > From: Millaway, John [mailto:john@...]
                > Sent: Tuesday, August 08, 2000 3:56 PM
                > To: antlr-interest@egroups.com
                > Subject: RE: [antlr-interest] Help Lexer rules & states.
                >
                >
                > > > class MyParser extends Parser;
                > > > options{ exportVocab=foo;}
                > > > file:
                > > >     (  normalTag |   specialTag )+
                > > >    ;
                > > >
                > > > normalTag:
                > > >     NORMAL_TAG_START (ATTR_NAME EQUALS ATTR_VALUE)+
                > NORMAL_TAG_END;
                > > >
                > > > // If we see a special tag, switch lexers.
                > > > specialTag:
                > >

                >     SPECIAL_TAG_START {
                selector.push("SpecialTagLexer"); }
                >
                >                          
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

                > > Shouldn't this be done in the first lexer, so that k-token
                > > lookahead will not mess things up?????
                > >
                >
                > Sure,  then the multiple lexers are transparent to the
                > parser, which is a
                > cleaner approach. But you *could* choose to let the parser operate the
                > selector. The parser might have a better clue when to switch
                > lexers. I was
                > just giving an example from the top of my head, which proves
                > to have been
                > buggy (my head, I mean)!

                In general, you *can't* choose to let the parser operate the selector for the lexer.  The parser is always at least k tokens ahead of the lexer and is possibly already to the end of the file due to "infinite lookahead" aka syntactic predicates.  Your lexer will be switched too late, it will already be >=k tokens ahead of where you want the switch to happen.

                The general rule in antlr is that you can't have any feedback from the parser to the lexer.  It is almost guaranteed to not work.

                Monty

              Your message has been successfully submitted and would be delivered to recipients shortly.