Re: [antlr-interest] lexical nondeterminism between IDENT & LABEL

  • Paul J. Lucas
    Message 1 of 16 , Nov 3, 2004
      On Wed, 3 Nov 2004, John D. Mitchell wrote:

      > >>>>> "thoth2487" == thoth2487 <thoth2487@...> writes:
      >
      > > Hi to all, I have a very simple language in which there are IDENTifiers and
      > > jump LABELs. An IDENTifier starts with ('a'..'z')|('A'..'Z') and continues
      > > with ('a'..'z')|('A'..'Z')|('0'..'9'), and a LABEL is like an IDENTifier
      > > but ends with a ':'. When I try the following .g I always get a lexical
      > > nondeterminism which I am not able to solve:
      >
      > Stop trying to do that in the lexer. Let the lexer return the ID for both
      > and then have your parsing rules distinguish between ID ":" being a label
      > or the ID is just an ID.

      What's wrong with using a syntactic predicate in the lexer?
      Just because a label happens to have the same character pattern
      as an identifier doesn't mean it's conceptually the same kind
      of token.

      Since ANTLR has a much more powerful lexer than most, why not
      take advantage of it?

      - Paul
    • Monty Zukowski
      Message 2 of 16 , Nov 3, 2004
        On Nov 3, 2004, at 4:57 AM, thoth2487 wrote:

        >
        > // identifier
        > IDENT
        > : ('a'..'z' | 'A'..'Z')
        > ('a'..'z' | 'A'..'Z' | '0'..'9')*
        > ;
        >
        > // label
        > LABEL
        > : ('a'..'z' | 'A'..'Z')
        > ('a'..'z' | 'A'..'Z' | '0'..'9')*
        > ':'
        > ;

        Factor it into only one rule:

        // identifier or
        // label
        IDENT
        : ('a'..'z' | 'A'..'Z')
        ('a'..'z' | 'A'..'Z' | '0'..'9')*
        ':' {$setType(LABEL);}
        ;

        Be sure to add LABEL to your tokens{} section.

        Monty
      • Paul J. Lucas
        Message 3 of 16 , Nov 3, 2004
          On Wed, 3 Nov 2004, Monty Zukowski wrote:

          > IDENT
          > : ('a'..'z' | 'A'..'Z')
          > ('a'..'z' | 'A'..'Z' | '0'..'9')*
          > ':' {$setType(LABEL);}
          > ;

          Shouldn't that be:


          IDENT
          : ('a'..'z' | 'A'..'Z')
          ('a'..'z' | 'A'..'Z' | '0'..'9')*
          (':' {$setType(LABEL);} )?
          ;

          ?

          - Paul
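To make the effect of the optional ':' concrete, here is a minimal sketch in Python rather than ANTLR. It is only a model of the single-rule idea, not generated lexer code; the names `IDENT_OR_LABEL` and `lex_ident_or_label` are made up for illustration.

```python
import re

# Hypothetical pattern mirroring the single IDENT rule: a letter, then
# letters or digits, then an optional ':' that turns the token into a LABEL.
IDENT_OR_LABEL = re.compile(r"[A-Za-z][A-Za-z0-9]*(:)?")

def lex_ident_or_label(text, pos=0):
    """Return (type, lexeme, new_pos) for the token at pos, or None."""
    m = IDENT_OR_LABEL.match(text, pos)
    if m is None:
        return None
    # Plays the role of {$setType(LABEL);} inside the optional subrule:
    # the type changes only when the ':' was actually consumed.
    kind = "LABEL" if m.group(1) else "IDENT"
    return kind, m.group(0), m.end()
```

Note that in this variant the ':' is part of the LABEL lexeme, and that without the optional subrule a plain identifier would not match at all.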
        • Monty Zukowski
          Message 4 of 16 , Nov 3, 2004
            On Nov 3, 2004, at 11:54 AM, Paul J. Lucas wrote:

            >
            > On Wed, 3 Nov 2004, Monty Zukowski wrote:
            >
            >> IDENT
            >> : ('a'..'z' | 'A'..'Z')
            >> ('a'..'z' | 'A'..'Z' | '0'..'9')*
            >> ':' {$setType(LABEL);}
            >> ;
            >
            > Shouldn't that be:
            >
            >
            > IDENT
            > : ('a'..'z' | 'A'..'Z')
            > ('a'..'z' | 'A'..'Z' | '0'..'9')*
            > (':' {$setType(LABEL);} )?
            > ;
            >
            > ?

            Absolutely!

            Thanks,

            Monty
          • Monty Zukowski
            Message 5 of 16 , Nov 3, 2004
              On Nov 3, 2004, at 9:49 AM, John D. Mitchell wrote:

              >
              >>>>>> "thoth2487" == thoth2487 <thoth2487@...> writes:
              > [...]
              >
              >> Hi to all, I have a very simple language in which there are
              >> IDENTifiers and jump LABELs. An IDENTifier starts with
              >> ('a'..'z')|('A'..'Z') and continues with
              >> ('a'..'z')|('A'..'Z')|('0'..'9'), and a LABEL is like an IDENTifier
              >> but ends with a ':'. When I try the following .g I always get a
              >> lexical nondeterminism which I am not able to solve:
              >
              > Stop trying to do that in the lexer. Let the lexer return the ID for
              > both and then have your parsing rules distinguish between ID ":" being
              > a label or the ID is just an ID.
              >


              Some languages like AREV are made much simpler if the lexer can
              distinguish labels from IDs; that was the whole inspiration for my
              ParserFilter example on my website.

              Monty

              ANTLR & Java Consultant -- http://www.codetransform.com
              ANSI C/GCC transformation toolkit --
              http://www.codetransform.com/gcc.html
              Embrace the Decay -- http://www.codetransform.com/EmbraceDecay.html
            • John D. Mitchell
              Message 6 of 16 , Nov 3, 2004
                >>>>> "Paul" == Paul J Lucas <pauljlucas@...> writes:
                [...]

                > What's wrong with using a syntactic predicate in the lexer?

                Theoretically? Nothing.

                > Just because a label happens to have the same character pattern as an
                > identifier doesn't mean it's conceptually the same kind of token.

                Indeed.

                However:

                (A) Newbies (and even experienced folks :-) too often try to jam way too
                much into the lexer. This is a Very Bad Thing(tm) and, IMHO, should be
                generally discouraged.

                (B) A common reason given is that "the language is simple" so just do it in
                the lexer. All too often, that's not the case and semantic context is
                required. (See my comment below).

                (C) When people start their "simple" solutions in the lexer and things get
                wacky, they all too often try to hack things to e.g. push context back into
                the lexer from the parser to "fix" the problem (and that's Pure Evil(tm) :-).

                > Since ANTLR has a much more powerful lexer than most, why not take
                > advantage of it?

                For a complex example of how to deal with this sort of confusion in the lexer,
                check out the Number rule near the bottom of the StdC grammar. This is
                dealing with purely syntactic ambiguity because of the many uses of '.'.

                Take care,
                John
              • thoth2487
                Message 7 of 16 , Nov 3, 2004
                  --- In antlr-interest@yahoogroups.com, "John D. Mitchell"
                  <johnm-antlr@n...> wrote:
                  > Stop trying to do that in the lexer.
                  > Let the lexer return the ID for both and then have your
                  > parsing rules distinguish between ID ":" being a label
                  > or the ID is just an ID.

                  I've tried the parser approach you suggested, with:

                  ident: IDENT;

                  label: IDENT
                  COLON
                  ;

                  but this way a LABEL could be either:

                  MAIN: // right LABEL
                  or
                  MAIN : // wrong LABEL because of the space(s)

                  so I needed to change the WS rule from:
                  WS: (' '|'\t'|'\f') {$setType(Token.SKIP);};

                  to
                  WS: (' '|'\t'|'\f')*;

                  Now the parser works fine with 'ident' & 'label',
                  but the new WS behaviour makes the parser rules more complex,
                  because they must always check for the presence of WS as well. E.g.:

                  conditional:
                  IF
                  (WS)?
                  expression
                  (WS)?
                  goto
                  ........ and so on

                  What have I done wrong? What do you suggest?

                  Thank you very much
                  Silverio Diquigiovanni
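The trade-off described in this message can be sketched in Python under a few assumptions: a toy tokenizer with a `skip_ws` switch stands in for the two WS rules, and the token names simply follow the thread. It is a model of the behaviour, not ANTLR output.

```python
import re

# Hypothetical three-token lexer: whitespace, identifiers, and ':'.
TOKEN = re.compile(
    r"(?P<WS>[ \t\f]+)|(?P<IDENT>[A-Za-z][A-Za-z0-9]*)|(?P<COLON>:)"
)

def tokenize(text, skip_ws=True):
    """Return a list of (type, lexeme) pairs; drop WS when skip_ws is set."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup == "WS" and skip_ws:
            continue  # same effect as $setType(Token.SKIP)
        tokens.append((m.lastgroup, m.group()))
    return tokens
```

With whitespace skipped, `MAIN:` and `MAIN :` produce the same token list, so a parser rule `IDENT COLON` cannot tell them apart. With whitespace kept, the two inputs differ, but every parser rule then has to deal with WS tokens.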
                • Paul J. Lucas
                  Message 8 of 16 , Nov 3, 2004
                    On Thu, 4 Nov 2004, thoth2487 wrote:

                    > Now the parser works fine with 'ident' & 'label',
                    > but the new WS behaviour makes the parser rules more complex,
                    > because they must always check for the presence of WS as well. E.g.:
                    >
                    > conditional:
                    > IF
                    > (WS)?
                    > expression
                    > (WS)?
                    > goto
                    > ........ and so on
                    >
                    > What have I done wrong? What do you suggest?

                    You make the rule include whitespace:

                    ... ( (WS)? ':' { $setType( LABEL ); } )

                    - Paul
                  • thoth2487
                    Message 9 of 16 , Nov 4, 2004
                      <pauljlucas@m...> wrote:

                      > What's wrong with using a syntactic predicate in the lexer?

                      Could you give me a sample syntactic predicate that resolves the
                      ident/label lexical nondeterminism below?

                      IDENT: ('A'..'Z'|'a'..'z')*;
                      LABEL: ('A'..'Z'|'a'..'z')* ':';

                      Thank you very much
                      Silverio Diquigiovanni
                    • Paul J. Lucas
                      Message 10 of 16 , Nov 4, 2004
                        On Thu, 4 Nov 2004, thoth2487 wrote:

                        > Could you give me a sample syntactic predicate that resolves the
                        > ident/label lexical nondeterminism below?

                        protected Ident
                        : /* fill in the blank */
                        ;

                        IDENT
                        : (Ident (WS)? ':')=> Ident { $setType( LABEL ); }
                        | Ident // no ':' ahead: a plain identifier
                        ;

                        Monty's solution makes the ':' part of the token; the above
                        doesn't. Hence, the above is "cleaner" from the parser
                        perspective.

                        - Paul
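The difference from the single-rule version can be modelled in Python as a lookahead that is checked but not consumed. This is only a sketch of the idea behind the predicate; `IDENT`, `LABEL_AHEAD` and `lex_with_lookahead` are illustrative names, not ANTLR API.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")
# Lookahead only: optional blanks followed by ':' must come next,
# but none of that text becomes part of the token.
LABEL_AHEAD = re.compile(r"[ \t\f]*:")

def lex_with_lookahead(text, pos=0):
    """Return (type, lexeme, new_pos); the ':' is never consumed here."""
    m = IDENT.match(text, pos)
    if m is None:
        return None
    # Plays the role of the predicate (Ident (WS)? ':')=> ...
    kind = "LABEL" if LABEL_AHEAD.match(text, m.end()) else "IDENT"
    return kind, m.group(), m.end()
```

Here the LABEL lexeme is just the name, and the ':' is left in the input as a separate token for whatever rule handles it. Dropping the `[ \t\f]*` part of `LABEL_AHEAD` gives the stricter variant in which no whitespace is allowed before the ':'.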
                      • John D. Mitchell
                        Message 11 of 16 , Nov 4, 2004
                          >>>>> "thoth2487" == thoth2487 <thoth2487@...> writes:
                          [...]

                          > MAIN: // right LABEL or MAIN : // wrong LABEL because of the space(s)

                          Well, I don't know what language you're trying to build so it's hard to
                          give you specific advice.

                          If your label construct is truly syntactic then using the fixed version of
                          Monty's example of doing it in the lexer is a reasonable approach.

                          However, if the colon is overloaded (like '.' in the C language) but the
                          ambiguities are all purely syntactic in nature then doing the more
                          complicated factoring as exemplified by the Number rule in the StdC lexer
                          is a reasonable approach.

                          However, if the ambiguities related to ':' in your language require
                          semantic context then how you should resolve it in the parser depends on
                          the semantics of the language. For an example of this, check out the
                          StdCParser.g and look for the rules using COLON.


                          > so I need to change WS rule from: WS: (' '|'\t'|'\f')
                          > {$setType(Token.SKIP);};

                          > to WS: (' '|'\t'|'\f')*;

                          That's pretty much never a good idea for exactly the reason that you
                          discovered.

                          Hope this helps,
                          John
                        • John D. Mitchell
                          Message 12 of 16 , Nov 4, 2004
                            >>>>> "Paul" == Paul J Lucas <pauljlucas@...> writes:
                            [...]

                            > You make the rule include whitespace:

                            > ... ( (WS)? ':' { $setType( LABEL ); } )

                            Hmm... I thought the OP said that the LABELs could NOT have any WS between
                            the ID and the ':'?

                            Thanks,
                            John
                          • Paul J. Lucas
                            Message 13 of 16 , Nov 4, 2004
                              On Thu, 4 Nov 2004, John D. Mitchell wrote:

                              > >>>>> "Paul" == Paul J Lucas <pauljlucas@...> writes:
                              > [...]
                              >
                              > > You make the rule include whitespace:
                              >
                              > > ... ( (WS)? ':' { $setType( LABEL ); } )
                              >
                              > Hmm... I thought the OP said that the LABEL's could NOT have any WS between
                              > the ID and the ':'?

                              Maybe. I don't remember. I'm just going by what a label is in,
                              say, C. The OP is free to delete the "(WS)?"

                              - Paul
                            • Anthony Youngman
                              Message 14 of 16 , Nov 5, 2004
                                I don't know the OP's language, but I do know (very well) the AREV class
                                of languages that Monty mentioned.

                                And the original lexer that parsed that language used a "state
                                transition table". It's damn difficult to implement that table in an
                                ANTLR lexer. It's damn difficult to implement that table in an ANTLR
                                parser.

                                But creating a filter between the lexer and parser, whose sole purpose
                                is to provide semantic analysis of the token stream from the lexer
                                before it gets to the parser, makes analysing these languages a doddle.

                                Currently the lexer does both token and semantic analysis. It works a
                                lot of the time. But for some languages it's a disaster, and it would be
                                nice to be able to split the two jobs apart.

                                Cheers,
                                Wol
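A filter of this kind can be sketched as a small Python generator sitting between a lexer and a parser. It is a hedged illustration only: the `(type, lexeme)` tuples and the `relabel` name are assumptions, and it is not Monty's ParserFilter example.

```python
def relabel(tokens):
    """Yield tokens, turning an IDENT directly followed by COLON into a LABEL.

    `tokens` is any iterable of (type, lexeme) pairs. The COLON that marks
    a label is absorbed; that is a design choice of this sketch.
    """
    pending = None  # an IDENT held back until we see what follows it
    for tok in tokens:
        if pending is not None:
            if tok[0] == "COLON":
                yield ("LABEL", pending[1])
                pending = None
                continue  # the ':' is absorbed into the label
            yield pending  # not a label after all
            pending = None
        if tok[0] == "IDENT":
            pending = tok
        else:
            yield tok
    if pending is not None:
        yield pending  # trailing IDENT at end of input
```

The lexer stays purely lexical (IDENT, COLON, and so on), the parser only ever sees LABEL or IDENT, and any extra context needed to decide between them lives in the filter alone.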

                                -----Original Message-----
                                From: John D. Mitchell [mailto:johnm-antlr@...]
                                Sent: 04 November 2004 17:01
                                To: antlr-interest@yahoogroups.com
                                Subject: [antlr-interest] Re: lexical nondeterminism between IDENT &
                                LABEL


                                Well, I don't know what language you're trying to build so it's hard to
                                give you specific advice.

                                If your label construct is truly syntactic then using the fixed version
                                of
                                Monty's example of doing it in the lexer is a reasonable approach.

                                However, if the colon is overloaded (like '.' in the C language) but
                                the
                                ambiguities are all purely syntactic in nature then doing the more
                                complicated factoring as exemplified by the Number rule in the StdC
                                lexer
                                is a reasonable approach.

                                However, if the ambiguities related to ':' in your language require
                                semantic context then how you should resolve it in the parser depends on
                                the semantics of the language. For an example of this, check out the
                                StdCParser.g and look for the rules using COLON.


