Re: [antlr-interest] lexical nondeterminism between IDENT & LABEL

  • Monty Zukowski
    Message 1 of 16, Nov 3, 2004
      On Nov 3, 2004, at 4:57 AM, thoth2487 wrote:

      >
      > // identifier
      > IDENT
      > : ('a'..'z' | 'A'..'Z')
      > ('a'..'z' | 'A'..'Z' | '0'..'9')*
      > ;
      >
      > // label
      > LABEL
      > : ('a'..'z' | 'A'..'Z')
      > ('a'..'z' | 'A'..'Z' | '0'..'9')*
      > ':'
      > ;

      Factor it into only one rule:

      // identifier or
      // label
      IDENT
      : ('a'..'z' | 'A'..'Z')
      ('a'..'z' | 'A'..'Z' | '0'..'9')*
      ':' {$setType(LABEL);}
      ;

      Be sure to add LABEL to your tokens{} section.

      Monty
    • Paul J. Lucas
      Message 2 of 16, Nov 3, 2004
        On Wed, 3 Nov 2004, Monty Zukowski wrote:

        > IDENT
        > : ('a'..'z' | 'A'..'Z')
        > ('a'..'z' | 'A'..'Z' | '0'..'9')*
        > ':' {$setType(LABEL);}
        > ;

        Shouldn't that be:


        IDENT
        : ('a'..'z' | 'A'..'Z')
        ('a'..'z' | 'A'..'Z' | '0'..'9')*
        (':' {$setType(LABEL);} )?
        ;

        ?

        - Paul
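
For readers who want to see what the factored rule does at run time, here is a minimal hand-written sketch in Java. It is not ANTLR-generated code and uses no ANTLR classes; the class name `FactoredIdentLexer`, the `Token` record and the `lex` method are hypothetical names chosen for illustration. It scans an identifier and then applies the optional `(':' {$setType(LABEL);})?` step, so the colon becomes part of the LABEL token text.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of: IDENT : letter (letter|digit)* (':' {$setType(LABEL);})? ;
// All names here are hypothetical; this is not what ANTLR itself generates.
class FactoredIdentLexer {
    enum Type { IDENT, LABEL }

    record Token(Type type, String text) {}

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Tokenizes identifiers and labels; blanks, tabs and form feeds are skipped.
    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\f') {
                i++;
                continue;
            }
            if (!isLetter(c)) {
                throw new IllegalArgumentException("unexpected char '" + c + "' at " + i);
            }
            int start = i++;
            while (i < input.length()
                    && (isLetter(input.charAt(i)) || isDigit(input.charAt(i)))) {
                i++;
            }
            Type type = Type.IDENT;
            // Optional subrule: a ':' directly after the name turns the token into a LABEL.
            if (i < input.length() && input.charAt(i) == ':') {
                i++;
                type = Type.LABEL;
            }
            tokens.add(new Token(type, input.substring(start, i)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("MAIN: x1 loop2:"));
    }
}
```

With this sketch, `lex("MAIN: x1")` yields a LABEL with text `MAIN:` followed by an IDENT with text `x1`, while `lex("MAIN :")` is rejected because the colon no longer follows the name directly.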
      • Monty Zukowski
        Message 3 of 16, Nov 3, 2004
          On Nov 3, 2004, at 11:54 AM, Paul J. Lucas wrote:

          >
          > On Wed, 3 Nov 2004, Monty Zukowski wrote:
          >
          >> IDENT
          >> : ('a'..'z' | 'A'..'Z')
          >> ('a'..'z' | 'A'..'Z' | '0'..'9')*
          >> ':' {$setType(LABEL);}
          >> ;
          >
          > Shouldn't that be:
          >
          >
          > IDENT
          > : ('a'..'z' | 'A'..'Z')
          > ('a'..'z' | 'A'..'Z' | '0'..'9')*
          > (':' {$setType(LABEL);} )?
          > ;
          >
          > ?

          Absolutely!

          Thanks,

          Monty
        • Monty Zukowski
          Message 4 of 16, Nov 3, 2004
            On Nov 3, 2004, at 9:49 AM, John D. Mitchell wrote:

            >
            >>>>>> "thoth2487" == thoth2487 <thoth2487@...> writes:
            > [...]
            >
            >> Hi to all, I have a very simple language in which there are IDENTifiers
            >> and jump LABELs. An IDENTifier starts with ('a'..'z')|('A'..'Z') and
            >> continues with ('a'..'z')|('A'..'Z')|('0'..'9'), and a LABEL is like an
            >> IDENTifier but ends with a ':'. When I try the following .g file I always
            >> get a lexical nondeterminism which I am unable to resolve:
            >
            > Stop trying to do that in the lexer. Let the lexer return the ID for
            > both and then have your parsing rules distinguish between ID ":" being
            > a label or the ID is just an ID.
            >


            Some languages like AREV are made much simpler if the lexer can
            distinguish labels from IDs, that was the whole inspiration for my
            ParserFilter example on my website.

            Monty

            ANTLR & Java Consultant -- http://www.codetransform.com
            ANSI C/GCC transformation toolkit --
            http://www.codetransform.com/gcc.html
            Embrace the Decay -- http://www.codetransform.com/EmbraceDecay.html
          • John D. Mitchell
            Message 5 of 16, Nov 3, 2004
              >>>>> "Paul" == Paul J Lucas <pauljlucas@...> writes:
              [...]

              > What's wrong with using a syntactic predicate in the lexer?

              Theoretically? Nothing.

              > Just because a label happens to have the same character pattern as an
              > identifier doesn't mean it's conceptually the same kind of token.

              Indeed.

              However:

              (A) Newbies (and even experienced folks :-) too often try to jam way too
              much into the lexer. This is a Very Bad Thing(tm) and, IMHO, should be
              generally discouraged.

              (B) A common reason given is that "the language is simple" so just do it in
              the lexer. All too often, that's not the case and semantic context is
              required. (See my comment below).

              (C) When people start their "simple" solutions in the lexer and things get
              wacky, they all too often try to hack things to e.g. push context back into
              the lexer from the parser to "fix" the problem (and that's Pure Evil(tm) :-).

              > Since ANTLR has a much more powerful lexer than most, why not take
              > advantage of it?

              For a complex example of how to deal with this sort of confusion in the lexer,
              check out the Number rule near the bottom of the StdC grammar. This is
              dealing with purely syntactic ambiguity because of the many uses of '.'.

              Take care,
              John
            • thoth2487
              Message 6 of 16, Nov 3, 2004
                --- In antlr-interest@yahoogroups.com, "John D. Mitchell"
                <johnm-antlr@n...> wrote:
                > Stop trying to do that in the lexer.
                > Let the lexer return the ID for both and then have your
                > parsing rules distinguish between ID ":" being a label
                > or the ID is just an ID.

                I've tried your suggested parser way with:

                ident: IDENT;

                label: IDENT
                COLON
                ;

                but in this way a LABEL could be either:

                MAIN: // right LABEL
                or
                MAIN : // wrong LABEL due to space(s)

                so I need to change the WS rule from:
                WS: (' '|'\t'|'\f') {$setType(Token.SKIP);};

                to
                WS: (' '|'\t'|'\f')*;

                Now the parser works fine with 'ident' & 'label',
                but the new WS behaviour makes the parser rules more complex,
                since they must always check for the presence of WS too. E.g.:

                conditional:
                IF
                (WS)?
                expression
                (WS)?
                goto
                ........ and so on

                What have I done wrong? What do you suggest?

                Thank you very much
                Silverio Diquigiovanni
              • Paul J. Lucas
                Message 7 of 16, Nov 3, 2004
                  On Thu, 4 Nov 2004, thoth2487 wrote:

                  > Now the parser works fine with 'ident' & 'label',
                  > but the new WS behaviour makes the parser rules more complex,
                  > since they must always check for the presence of WS too. E.g.:
                  >
                  > conditional:
                  > IF
                  > (WS)?
                  > expression
                  > (WS)?
                  > goto
                  > ........ and so on
                  >
                  > What have I done wrong? What do you suggest?

                  You make the rule include whitespace:

                  ... ( (WS)? ':' { $setType( LABEL ); } )

                  - Paul
                • thoth2487
                  Message 8 of 16, Nov 4, 2004
                    <pauljlucas@m...> wrote:

                    > What's wrong with using a syntactic predicate in the lexer?

                    Can you give me a sample of a syntactic predicate that solves
                    the ident/label lexical nondeterminism below?

                    IDENT: ('A'..'Z'|'a'..'z')*;
                    LABEL: ('A'..'Z'|'a'..'z')* ':';

                    Thank you very much
                    Silverio Diquigiovanni
                  • Paul J. Lucas
                    Message 9 of 16, Nov 4, 2004
                      On Thu, 4 Nov 2004, thoth2487 wrote:

                      > Can you give me a sample of a syntactic predicate that solves
                      > the ident/label lexical nondeterminism below?

                      protected Ident
                      : /* fill in the blank */
                      ;

                      IDENT
                      : (Ident (WS)? ':')=> Ident { $setType( LABEL ); }
                      | Ident
                      ;

                      Monty's solution makes the ':' part of the token; the above
                      doesn't. Hence, the above is "cleaner" from the parser
                      perspective.

                      - Paul
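
The difference between the two lexer approaches can be seen in a small hand-written Java sketch of the lookahead idea behind the syntactic-predicate rule above. It is only an illustration under stated assumptions: `LookaheadLabelLexer`, `Token` and `lex` are hypothetical names, no ANTLR classes are used, the colon is assumed to be lexed as a separate COLON token, and the lookahead skips any run of blanks rather than at most one WS match. The key point is that the lookahead decides the token type without consuming the `:`.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of: (Ident (WS)? ':')=> Ident {$setType(LABEL);} | Ident
// All names here are hypothetical; this is not what ANTLR itself generates.
class LookaheadLabelLexer {
    enum Type { IDENT, LABEL, COLON }

    record Token(Type type, String text) {}

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static boolean isBlank(char c) {
        return c == ' ' || c == '\t' || c == '\f';
    }

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (isBlank(c)) {
                i++;
            } else if (c == ':') {
                tokens.add(new Token(Type.COLON, ":"));
                i++;
            } else if (isLetter(c)) {
                int start = i++;
                while (i < input.length()
                        && (isLetter(input.charAt(i)) || isDigit(input.charAt(i)))) {
                    i++;
                }
                // Lookahead only: peek past optional blanks for ':' without consuming it.
                int look = i;
                while (look < input.length() && isBlank(input.charAt(look))) {
                    look++;
                }
                boolean colonAhead = look < input.length() && input.charAt(look) == ':';
                Type type = colonAhead ? Type.LABEL : Type.IDENT;
                tokens.add(new Token(type, input.substring(start, i)));
            } else {
                throw new IllegalArgumentException("unexpected char '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("MAIN : x"));
    }
}
```

Here `lex("MAIN : x")` produces a LABEL with text `MAIN` (no colon in the text), then a COLON, then an IDENT `x`. Removing the blank-skipping loop gives the stricter behaviour in which `MAIN :` is not a label.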
                    • John D. Mitchell
                      Message 10 of 16, Nov 4, 2004
                        >>>>> "thoth2487" == thoth2487 <thoth2487@...> writes:
                        [...]

                        > MAIN: // right LABEL or MAIN : // wrong LABEL due to space(s)

                        Well, I don't know what language you're trying to build so it's hard to
                        give you specific advice.

                        If your label construct is truly syntactic then using the fixed version of
                        Monty's example of doing it in the lexer is a reasonable approach.

                        However, if the colon is overloaded (like '.' in the C language) but the
                        ambiguities are all purely syntactic in nature then doing the more
                        complicated factoring as exemplified by the Number rule in the StdC lexer
                        is a reasonable approach.

                        However, if the ambiguities related to ':' in your language require
                        semantic context then how you should resolve it in the parser depends on
                        the semantics of the language. For an example of this, check out the
                        StdCParser.g and look for the rules using COLON.


                        > so I need to change WS rule from: WS: (' '|'\t'|'\f')
                        > {$setType(Token.SKIP);};

                        > to WS: (' '|'\t'|'\f')*;

                        That's pretty much never a good idea for exactly the reason that you
                        discovered.

                        Hope this helps,
                        John
                      • John D. Mitchell
                        Message 11 of 16, Nov 4, 2004
                          >>>>> "Paul" == Paul J Lucas <pauljlucas@...> writes:
                          [...]

                          > You make the rule include whitespace:

                          > ... ( (WS)? ':' { $setType( LABEL ); } )

                          Hmm... I thought the OP said that the LABELs could NOT have any WS between
                          the ID and the ':'?

                          Thanks,
                          John
                        • Paul J. Lucas
                          Message 12 of 16, Nov 4, 2004
                            On Thu, 4 Nov 2004, John D. Mitchell wrote:

                            > >>>>> "Paul" == Paul J Lucas <pauljlucas@...> writes:
                            > [...]
                            >
                            > > You make the rule include whitespace:
                            >
                            > > ... ( (WS)? ':' { $setType( LABEL ); } )
                            >
                            > Hmm... I thought the OP said that the LABELs could NOT have any WS between
                            > the ID and the ':'?

                            Maybe. I don't remember. I'm just going by what a label is in,
                            say, C. The OP is free to delete the "(WS)?"

                            - Paul
                          • Anthony Youngman
                            Message 13 of 16, Nov 5, 2004
                              I don't know the OP's language, but I do know (very well) the AREV class
                              of languages that Monty mentioned.

                              And the original lexer that parsed that language used a "state
                              transition table". It's damn difficult to implement that table in an
                              ANTLR lexer. It's damn difficult to implement that table in an ANTLR
                              parser.

                              But creating a filter between the lexer and parser, whose sole purpose
                              is to provide semantic analysis of the token stream from the lexer
                              before it gets to the parser, makes analysing these languages a doddle.

                              Currently the lexer does both token and semantic analysis. It works a
                              lot of the time. But for some languages it's a disaster, and it would be
                              nice to be able to split the two jobs apart.

                              Cheers,
                              Wol
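
To make the "filter between the lexer and parser" idea concrete, here is a minimal Java sketch. It is not Monty's ParserFilter example and it does not use ANTLR's own token-stream classes; `LabelFilter`, `Token` and `filter` are hypothetical names, and the rule applied (an IDENT is a label only when a COLON starts exactly where the IDENT ends) is an assumption based on the OP's "MAIN:" versus "MAIN :" requirement. The lexer stays simple and emits only IDENT and COLON; the filter retypes tokens before the parser sees them.

```java
import java.util.ArrayList;
import java.util.List;

// Token-stream filter sketch: sits between a simple lexer and the parser.
// All names here are hypothetical and independent of ANTLR's classes.
class LabelFilter {
    enum Type { IDENT, COLON, LABEL }

    // 'start' is the character offset of the token in the input.
    record Token(Type type, String text, int start) {
        int end() {
            return start + text.length();
        }
    }

    // Retypes IDENT as LABEL when a COLON follows with no gap; that COLON is dropped.
    static List<Token> filter(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Token t = in.get(i);
            boolean adjacentColon = t.type() == Type.IDENT
                    && i + 1 < in.size()
                    && in.get(i + 1).type() == Type.COLON
                    && in.get(i + 1).start() == t.end();
            if (adjacentColon) {
                out.add(new Token(Type.LABEL, t.text(), t.start()));
                i++; // skip the colon that was merged into the label
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens for the input "MAIN: x :" as a whitespace-skipping lexer might emit them.
        List<Token> raw = List.of(
                new Token(Type.IDENT, "MAIN", 0),
                new Token(Type.COLON, ":", 4),
                new Token(Type.IDENT, "x", 6),
                new Token(Type.COLON, ":", 8));
        System.out.println(filter(raw));
    }
}
```

Because the decision uses token positions rather than WS tokens, the WS rule can keep skipping whitespace and the parser rules do not need `(WS)?` everywhere.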

                              -----Original Message-----
                              From: John D. Mitchell [mailto:johnm-antlr@...]
                              Sent: 04 November 2004 17:01
                              To: antlr-interest@yahoogroups.com
                              Subject: [antlr-interest] Re: lexical nondeterminism between IDENT &
                              LABEL


                              Well, I don't know what language you're trying to build so it's hard to
                              give you specific advice.

                              If your label construct is truly syntactic then using the fixed version
                              of Monty's example of doing it in the lexer is a reasonable approach.

                              However, if the colon is overloaded (like '.' in the C language) but
                              the ambiguities are all purely syntactic in nature then doing the more
                              complicated factoring as exemplified by the Number rule in the StdC
                              lexer is a reasonable approach.

                              However, if the ambiguities related to ':' in your language require
                              semantic context then how you should resolve it in the parser depends on
                              the semantics of the language. For an example of this, check out the
                              StdCParser.g and look for the rules using COLON.


