Re: more lexical determinism

  • howardckatz
    Message 1 of 9 , Dec 5, 2001
      --- In antlr-interest@y..., Terence Parr <parrt@j...> wrote:

      ...

      > As for distinguishing between the two kinds of words/ids, you could
      > do the following in one rule (assume Word unless you see _ or
      > digit):
      >
      > Word: ( Letter | '_' {$setType(Identifier);}) (Letter |
      > Digit{$setType(Identifier);})*;

      That didn't quite do it, I think. Doesn't the above say that anything
      starting with a Letter is a Word? But that's not what I want, since
      valid Identifiers can start with Letters too. The following should be
      legal input,

      id : word

      but throws an "Unexpected token: id" error. I would guess the parser
      sees this as "Word : Word" and accordingly chokes. Or am I
      misunderstanding something?
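
To make the behaviour described above concrete, here is a minimal Java sketch that mimics the quoted single rule by hand. It is not ANTLR-generated code: the class and method names (`RuleSketch`, `classify`) are hypothetical, `Letter` is assumed to mean ASCII letters, and the method classifies a whole candidate string rather than matching a prefix the way a real lexer would.

```java
// Sketch only: a hand-written stand-in for the quoted lexer rule.
// RuleSketch, classify and the ASCII-only Letter/Digit tests are assumptions.
class RuleSketch {
    enum Type { WORD, IDENTIFIER, NO_MATCH }

    // Mirrors: Word: ( Letter | '_' {setType} ) ( Letter | Digit {setType} )* ;
    static Type classify(String text) {
        if (text.isEmpty()) return Type.NO_MATCH;
        Type type = Type.WORD;                 // default token type of the rule
        char first = text.charAt(0);
        if (first == '_') {
            type = Type.IDENTIFIER;            // action attached to the '_' alternative
        } else if (!isLetter(first)) {
            return Type.NO_MATCH;
        }
        for (int i = 1; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isDigit(c)) {
                type = Type.IDENTIFIER;        // action attached to the Digit alternative
            } else if (!isLetter(c)) {
                return Type.NO_MATCH;          // the rule accepts '_' only as first character
            }
        }
        return type;
    }

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        for (String s : new String[] {"id", "word", "x1", "_tmp"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Under these assumptions both `id` and `word` come out as WORD, which is consistent with the parser seeing "Word : Word" for the input `id : word`.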

      Howard
    • Sinan
      Message 2 of 9 , Dec 5, 2001
        howardckatz wrote:
        >
        > --- In antlr-interest@y..., Terence Parr <parrt@j...> wrote:
        >
        > ...
        >
        > > As for distinguishing between the two kinds of words/ids, you could
        > > do the following in one rule (assume Word unless you see _ or
        > > digit):
        > >
        > > Word: ( Letter | '_' {$setType(Identifier);}) (Letter |
        > > Digit{$setType(Identifier);})*;
        >
        > That didn't quite do it, I think, Doesn't the above say that anything
        > starting with a Letter is a Word? But that's not what I want, since
        > valid Identifiers can start with Letters too. The following should be
        > legal input,
        >
        > id : word
        >
        > but throws an "Unexpected token: id" error. I would guess the parser
        > sees this as "Word : Word" and accordingly chokes. Or am I
        > misunderstanding something?
        >
        > Howard

        There is no way the lexer can distinguish between word and id, since
        they have the same production (or rather, word is a subset of id:
        every all-letter word is also a valid id).

        If you want to make the distinction in the lexer, then you have to do
        something like

        AnId : (Id Colon Word)=> Id ;


        But then you can't have an Id without a Colon following.

        One expensive way to do it is to pull everything into the parser except
        characters, then

        rule1 : id Colon word ;

        id: Character+ ;

        word : Character+ ;

        or whatever....

        But now you will get a zillion non-determinisms, which you fix with
        syntactic predicates:

        rules:
        (rule1)=> rule1
        | (rule2)=> rule2
        | etc....
        ;

        This tends to be very expensive, but almost unavoidable in cases like
        Fortran, where whitespace has no meaning.

        Don't forget that the lexer rules (productions/methods) are not called
        by the parser. Actually, if a rule is not protected, it is called from
        nextToken in some magical order, and the first maximum match will win.

        So you'll get either all words or all ids (except when "_" is
        present).

        Sinan
      • howardckatz
        Message 3 of 9 , Dec 5, 2001
          This has been an interesting exercise. I can see that this particular
          problem -- where two tokens consist of closely overlapping character
          sets -- is one that antlr doesn't handle that well. I can see one
          other approach that might work -- sticking some string-parsing Java
          code of my own either into the parser grammar or maybe in a
          downstream TokenStream. Time to play I guess ...
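
One possible shape for the downstream-filter idea is sketched below in plain Java. It does not use the real ANTLR `TokenStream` or `Token` classes: `RetypeFilter`, `Tok` and `retype` are hypothetical names, and the rule "a Word directly before a colon is really an Identifier" is an assumption taken from the `id : word` example in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a post-lexing pass over an already tokenized input.
// Tok and Type are hypothetical stand-ins, not ANTLR classes.
class RetypeFilter {
    enum Type { WORD, IDENTIFIER, COLON }

    static final class Tok {
        final Type type;
        final String text;

        Tok(Type type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    // Returns a new list in which every WORD immediately followed by a COLON
    // has been retyped as IDENTIFIER; all other tokens pass through unchanged.
    static List<Tok> retype(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Tok t = in.get(i);
            boolean beforeColon = i + 1 < in.size() && in.get(i + 1).type == Type.COLON;
            if (t.type == Type.WORD && beforeColon) {
                out.add(new Tok(Type.IDENTIFIER, t.text));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok(Type.WORD, "id"));
        tokens.add(new Tok(Type.COLON, ":"));
        tokens.add(new Tok(Type.WORD, "word"));
        System.out.println(retype(tokens));
    }
}
```

The one-token lookahead keeps the lexer itself simple; whether it is enough depends on where else Identifiers may appear in the real grammar, which this thread does not say.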

          Thanks for your help,
          Howard

        • Sinan
          Message 4 of 9 , Dec 6, 2001
            howardckatz wrote:
            >
            > This has been an interesting exercise. I can see that this particular
            > problem -- where two tokens consist of closely overlapping character
            > sets -- is one that antlr doesn't handle that well. I can see one
            > other approach that might work -- sticking some string-parsing Java
            > code of my own either into the parser grammar or maybe in a
            > downstream TokenStream. Time to play I guess ...
            >

            yacc/lex won't either.

            What you should really do is assume that a could have '_' and
            b can't.

            then you really have something like

            rule : (a | b) COLON b;

            so in lexer you say

            B : ( LETTER | DIGIT | '_' { $setType(A); } )+ ;


            in parser

            rule: (A | B) COLON B;

            or pushing into other rules

            rule : id COLON word ;

            id : A | B;
            word : B;
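
The lexer/parser split above can be sketched by hand in Java as follows. This is only an illustration under assumptions, not generated ANTLR code: `ColonRule`, `lex` and `matchesRule` are hypothetical names, letters and digits are whatever `Character.isLetterOrDigit` accepts, and only spaces are skipped as whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hand-written equivalents of the lexer and parser rules above.
class ColonRule {
    enum Type { A, B, COLON }

    // Lexer side: B : ( LETTER | DIGIT | '_' {set type to A} )+ ;
    static List<Type> lex(String input) {
        List<Type> types = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ') { i++; continue; }                  // skip spaces
            if (c == ':') { types.add(Type.COLON); i++; continue; }
            if (!isNameChar(c)) {
                throw new IllegalArgumentException("unexpected char: " + c);
            }
            Type type = Type.B;                               // default type
            while (i < input.length() && isNameChar(input.charAt(i))) {
                if (input.charAt(i) == '_') type = Type.A;    // '_' switches the type
                i++;
            }
            types.add(type);
        }
        return types;
    }

    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Parser side: rule : id COLON word ;  id : A | B ;  word : B ;
    static boolean matchesRule(List<Type> t) {
        return t.size() == 3
            && (t.get(0) == Type.A || t.get(0) == Type.B)
            && t.get(1) == Type.COLON
            && t.get(2) == Type.B;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"id : word", "my_id : word", "id : my_word"}) {
            System.out.println(s + " -> " + matchesRule(lex(s)));
        }
    }
}
```

The point of the design is that the lexer never has to guess: it reports only what it can see in the characters, and the parser decides from position that a B on the left of the colon plays the role of an id.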

            Sinan
          • tbrandonau
            Message 5 of 9 , Dec 6, 2001
              You want anything with all letters to be a word and anything with
              a '_' or digit to be an identifier, right? So can't you just have:
              Word:
              (
              Letter
              | '_' {$setType(Identifier);}
              | Digit {$setType(Identifier);}
              )+
              ;
              i.e. if it's got an '_' or a digit it's an identifier, otherwise
              it's a word.

              But, you have non-determinism in that "Hello" is a valid word and a
              valid identifier, and it will get recognized as a valid Word. So in
              the parser you'd need:
              pair: (Identifier|Word) COLON Word;
              Then you could create an Identifier Token/AST for the LHS Word in
              the parser.

              Tom.