
Re: [antlr-interest] Token stream filter

  • Ric Klaren
    Message 1 of 18, Jun 2, 2004
      On Wed, Jun 02, 2004 at 03:37:15PM +0100, Anthony Youngman wrote:
      > If I don't want to eat up the newline at the end, is the following
      > likely to be a good/sensible parser rule?
      >
      > commentst : REMARK ( (LA(1) != newline) => . )* ;
      >
      > in other words, having found a REMARK, eat everything up to but not
      > including the next newline. Or is LA a lexer-only thing as well?

That one is a bit inefficient due to an exception per character.

      This is a dirty one but it works:

      comment: REMARK ( { if( LA(1) == NEWLINE ) break; } : . )* ;

      Tip: read the generated code.

      Cheers,

      Ric
      --
      -----+++++*****************************************************+++++++++-------
      ---- Ric Klaren ----- j.klaren@... ----- +31 53 4893755 ----
      -----+++++*****************************************************+++++++++-------
      'And this 'rebooting' business? Give it a good kicking, do you?' 'Oh, no,
      of course, we ... that is ... well, yes, in fact,' said Ponder. 'Adrian
      goes round the back and ... er ... prods it with his foot. But in a
      technical way,' he added. --- From: Hogfather by Terry Pratchett.
    • Monty Zukowski
    Message 2 of 18, Jun 2, 2004
        On Jun 2, 2004, at 7:37 AM, Anthony Youngman wrote:

        > commentst : REMARK ( (LA(1) != newline) => . )* ;

        What's wrong with this?

        commentst: REMARK (~(NEWLINE))* ;

        Monty Zukowski

        ANTLR & Java Consultant -- http://www.codetransform.com
        ANSI C/GCC transformation toolkit --
        http://www.codetransform.com/gcc.html
        Embrace the Decay -- http://www.codetransform.com/EmbraceDecay.html
      • Anthony Youngman
    Message 3 of 18, Jun 3, 2004
          Great! Talk about stating the bleeding obvious ... :-)

          Actually, I was going to ask you and Ter something ...

          Having looked at your filter stuff, I don't remember seeing a LB() (look
          before) function. Is there one, and if not, how easy would it be to
          implement and add to Antlr?

          While it might not be used much, it seems to me to be perfect for
          dealing with this "is it an identifier or token" problem. Rather than
          the mods you made to NextToken and so on, we could then simply have a
          rule

          commentst : {LB(1) == EOL || LB(1) == SEMI} (
          ("*" | "!") ...
          | id:IDENT {if id.getText != "REM" throw recognition-exception}
          ...

Okay, handling "REM" would be messy :-) Antlr's rule system is great for
dealing with tokens having different meanings when they follow other
stuff WITHIN a rule, but, as in this case, it doesn't always work when one
of the permitted positions is the first token in the rule ...

          Cheers,
          Wol

        • Anthony Youngman
    Message 4 of 18, Jun 3, 2004
            Thanks. Actually, Monty's solution should work ...

            but seeing as you seem to know these things, taking this line from my
            original post

            (id:IDENT {if text != "REM" throw tokenmatchexception}|"*"|"!")

            which is the exception I need to throw here?

            I think if I've got this, I've got enough to write my filter :-) While
            the LB() function might be useful, further thought on what Monty said
            made me think it might not be needed.

            So - I can feed the lexer output into my deremer parser - and I can then
            feed the output from that into my main parser?

            And if I have a rule like

            commentst : (EOL | SEMI) ("*" | "!")! (~(EOL)*)! ;

            it will then eat everything between the initial eol/semi and final eol,
            but it will let those two tokens through to the next parser?

            Cheers,
            Wol

          • Ric Klaren
    Message 5 of 18, Jun 3, 2004
              On Thu, Jun 03, 2004 at 09:24:07AM +0100, Anthony Youngman wrote:
              > Thanks. Actually, Monty's solution should work ...

              It looks a lot simpler ;)

              > but seeing as you seem to know these things, taking this line from my
              > original post

              I'm only theorizing ;)

              > (id:IDENT {if text != "REM" throw tokenmatchexception}|"*"|"!")
              >
              > which is the exception I need to throw here?

If you're trying to make the rule work inside a ( )=>( ) construct, then it
should be a RecognitionException (or something derived from it).
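
For illustration, adapting the rule from your post (the generated rule
methods already declare throws RecognitionException, so an action can
throw one directly; the message text is made up):

remarkst
    :   id:IDENT
        {
            // fail this alternative unless the identifier is REM; inside
            // a ( )=>( ) guess, the throw just makes the predicate fail
            if (!id.getText().equals("REM"))
                throw new RecognitionException("expected REM");
        }
    |   "*"
    |   "!"
    ;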

              > So - I can feed the lexer output into my deremer parser - and I can then
              > feed the output from that into my main parser?

              If you follow Monty's framework you should be ok I guess.

              > And if I have a rule like
              >
              > commentst : (EOL | SEMI) ("*" | "!")! (~(EOL)*)! ;
              >
              > it will then eat everything between the initial eol/semi and final eol,
              > but it will let those two tokens through to the next parser?

If you write the code in nextToken to do that, it will. The ! operator
controls tree building; it's not the lexer's ! operator. At least, I was
under the impression you wanted to use a parser to do the filtering, not a
lexer in front of your original lexer.

              Cheers,

              Ric
              --
              -----+++++*****************************************************+++++++++-------
              ---- Ric Klaren ----- j.klaren@... ----- +31 53 4893755 ----
              -----+++++*****************************************************+++++++++-------
              Time what is time - I wish I knew how to tell You why - It hurts to know -
              Aren't we machines - Time what is time - Unlock the door
              - And see the truth - Then time is time again
              From: 'Time what is Time' by Blind Guardian
            • Anthony Youngman
    Message 6 of 18, Jun 3, 2004
> If you write the code in nextToken to do that, it will. The ! operator
> controls tree building; it's not the lexer's ! operator. At least, I was
> under the impression you wanted to use a parser to do the filtering, not
> a lexer in front of your original lexer.

                Bugger :-(

                Yes, I did want to use a parser between my original lexer and parser. Or
                can I put a lexer there instead? Basically, I don't care whether it's a
                lexer or parser, I just want to sit it between my primary lexer and
                parser to strip out stuff I don't want and/or modify stuff I do.

                Can I lex a token stream as well as a character stream? And if so, will
                the second lexer see hidden tokens (I presume not).

                The trouble is (hint to Ter for the manual :-) that there's a chapter on
                lexing, and a chapter on treeparsing, but nothing on parsing. And the
                stuff on token streams implies substituting different lexers for
                different things. I want to process the data in multiple passes, not
                change to a different lexer.

                Cheers,
                Wol

              • Ric Klaren
    Message 7 of 18, Jun 3, 2004
                  On Thu, Jun 03, 2004 at 09:58:27AM +0100, Anthony Youngman wrote:
                  > > If you write the code in nextToken to do that it will. The ! operator
                  > > controls treebuilding it's not the lexer's ! operator. At least I was under
                  > > the impression you wanted to use a parser to do the filtering not a lexer
                  > > in front of your original lexer.
                  >
                  > Bugger :-(

                  Life is never easy ;)

                  > Yes, I did want to use a parser between my original lexer and parser. Or
                  > can I put a lexer there instead? Basically, I don't care whether it's a
                  > lexer or parser, I just want to sit it between my primary lexer and
                  > parser to strip out stuff I don't want and/or modify stuff I do.

                  Well technically you could put a lexer there but it would probably teach
                  you more about antlr internals than you want to know ;) Seriously though,
                  you can probably do everything you want with an extra parser in between
                  stuff.

                  > Can I lex a token stream as well as a character stream? And if so, will
                  > the second lexer see hidden tokens (I presume not).

You'll be parsing the token stream and storing tokens in a list/queue while
you don't yet know what to do with them.

You get a setup where:

1. Your original parser asks the filter parser for a token (by calling LA(x)).
2. Your filter parser comes into nextToken and:
   1. sees whether it has tokens queued to pass on;
   2. if not, scarfs tokens from the original lexer and sees what should be
      done with them;
   3. passes leftover bits to the calling parser.

This is the setup that is explained in Monty's filter example. He's using a
queue to store tokens that are still waiting to be passed on.

                  Consider the nextToken of his filter:

public Token nextToken() throws TokenStreamException
{
    Token ret;
    if (queue.length() <= 0)
    {
        try
        {
            jumpStatements();
        }
        catch (RecognitionException e) {;}
        catch (TokenStreamException e) {;}
    }
    if (queue.length() > 0)
    {
        ret = queue.elementAt(0);
        queue.removeFirst();
        return ret;
    }
    return new ArevToken(Token.EOF_TYPE, "");
}

jumpStatements in the above is a normal parser rule, i.e. it gets its
tokens from the original lexer. Generally it will do a number of LA(x)
calls to get tokens; these are checked by the match methods, which in turn
call the consume method when the match is successful. This is the
mechanism you want to use.

                  I'm not 100% sure if it will work but you can probably add a store boolean
                  attribute to the filter and then:

                  Original filter rule:

                  commentst : (EOL | SEMI) ("*" | "!")! (~(EOL)*)! ;

                  New:

commentst : { store = true; } (EOL | SEMI) { store = false; }
            ("*" | "!")! (~(EOL)*)! { store = true; } ;

Then in consume you'll only append tokens when store is true. You might
need a 'copy-the-rest' rule as well. Not sure, since I'm only just now
tinkering with this stuff myself (which also explains my interest in the
topic...).
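
To make that concrete, a rough sketch of such a consume override (this
assumes, as in Monty's example, a generated filter parser with the queue
from above; the store flag is the attribute just described):

public void consume() {
    try {
        if (store) {
            queue.append(LT(1)); // only pass tokens on while store is set
        }
    } catch (TokenStreamException e) {
        ; // LT(1) is already buffered by the time consume() runs
    }
    super.consume();             // advance past the token as usual
}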

You can also handcode nextToken to parse'n'massage the incoming token
stream. Considering the complexity of the commentst rule this might be a
good option; it's hardly worth using a parser for it. You can get a long
way by just implementing the TokenStream abstract interface and using
something like this for the filter (this lacks exception handling):

public Token nextToken() throws TokenStreamException
{
    if( queue.size() <= 0 )
    {
        // look for tokens
        if ((LA(1) == EOL) || (LA(1) == SEMI)) {
            queue.append(LT(1));
            origlexer.consume();

            if ((LA(1) == STAR) || (LA(1) == BANG)) {
                // don't queue
                origlexer.consume();
                while( LA(1) != EOL )
                {
                    // don't queue
                    origlexer.consume();
                }
            }
            else
            {
                queue.append(LT(1));
                origlexer.consume();
            }
        }
        else
        {
            queue.append(LT(1));
            origlexer.consume();
        }
    }
    if( queue.size() > 0 )
        ... return front token of the queue ...
    else
        ... return EOFTOKEN ...
}

                  Monty's original probably needs some tinkering with respect to
                  EOF/EOL/exception handling for your case.

                  Cheers,

                  Ric
                  --
                  -----+++++*****************************************************+++++++++-------
                  ---- Ric Klaren ----- j.klaren@... ----- +31 53 4893755 ----
                  -----+++++*****************************************************+++++++++-------
                  Innovation makes enemies of all those who prospered under the old
                  regime, and only lukewarm support is forthcoming from those who would
                  prosper under the new. --- Niccolò Machiavelli
                • Anthony Youngman
    Message 8 of 18, Jun 3, 2004
The more I read your comments and Monty's article, the clearer it all
becomes. But it's a lot to get my brain round. I want to avoid tinkering,
for two reasons: (1) I don't understand Java (or OO programming
generally), so the less I need to tackle at once, the easier the learning
curve; and (2) I want to convert all this to C++ once I've got a
functional grammar that does what I need (I've currently got the
tree-parser chucking out an assembler, which I am successfully executing
using an interpreter :-)

Seeing as you're heavily involved in all this Antlr stuff :-), would it be
possible to add a Filter class to the existing Lexer and Parser classes?
There's all this over-riding stuff in what Monty's done, and I guess
you're into the same sort of thing. And it would be so nice to be able to
massage the data stream - for me, for Monty, and maybe for you - as it is
passed from the lexer to the parser.

                    It would be something like

                    class BASICFilterParser extends FilterParser - taking a token stream from the lexer (or another Filter) and returning a token stream to a parser or filter.

                    It would then take standard rules, so I would have stuff like

                    commentst : (SEMI|EOL) ("!"|"*")! (~(EOL)!*) ;
                    remarkst : (SEMI|EOL) id:IDENT! {if id.getText != "REM" throw recognitionexception} ... ;

where, as you guessed, I expect "!" to mean "don't pass this token on" (or
mark it hidden, or whatever).

And I would have to trap Monty's "(GOTO | GO (TO)?) tok:. {if tok.getType
!= IDENT then tok.setType(IDENT)}" stuff ...

Basically the whole thing is a set of rules that, if a match is found,
allow the user to manipulate the token stream. Exactly what TokenFilter
does at the moment, but using standard grammar rules, without making the
user over-ride all the internal functions like NextToken (I'm way out of
my depth here ...)

I haven't totally sussed things here, but we'd need the filter to simply
try each rule in turn until one succeeded, and have a default "match
anything" rule (I haven't sussed whether the Antlr parser objects to
unrecognised tokens ...). And on matching a rule, it simply queues the lot
in a buffer for NextToken, before running the rules again when it runs out
of buffered tokens.
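
As a purely hypothetical sketch of that loop (invented names; mark() and
rewind() are the parser's backtracking hooks, and the queueing would be
done by an overridden consume as in Ric's store idea):

private void fillQueue() throws TokenStreamException {
    int m = mark();              // remember where we started
    try {
        commentst();             // try the first rule...
    } catch (RecognitionException e) {
        rewind(m);               // ...no match, so back up
        try {
            remarkst();          // ...and try the next one
        } catch (RecognitionException e2) {
            rewind(m);
            consume();           // default "match anything": the overridden
                                 // consume queues one token untouched
        }
    }
}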

A little bit of history about the language that Monty, RobC and I are all
trying to parse ... the original DATABASIC dialect was written in
assembler using a (now lost) state transition table. INFOBASIC (the
dialect I'm most familiar with) was written by a bunch of people whose
attitude was "if we think it's crap, we'll leave it out. Compatibility is
nice but not necessary". This variant never had the REM keyword! UVBasic
(the dialect I now use) was written using yacc/lex, and compatibility was
very important, so REM can lex as a label, a comment, a function call, and
a variable - except the compiler can get confused, resulting in
compilation errors :-( I'd like to follow the INFOBASIC line and just drop
the "REM" keyword, but I suspect it would break far too much code ... And
Monty's dialect, AREV BASIC? Very similar to INFOBASIC, I believe, but I
don't really know anything about it. RobC - if it compiles anywhere, he
wants his compiler to compile it :-)

                    Actually, while composing the previous paragraph in my head, I just realised something VERY useful! I seem to remember reading somewhere, that Antlr wasn't good at state-table type parsing. This FilterParser class would make an almost perfect state-table parser! The lexer would simply lex into IDENTs and NUMs or whatever, then the FilterParser would have a state variable which would be tested in a rule predicate, and within that you simply check the tokens that come through!
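
For what it's worth, a tiny invented illustration of that idea in ANTLR
syntax (the state values and rules are made up):

{ int state = 0; }   // member action: adds a state variable to the parser

stateA : { state == 0 }? IDENT { state = 1; } ;   // only tried in state 0
stateB : { state == 1 }? NUM   { state = 0; } ;   // only tried in state 1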

                    Cheers,
                    Wol

                  • Ric Klaren
    Message 9 of 18, Jun 3, 2004
                      On Thu, Jun 03, 2004 at 01:00:16PM +0100, Anthony Youngman wrote:
                      > The more I read your comments and Monty's article, the clearer it all
                      > becomes. But it's a lot to get my brain round. I want to avoid tinkering,
                      > for two reasons ... (1) I don't understand Java (or OO programming
                      > generally) so the less I need to tackle at once, the easier the learning
                      > curve, and (2) I want to convert all this to C++ once I've got a functional
                      > grammar that does what I need (I've currently got the tree-parser chucking
                      > out an assembler, which I am successfully executing using an interpreter
                      > :-)

                      > Seeing as you're heavily involved in all this Antlr stuff :-) would it be
                      > possible to add a Filter class to the existing Lexer and Parser classes?
                      > There's all this over-riding stuff in what Monty's done and I guess you're
                      > into the same sort of thing. And it would be so nice to be able to massage
                      > the data stream - for me, for Monty, and maybe for you - as it is passed
                      > from the lexer to the parser.
                      >
                      > It would be something like
                      >
                      > class BASICFilterParser extends FilterParser - taking a token stream from
                      > the lexer (or another Filter) and returning a token stream to a parser or
                      > filter.
                      >
                      > It would then take standard rules, so I would have stuff like
                      >
                      > commentst : (SEMI|EOL) ("!"|"*")! (~(EOL)!*) ;
                      > remarkst : (SEMI|EOL) id:IDENT! {if id.getText != "REM" throw recognitionexception} ... ;
                      >
                      > where as you guessed, I expect "!" to mean "don't pass this token on" (or
                      > mark it hidden, or whatever).

                      It's not a bad idea I guess, but the problem is that it needs changes in
                      the codegenerator. At least if you want that syntax with the '!' to work.

                      > Basically the whole thing is a set of rules that, if a match is found,
                      > allows the user to manipulate the token stream. Exactly what TokenFilter
                      > does at the moment, but using standard grammar rules without making the
                      > user over-ride all the internal functions like NextToken (I'm in this way
                      > out of my depth here ...)

                      I can give you a hand with that off list if you want. If you only add a few
                      bits of fluff to that handcoded bit I posted previously, then it should do
                      the trick.

Or if you want to invest in something more generic, I guess it should be
possible to make a custom base class for the parser (the FilterParser you
suggested) and modify how it filters with a list of functor objects
wrapping rules. E.g. take Ter's TokenStreams and wrap complete parser rules
into RewriteOperation objects (can one do that in Java? Hmm, it would be
quite fun in C++ anyway :) use some template metaprogramming to glue rules
together and try different ones upon the success of a previous one). That
would give something similar to Ter's tokenstreams, although probably with
slightly different applications. (Hmm, guess I got inspired ;) )
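
If it helps make that concrete, a hypothetical Java shape for such a
functor (invented names; nothing like this exists in ANTLR itself):

// each filter rule gets wrapped in one of these; a FilterParser base
// class would try them in order until one reports a match
interface FilterRule {
    // attempt the rule at the current input position; return true if it
    // matched and queued its output tokens
    boolean apply() throws antlr.TokenStreamException;
}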

                      > I haven't totally sussed things here, but we'd need the filter to simply
                      > try each rule in turn until one succeeded, and have a default "match
                      > anything" rule (I haven't sussed whether the Antlr parser objects to
                      > unrecognised tokens ...).

                      You can use a wildcard '.' (without the quotes) for that.

                      > And on matching a rule, it simply queues the lot in a buffer for NextToken,
                      > before running the rules again when it runs out of buffered tokens.

                      After some quick'n'dirty copy paste in various spots (Thanks Monty & Ter;) ):

---snip---
import java.util.LinkedList;

import antlr.CommonToken;
import antlr.Token;
import antlr.TokenBuffer;
import antlr.TokenStream;
import antlr.TokenStreamException;

public class DeREMer implements TokenStream {
    // tokens already cleared for passing on to the parser
    protected LinkedList queue = new LinkedList();
    // a TokenBuffer gives us LA()/LT()/consume() over the upstream lexer
    protected TokenBuffer input;

    public DeREMer(TokenStream upstream) {
        input = new TokenBuffer(upstream);
    }

    public Token nextToken() throws TokenStreamException {
        if (queue.size() <= 0) {
            // look for tokens; EOL, SEMI, STAR and BANG come from the
            // generated <YourParser>TokenTypes interface
            if ((input.LA(1) == EOL) || (input.LA(1) == SEMI)) {
                queue.add(input.LT(1));
                input.consume();

                if ((input.LA(1) == STAR) || (input.LA(1) == BANG)) {
                    // comment introducer: don't queue
                    input.consume();
                    while (input.LA(1) != EOL) {
                        // comment body: don't queue (no EOF check yet)
                        input.consume();
                    }
                } else {
                    queue.add(input.LT(1));
                    input.consume();
                }
            } else {
                queue.add(input.LT(1));
                input.consume();
            }
        }
        if (queue.size() > 0) {
            return (Token) queue.removeFirst();
        }
        return new CommonToken(Token.EOF_TYPE, "");
    }
}
---snip---

Now:

YourLexer l = new YourLexer( <yourinput> );
DeREMer d = new DeREMer(l);
YourParser p = new YourParser(d);

                      That should mostly work I guess. Probably missing some imports and antlr
                      namespace qualifications. But those should be mostly trivial to fix. Feel
                      free to contact me offlist if you need a hand.

                      Cheers,

                      Ric
                      --
                      -----+++++*****************************************************+++++++++-------
                      ---- Ric Klaren ----- j.klaren@... ----- +31 53 4893755 ----
                      -----+++++*****************************************************+++++++++-------
                      Chaos is found in greatest abundance wherever order is being sought.
--- Terry Pratchett
                    • Monty Zukowski
    Message 10 of 18, Jun 3, 2004
Just override nextToken() by subclassing your generated lexer and have
it put the token it returns into the LB buffer.
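
A rough sketch of what that could look like (MyLexer stands in for your
generated lexer; only a depth-1 LB is shown):

import antlr.Token;
import antlr.TokenStreamException;

public class MyLexerWithLB extends MyLexer {
    private Token lastToken;            // the one-deep "LB buffer"

    public MyLexerWithLB(java.io.Reader in) {
        super(in);
    }

    // look backwards: the token most recently handed to the parser
    public Token LB(int i) {
        return lastToken;               // only LB(1) supported here
    }

    public Token nextToken() throws TokenStreamException {
        Token t = super.nextToken();
        lastToken = t;                  // remember it before returning it
        return t;
    }
}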

                        Monty

                        On Jun 3, 2004, at 12:09 AM, Anthony Youngman wrote:

                        > Having looked at your filter stuff, I don't remember seeing a LB()
                        > (look
                        > before) function. Is there one, and if not, how easy would it be to
                        > implement and add to Antlr?


                        ANTLR & Java Consultant -- http://www.codetransform.com
                        ANSI C/GCC transformation toolkit --
                        http://www.codetransform.com/gcc.html
                        Embrace the Decay -- http://www.codetransform.com/EmbraceDecay.html