
ignoring lexer rules

  • Ed Sinjiashvili
    Message 1 of 2, Mar 1, 2002
      Hi,

      I've tried to ask Terence about this issue and he pointed me to this
      mailing list. So here I am. Suppose I have the following grammar (which
      describes literal strings with escaped octal numbers inside):

      -----
      class Dummy extends Lexer;
      options
      {
          charVocabulary = '\3'..'\177';
      }

      {
          char scanOct(String txt)
          {
              char result = 0;
              try
              {
                  result = (char) Integer.parseInt(txt, 8);
              }
              catch (NumberFormatException e)
              {
                  result = 0;
              }
              return result;
          }
      }

      STR: '"' ( c = ESCAPE { text.append(c); }
               | ~('\\' | '"')
               )*
           '"'
         ;

      protected
      ESCAPE! returns [char c = 0]
          : '\\'!
            '0'..'7'
            (options {warnWhenFollowAmbig = false;} : '0'..'7'
             (options {warnWhenFollowAmbig = false;} : '0'..'7')? )?
            { c = scanOct($getText); }
          ;
      -----
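
      To make the intent concrete: given an input literal like "\101BC"
      (octal 101 is 'A'), I want the STR token's text to come out already
      interpolated. A tiny driver along these lines shows what I'm after
      (DummyDriver is just a hypothetical test harness; the output described
      in the comments is what I want to see, not what ANTLR currently gives me):

      -----
      import java.io.StringReader;
      import antlr.Token;

      public class DummyDriver {
          public static void main(String[] args) throws Exception {
              // The character stream is:  "  \  1  0  1  B  C  "
              Dummy lexer = new Dummy(new StringReader("\"\\101BC\""));
              Token t = lexer.nextToken();
              // Desired: the escape is already folded into a plain 'A', so the
              // token text reads "ABC" (quotes kept, since the STR rule does
              // not strip them).
              System.out.println(t.getText());
          }
      }
      -----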

      In other words, I'd like the tokenizer to return my strings already
      interpolated - that is, escaped octals should be converted to a char -
      and the parser should not be able to tell whether a particular
      character was in the string literally or resulted from escape
      substitution. Naturally, I used '!' on the ESCAPE rule to discard the
      matched octals and the backslash. This resulted in the following Java
      code (narrowed to exclude the irrelevant parts):

      -----
      protected final char mESCAPE(boolean _createToken) throws RecognitionException, CharStreamException, TokenStreamException {
          char c = 0; int _ttype; Token _token=null; int _begin=text.length();
          _ttype = ESCAPE;
          int _saveIndex;

          _saveIndex=text.length();
          match('\\');
          text.setLength(_saveIndex);
          _saveIndex=text.length();
          matchRange('0','7');
          text.setLength(_saveIndex);
          [ ... skipped ... ]

          c = scanOct(new String(text.getBuffer(),_begin,text.length()-_begin));
          if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
              _token = makeToken(_ttype);
              _token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
          }
          _returnToken = _token;
          return c;
      }
      -----

      ANTLR just wraps every alternative with "_saveIndex = text.length();"
      and "text.setLength(_saveIndex);". This causes my scanOct method to
      fail - everything that was matched has already been discarded by the
      "_saveIndex" wrappers. Besides, it looks a little wrong to me - we know
      that we are going to discard all the text, we know where it starts, and
      we know where it ends. Why not just cut it right before creating the
      token instance? That way actions can still access the matched text and
      work with it.
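
      To make the difference concrete, here is a toy model of the two
      truncation strategies, with a plain StringBuffer standing in for the
      lexer's text buffer (only an illustration of the idea, not actual
      ANTLR-generated code):

      -----
      public class TruncationDemo {
          static char scanOct(String txt) {
              try { return (char) Integer.parseInt(txt, 8); }
              catch (NumberFormatException e) { return (char) 0; }
          }

          public static void main(String[] args) {
              // Current behaviour: each matched piece is erased immediately,
              // so by the time the rule's action runs the buffer is empty.
              StringBuffer text = new StringBuffer();
              int _begin = text.length();
              int _saveIndex = text.length();
              text.append("101");                       // "match" the octal digits
              text.setLength(_saveIndex);               // ...and throw them away at once
              System.out.println((int) scanOct(text.substring(_begin))); // prints 0

              // Proposed behaviour: keep the text until the action has run,
              // then truncate once, just before the token would be created.
              text = new StringBuffer();
              _begin = text.length();
              text.append("101");                       // "match" the octal digits, keep them
              char c = scanOct(text.substring(_begin)); // the action still sees "101"
              text.setLength(_begin);                   // discard only after the action ran
              System.out.println(c);                    // prints A (octal 101 == decimal 65)
          }
      }
      -----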

      I've patched ANTLR 2.7.2a2's JavaCodeGenerator so that it produces the
      following Java code (nothing skipped this time):

      -----
      protected final char mESCAPE(boolean _createToken) throws RecognitionException, CharStreamException, TokenStreamException {
          char c = 0;
          int _ttype; Token _token=null; int _begin=text.length();
          _ttype = ESCAPE;
          int _saveIndex;

          _saveIndex=text.length();
          match('\\');
          text.setLength(_saveIndex);
          matchRange('0','7');
          {
              if (((LA(1) >= '0' && LA(1) <= '7'))) {
                  matchRange('0','7');
                  {
                      if (((LA(1) >= '0' && LA(1) <= '7'))) {
                          matchRange('0','7');
                      }
                      else if (((LA(1) >= '\u0003' && LA(1) <= '\u007f'))) {
                      }
                      else {
                          throw new NoViableAltForCharException((char)LA(1), getFilename(), getLine(), getColumn());
                      }
                  }
              }
              else if (((LA(1) >= '\u0003' && LA(1) <= '\u007f'))) {
              }
              else {
                  throw new NoViableAltForCharException((char)LA(1), getFilename(), getLine(), getColumn());
              }
          }
          c = scanOct(new String(text.getBuffer(),_begin,text.length()-_begin));
          text.setLength(_begin);
          if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
              _token = makeToken(_ttype);
              _token.setText(new String(text.getBuffer(), _begin, text.length()-_begin));
          }
          _returnToken = _token;
          return c;
      }
      -----

      As you can see, I'm still able to exclude arbitrary matches (the
      backslash in the example) from the text, and the text remains available
      to the action. Finally, I just discard all the text with
      "text.setLength(_begin)". Thus a rule marked with '!' looks to actions
      just like an ordinary rule - the only difference is that its text is
      not propagated. To put it more formally - these two pairs of rules are
      not equivalent in the current ANTLR 2.7.2a2 (IMHO they should be
      identical):

      ----- first pair
      STR: '"' ( (! c = ESCAPE) { text.append(c); }
               | ~('\\' | '"')
               )*
           '"'
         ;

      protected
      ESCAPE returns [char c = 0]
          : '\\'!
            '0'..'7' ('0'..'7' ('0'..'7')? )?
            { c = scanOct($getText); }
          ;

      ----- second pair
      STR: '"' ( c = ESCAPE { text.append(c); }
               | ~('\\' | '"')
               )*
           '"'
         ;

      protected
      ESCAPE! returns [char c = 0]
          : '\\'!
            '0'..'7'
            (options {warnWhenFollowAmbig = false;} : '0'..'7'
             (options {warnWhenFollowAmbig = false;} : '0'..'7')? )?
            { c = scanOct($getText); }
          ;
      -----


      --Ed
    • Sinan
      Message 2 of 2, Mar 6, 2002
        Ed Sinjiashvili wrote:
        >
        > Hi,
        >
        > I've tried to ask Terence about this issue and he pointed me to this
        > ML. So here I am. Suppose I have the following grammar(that describes
        > literal strings with escaped octal numbers inside):

        I would do something like:

        import antlr.*;


        public class MyTokenStreamSelector extends TokenStreamSelector {

            public MyTokenStreamSelector() {
                super();
            }

            public Token nextToken() throws TokenStreamException {
                for (;;) {
                    try {
                        Token tok = super.nextToken();

                        if (tok.getType() == MyParser.STRING) {
                            // code to replace the octal stuff, maybe regular expressions
                            // ...........
                        }

                        //System.out.println("returning:"+tok.getType()+":"+tok);
                        return tok;
                    }
                    catch (TokenStreamRetryException r) {
                        // just retry "forever"
                    }
                }
            }
        }

        ------------------------
        And then in your MyParser.g



        private MyTokenStreamSelector filter = null;

        public MyParser(MyTokenStreamSelector lexer) {
            this((TokenStream) lexer);
            filter = lexer;
        }


        ------------------
        and where you instantiate Parser/Lexer etc....


        public static MyTokenStreamSelector selector = new MyTokenStreamSelector();


        // notify the selector about the starting lexer; the name is for convenience
        selector.addInputStream(myMainLexer, "main");
        selector.select("main"); // start with the main lexer

        // Create a parser that reads from the scanner
        myParser = new MyParser(selector);
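
        ------------------
        The escape-replacement step left as a placeholder in nextToken() above
        could look roughly like this (just a sketch; the helper name is mine,
        and it assumes the lexer still leaves the raw \NNN escapes in the
        token text):

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class OctalEscapes {
            private static final Pattern OCTAL = Pattern.compile("\\\\([0-7]{1,3})");

            /** Replace every \NNN octal escape in s with the character it denotes. */
            public static String interpolate(String s) {
                Matcher m = OCTAL.matcher(s);
                StringBuffer out = new StringBuffer();
                while (m.find()) {
                    char c = (char) Integer.parseInt(m.group(1), 8);
                    // quoteReplacement keeps special replacement chars ('\', '$') safe
                    m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
                }
                m.appendTail(out);
                return out.toString();
            }
        }

        Inside the if block one could then do something like
        tok.setText(OctalEscapes.interpolate(tok.getText())) before returning
        the token.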