charVocabulary having no effect

  • André Næss
    Message 1 of 4 , Dec 13, 2004
      I'm struggling a bit with charVocabulary. After getting lots of
      strange "unexpected character" errors I figured that this was a rather
      important option. I therefore added

      charVocabulary = '\3'..'\377';

      to my Lexer options.

      But I'm still getting unexpected char errors. I have a fairly simple
      grammar with a non-greedy rule to match the contents of a specific
      portion. When the lexer encounters the char '=' in this portion it
      stops, saying "unexpected character". If I then add this:

      POINTLESS : '=' ;

      The error goes away, but then it stops on some other char. This
      continues until I've added all the chars not listed in some rule in
      the lexer. So to be sure it seems I will have to explicitly list *all*
      the ASCII characters.

      Grepping through the generated code I could not find a single
      reference to "charVocabulary" or "vocabulary". Is this option broken?

      I'm using Antlr 2.7.4 on Linux (Mandrake 10.0) with Java 1.4.2.

      The lexer definition from the grammar file:

      class QuerySchemaLexer extends Lexer;
      options {
          charVocabulary = '\3'..'\377';
          caseSensitiveLiterals = false;
      }

      RPAREN : ')';
      LPAREN : '(';
      COLON  : ':';
      SEMI   : ';';
      COMMA  : ',';

      ID
      options {
          testLiterals = true;
          paraphrase = "an identifier";
      }
          : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*
          ;

      WS  : ( ' '
            | '\t'
            | '\r' '\n' { newline(); }
            | '\n' { newline(); }
            )
            { $setType(Token.SKIP); } // ignore this token
          ;
    • Terence Parr
      Message 2 of 4 , Dec 13, 2004
        On Dec 13, 2004, at 6:16 AM, André Næss wrote:
        > I'm struggling a bit with charVocabulary. After getting lot's of
        > strange "unexpected character" errors I figured that this was a rather
        > important option. I therefore added
        >
        > charVocabulary = '\3'..'\377';

        Yeah, that should be the default. I think for 2.7.5 I'll do that.

        > To my Lexer options.
        >
        > But I'm still getting unexpected char errors. I have a fairly simple

        Is it because you have not included a rule to match those chars?
        Saying all ASCII in the vocab means you must have a rule to handle any
        ASCII char that comes in. For example, you have no rule for left
        bracket. If that comes along, you'll get an unexpected char error.

        > grammar with a non-greedy rule to match the contents of a specific
        > portion. When the lexer encounters the char '=' in this portion it
        > stops saying "unexpected character". If I then add this:
        >
        > POINTLESS : '=' ;
        >
        > The error goes away, but then it stops on some other char. This
        > continues until I've added all the chars not listed in some rule in
        > the lexer. So to be sure it seems I will have to explicitly list *all*
        > the ASCII characters.

        Well, yeah. If you want it to ignore stuff you don't have rules for,
        just set filter=true. :)
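
        As a minimal sketch of that suggestion, assuming the lexer
        options posted in the original message, filter is set next to
        charVocabulary in the lexer's options section. Only the filter
        line is new here; the rest of the lexer stays as posted:

        ```
        class QuerySchemaLexer extends Lexer;
        options {
            charVocabulary = '\3'..'\377';
            caseSensitiveLiterals = false;
            // In filter mode the lexer skips input that no rule matches
            // instead of reporting an "unexpected character" error.
            filter = true;
        }
        ```

        With this in place, a character such as '=' that no rule
        mentions should simply be discarded rather than stopping the
        lexer.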

        Ter
        --
        CS Professor & Grad Director, University of San Francisco
        Creator, ANTLR Parser Generator, http://www.antlr.org
        Cofounder, http://www.jguru.com
        Cofounder, http://www.knowspam.net enjoy email again!
      • Colm McHugh
        Message 3 of 4 , Dec 13, 2004
          Hi Andre,

          My understanding (and experience) is that you are
          going to get a lexer exception ("bad character" or
          whatever) for any character that is not explicitly
          used to define a token in your lexer (try defining the
          lower-case letter range of ID as 'a'..'y', and you
          should get an exception if you enter a 'z').

          The charVocabulary is used if you define a token as
          _not_ being a certain character or characters; then
          the charVocabulary is used to determine the set of
          characters the token can be.

          The classic case is a STRING token, the text of which
          is often defined as "anything except the quote
          character". What this really means is 'any
          charVocabulary character except a quote'. If you
          didn't specify a charVocabulary set, then your
          charVocabulary would be the set of characters
          explicitly used to define the tokens in your lexer.
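
          A sketch of that STRING case, under the assumption of a
          stand-alone lexer (the names StringLexer and STRING are
          illustrative and not taken from the grammar in this thread):

          ```
          class StringLexer extends Lexer;
          options {
              // The complement ~('"') below is computed against this set.
              charVocabulary = '\3'..'\377';
          }

          // Opening quote, any vocabulary character except a quote,
          // then the closing quote.
          STRING
              : '"' ( ~('"') )* '"'
              ;
          ```

          Without the charVocabulary line, ~('"') would only cover the
          characters that other rules happen to mention, so most string
          contents would be rejected.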

          Hope this helps,
          Colm.

        • André Næss
          Message 4 of 4 , Dec 13, 2004
            On Mon, 13 Dec 2004 11:37:03 -0800, Terence Parr <parrt@...> wrote:

            > > To my Lexer options.
            > >
            > > But I'm still getting unexpected char errors. I have a fairly simple
            >
            > Is it because you have not included a rule to match those chars?
            > Saying all ascii in the vocab means you must have a rule to handle any
            > ascii char that comes in. For example, you have no rule for left
            > bracket. If that comes along, you'll get an unexpected char error.

            Well yes, there was no rule to match the character, not until I added
            a dummy one anyway. But it was still my understanding that when I use
            .*, or in my case:

            (
            options {
            greedy=false;
            }
            : .
            )*

            charVocabulary would be used. Useful when you're matching strings for
            example. In my case I want to discard that which is matched by the .*,
            but what if I wanted to keep it?

            > > The error goes away, but then it stops on some other char. This
            > > continues until I've added all the chars not listed in some rule in
            > > the lexer. So to be sure it seems I will have to explicitly list *all*
            > > the ASCII characters.
            >
            > Well, yeah. If you want it to ignore stuff you don't have rules for,
            > just set filter=true. :)

            What I'm trying to achieve is to match a function prototype but ignore
            its body. Functions are declared using FUNCTION .. END_FUNCTION
            pairs, so I use the above non-greedy (.)* construct to match the body.
            But filter seems to work and is probably a much better idea, thanks
            for the tip!

            Regards
            André Næss