## RE: Problem recognizing 'OR' vs 'ORGANISM'

Message 1 of 4 , Dec 1, 1999
I think you can solve this with something like:

OR
    : "OR"
      (
        ("GANISM") { _ttype = ORGANISM; }
      | // nothing, stick with OR
      )
    ;

Also, if k is set to 3 in this case, "OR" and "ORG" will be recognized
as different.

A related problem: what if I set k = 3 and add another word, "ORGANIC"?
My first try would be:

OR
    : "OR"
      (
        ("GANISM") { _ttype = ORGANISM; }
      | ("GANIC") { _ttype = ORGANIC; }
      | // nothing, stick with OR
      )
    ;

The problem is that I get a lexical nondeterminism warning because "GANISM"
and "GANIC" share the prefix "GANI" and so cannot be told apart with 3 or
fewer characters of lookahead. I could increase k, but I would rather not if
I have only a few such words. How do I handle that?

Kevin Burton
Kevin.Burton@...

> -----Original Message-----
> From: Gould, Jack [SMTP:jgould@...]
> Sent: Wednesday, November 24, 1999 9:03 AM
> To: 'antlr-interest@onelist.com'
> Subject: [antlr-interest] Problem recognizing 'OR' vs 'ORGANISM'
>
> From: "Gould, Jack" <jgould@...>
>
> Hello,
>
> I have a fairly simple grammar, but one of the complexities is that
> I have to be able to recognize '||' and 'OR' as a logical or. The problem
> is, when a valid identifier starts with 'OR', the lexer returns the LOR
> token and does not recognize the identifier.
>
> Can anyone offer guidance? I'm used to the maximum munch styles of
> LEX and FLEX++.
>
> Thanks!
>
> - Jack Gould
> Cleveland, OH
>
> Token section of my grammar:
>
> //----------------------------------------------------------------------------
> // LEXER
> //----------------------------------------------------------------------------
> class SynergyQueryLexer extends Lexer;
>
> options
> {
> tokenVocabulary = SynQL; // call the vocabulary "SynQL"
> testLiterals = false; // don't automatically test for literals
> k = 4; // four characters of lookahead
> }
>
>
>
> // OPERATORS
> LPAREN
> : '(' ;
> RPAREN
> : ')' ;
> EQ
> : '=' { _ttype = EQ; }
> ( '=' )?
> | "EQUAL" ;
> LNOT
> : '!' ;
> BNOT
> : '~' ;
> NE
> : "!=" | "NOTEQUAL" ;
> GE
> : ">=" ;
> GT
> : ">" ;
> LE
> : '<' { _ttype = LT; }
> ( '=' { _ttype = LE; }
> | '>' { _ttype = NE; }
> )?
> ;
> LOR
> : "||" | ('O'|'o') ('R'|'r') ;
> LAND
> : "&&" | ('A'|'a') ('N'|'n') ('D'|'d') ;
>
>
> // Whitespace -- ignored
> WS
> : ( ' '
> | '\t'
> | '\f'
> // handle newlines
> | ( "\r\n" // Evil DOS
> | '\r' // Macintosh
> | '\n' // Unix (the right way)
> )
> { newline(); }
> )
> { _ttype = Token.SKIP; }
> ;
>
>
> SL_COMMENT
> : "//" (~'\n')* '\n'
> { _ttype = Token.SKIP; newline(); }
> ;
>
>
> ML_COMMENT
> : "/*"
> ( { LA(2)!='/' }? '*'
> | '\n' { newline(); }
> | ~('*'|'\n')
> )*
> "*/"
> { _ttype = Token.SKIP; }
> ;
>
>
> // character literals
> CHAR_LITERAL
> : '\''! (( ESC | ~'\'' ))* '\''!
> ;
>
> // string literals
> STRING_LITERAL
> : '"'! (ESC|~('"'|'\\'))* '"'!
> ;
>
>
> // escape sequence -- note that this is protected; it can only be called
> // from another lexer rule -- it will not ever directly return a token to
> // the parser
> // There are various ambiguities hushed in this rule. The optional
> // '0'...'9' digit matches should be matched here rather than letting
> // them go back to STRING_LITERAL to be matched. ANTLR does the
> // right thing by matching immediately; hence, it's ok to shut off
> // the FOLLOW ambig warnings.
> protected
> ESC
> : '\\'
> ( 'n'
> | 'r'
> | 't'
> | 'b'
> | 'f'
> | '"'
> | '\''
> | '\\'
> | ('u')+ HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
> | ('0'..'3')
> (
> options {
> warnWhenFollowAmbig = false;
> }
> : ('0'..'9')
> (
> options {
> warnWhenFollowAmbig = false;
> }
> : '0'..'9'
> )?
> )?
> | ('4'..'7')
> (
> options {
> warnWhenFollowAmbig = false;
> }
> : ('0'..'9')
> )?
> )
> ;
>
>
> // hexadecimal digit (again, note it's protected!)
> protected
> HEX_DIGIT
> : ('0'..'9'|'A'..'F'|'a'..'f')
> ;
>
>
> // a dummy rule to force vocabulary to be all characters (except special
> // ones that ANTLR uses internally (0 to 2)
> protected
> VOCAB
> : '\3'..'\377'
> ;
>
>
> // an identifier. Note that testLiterals is set to true! This means
> // that after we match the rule, we look in the literals table to see
> // if it's a literal or really an identifier
> IDENT
> options {testLiterals=true;}
> : ('a'..'z'|'A'..'Z'|'_'|'$')
> ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
> ;
>
>
> // a numeric literal
> NUM_INT
> {boolean isDecimal=false;}
> : '.' {_ttype = DOT;}
> (('0'..'9')+ (EXPONENT)? (FLOAT_SUFFIX)? { _ttype = NUM_FLOAT; })?
> | ( '0' {isDecimal = true;} // special case for just '0'
> ( ('x'|'X')
> (
> // hex
> // the 'e'|'E' and float suffix stuff look
> // like hex digits, hence the (...)+ doesn't
> // know when to stop: ambig. ANTLR resolves
> // it correctly by matching immediately. It
> // is therefore ok to hush warning.
> options {
> warnWhenFollowAmbig=false;
> }
> : HEX_DIGIT
> )+
> | ('0'..'7')+
> // octal
> )?
> | ('1'..'9') ('0'..'9')* {isDecimal=true;}
> // non-zero decimal
> )
> ( ('l'|'L')
>
> // only check to see if it's a float if looks like decimal so far
> | {isDecimal}?
> ( '.' ('0'..'9')* (EXPONENT)? (FLOAT_SUFFIX)?
> | EXPONENT (FLOAT_SUFFIX)?
> | FLOAT_SUFFIX
> )
> { _ttype = NUM_FLOAT; }
> )?
> ;
>
>
> // a couple protected methods to assist in matching floating point numbers
> protected
> EXPONENT
> : ('e'|'E') ('+'|'-')? ('0'..'9')+
> ;
>
>
> protected
> FLOAT_SUFFIX
> : 'f'|'F'|'d'|'D'
> ;
>
>
Message 2 of 4 , Dec 1, 1999
kevin.burton@... wrote:
>
> OR
> : "OR"
> (
> ("GANISM") { _ttype = ORGANISM; }
> | ("GANIC") { _ttype = ORGANIC; }
> | // nothing, stick with OR
> )
> ;
>

OR
    : "OR"
      (
        ("GANISM") => ("GANISM") { _ttype = ORGANISM; }
      | ("GANIC") => ("GANIC") { _ttype = ORGANIC; }
      | // nothing, stick with OR
      )
    ;
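
For readers unfamiliar with the `=>` syntax: a syntactic predicate asks the lexer to check, without consuming input, whether the text ahead matches the parenthesized fragment, and to take that alternative only if it does. The sketch below imitates that decision in plain Java. It is not ANTLR-generated code; the class `PredicateSketch`, the method `matchOr`, and the use of token names as strings are assumptions made purely for illustration.

```java
class PredicateSketch {
    // Hypothetical helper: decides which token the OR rule would produce
    // for the text starting at pos, or returns null if "OR" is not there.
    static String matchOr(String input, int pos) {
        if (!input.startsWith("OR", pos)) {
            return null;
        }
        int afterOr = pos + 2;
        // ("GANISM") => ... : peek at the whole suffix before committing
        if (input.startsWith("GANISM", afterOr)) {
            return "ORGANISM";
        }
        // ("GANIC") => ... : second alternative, tried only if the first failed
        if (input.startsWith("GANIC", afterOr)) {
            return "ORGANIC";
        }
        // empty alternative: stick with OR
        return "OR";
    }

    public static void main(String[] args) {
        System.out.println(matchOr("ORGANISM", 0));
        System.out.println(matchOr("ORGANIC", 0));
        System.out.println(matchOr("OR x", 0));
    }
}
```

Because each alternative is checked against its full text, the shared prefix "GANI" no longer matters and k does not need to grow. In this sketch an input such as "ORGAN" still falls through to OR and leaves "GAN" unconsumed, so identifiers that merely start with OR need separate handling.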
Message 3 of 4 , Dec 1, 1999
Jack,

Wouldn't it be easiest to simply extend the IDENT rule to check for these?
I'm not sure of the exact syntax (I don't know java that well), but the
general idea would be something like:

IDENT
    options {testLiterals=true;}
    : ('a'..'z'|'A'..'Z'|'_'|'$')
      ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
      { if ($getText.equals("or")) $setType(OR);
        else if ($getText.equals("and")) $setType(AND);
        // . . etc . .
        else $setType(IDENT);
      }
    ;

Now, if this does the job for you, then note you're essentially duplicating
the function of the testLiterals option and you could just directly specify
your logical tokens. That is, leave the IDENT rule as it is and add
something like

tokens {
OR="or";
AND="and";
. . etc . .
}

to your .g file. The testLiterals=true processing should then set the token
type as desired.
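
The idea behind the literals check can be pictured as "match the longest identifier first, then look the text up in a keyword table". The following is a minimal sketch of that idea in plain Java, not the code ANTLR generates. The class `LiteralsSketch`, the method `classify`, the `"TYPE:text"` return format, and the case-insensitive lookup are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class LiteralsSketch {
    // Hypothetical literals table: keyword text -> token type name.
    private static final Map<String, String> LITERALS = new HashMap<>();
    static {
        LITERALS.put("or", "OR");
        LITERALS.put("and", "AND");
    }

    // Same first-character set as the IDENT rule: letters, '_' and '$'.
    static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '$';
    }

    // Later characters may also be digits.
    static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    // Consume the longest identifier starting at pos, then consult the table.
    // Returns "TYPE:text", or null if no identifier starts at pos.
    static String classify(String input, int pos) {
        if (pos >= input.length() || !isIdentStart(input.charAt(pos))) {
            return null;
        }
        int end = pos + 1;
        while (end < input.length() && isIdentPart(input.charAt(end))) {
            end++;
        }
        String text = input.substring(pos, end);
        // Assumption: lower-casing before lookup so "OR" and "or" both match.
        String type = LITERALS.getOrDefault(text.toLowerCase(Locale.ROOT), "IDENT");
        return type + ":" + text;
    }

    public static void main(String[] args) {
        System.out.println(classify("OR x", 0));
        System.out.println(classify("ORGANISM", 0));
    }
}
```

Since the whole identifier is consumed before the lookup, "ORGANISM" can never be split into OR plus a remainder, which is the maximal-munch behavior Jack was expecting.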

Ken
