The vocabulary problem
> When you get into multiple tree parsers and start using the tokdef option,
> you will undoubtedly change your grammar which will change the token types
> file. Then if you don't rebuild all of your tree parsers you will have
> errors where the token type doesn't match what the parser had. All of a
> sudden the INT from the parser is now the STRING in the tree walker, etc.
> It may look like your parser is building wacky trees, but it's just that the
> token types no longer match. Rebuild all your tree parsers and you'll be
> fine.
OK, I have been waiting for this issue to crop up so I could ask about a
solution. A team of people here have written a translator, and in the
process have ~20 tree walker files, mostly to walk small constructs in
Language A and transform them into small trees in Language B, using
symbol-tables and other structures.
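The mismatch being described is easy to reproduce with a small sketch (the token names and numbers below are invented for illustration, not taken from any real grammar):

```python
# Hypothetical token vocabularies; names and numbers are illustrative only.
# The parser was built against the old vocabulary...
old_vocab = {"INT": 4, "STRING": 5, "ID": 6}
# ...then a rule was added to the grammar, shifting every later number.
new_vocab = {"FLOAT": 4, "INT": 5, "STRING": 6, "ID": 7}

def decode(ttype, vocab):
    # Map a token number back to its symbolic name in a given vocabulary.
    names = {num: name for name, num in vocab.items()}
    return names.get(ttype, "<invalid>")

# The parser tags tree nodes with old numbers; a tree walker rebuilt
# against the new vocabulary decodes those numbers with the new names.
node_type = old_vocab["INT"]                 # parser says: this node is INT (4)
seen_by_walker = decode(node_type, new_vocab)
print(seen_by_walker)                        # the walker thinks it is FLOAT
```

The trees are built correctly; only the number-to-name mapping has shifted underneath the walker, which is exactly why the symptoms look like "wacky trees".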
We have had the problem of specifying the vocabulary via tokdefs as
mentioned above, and getting number-mismatches as a result of the parser
using one set of tokens and the tree parsers using another. As you may
imagine, we *really* don't want to have to re-build all of the 20 tree
parsers [sloooowwwww]. So we made a "master" vocabulary text file by
hand, based on the vocabulary of the lexer and parser for Language A, and
we then extended it when we knew all the tokens in language B.
Problem: if we don't want to do the normal "chaining" style of token
definitions, where each successive tree parser tokdefs the vocab of its
predecessor, then we need a master vocabulary that has to respect what the
lexer and parser define. If we even add one rule to either the lexer or
the grammar, it involves painfully updating all the numbers in the master
vocabulary file.
Suggestion: allow multiple files to be tokdef'ed at once (checking for
number clashes). This would let tree-walkers tokdef in the parser vocab
and separately maintain a file for the tokens of language B. If we used
really high numbers for the Language B vocab, we would not clash with any
of the numbers in the Language A parser vocab. Then we could change
Language A's grammar at will, and also incrementally grow the Language B
vocab at will as we invent the structure of the target trees.
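A sketch of the kind of merge-with-clash-check being suggested; the NAME=NUMBER file format and the function names here are made up for illustration, not ANTLR's actual tokdef syntax:

```python
# Merge several token-vocabulary files, refusing any clash where the same
# number is claimed by two different names (or one name by two numbers).
def parse_vocab(text):
    vocab = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        name, num = line.split("=")
        vocab[name.strip()] = int(num)
    return vocab

def merge_vocabs(*vocabs):
    merged = {}
    used_numbers = {}
    for vocab in vocabs:
        for name, num in vocab.items():
            if name in merged and merged[name] != num:
                raise ValueError(f"name clash: {name}")
            if num in used_numbers and used_numbers[num] != name:
                raise ValueError(f"number clash: {num} used by "
                                 f"{used_numbers[num]} and {name}")
            merged[name] = num
            used_numbers[num] = name
    return merged

# Language A's parser vocab, plus a Language B vocab kept at high numbers
# so the two can never clash even as A's grammar grows.
lang_a = parse_vocab("INT=4\nSTRING=5\nID=6")
lang_b = parse_vocab("B_CALL=1000\nB_ASSIGN=1001")
master = merge_vocabs(lang_a, lang_b)
```

With Language B parked at 1000 and up, regenerating Language A's vocab changes only the low numbers and the merge keeps succeeding.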
-----Original Message-----
From: Steven Atkinson <atkinson@...>
To: firstname.lastname@example.org <email@example.com>
Date: Thursday, October 01, 1998 10:27 AM
Subject: [antlr-interest] The vocabulary problem
>From: Steven Atkinson <atkinson@...>
>Suggestion: allow multiple files to be tokdef'ed at once (checking for
>number clashes). This would let tree-walkers tokdef in the parser vocab
>and separately maintain a file for the tokens of language B. If we used
>really high numbers for the Language B vocab, we would not clash with any
>of the numbers in the Language A parser vocab. Then we could change
>Language A's grammar at will, and also incrementally grow the Language B
>vocab at will as we invent the structure of the target trees.
The technical problems are with the "wildcard" and "not" operators and how
they are translated into lookahead tests, which depends on how ranges
and sets of tokens are represented internally. Gaps in the token sequence
might allow invalid tokens to match a wildcard, but that's not likely to
happen in practice, I guess; who would be generating invalid tokens? You
would still have to recompile every module that used a tokdef file that
changed...
But I see what you're getting at. It might be appropriate to have a debug
generation mode that wouldn't let any "not" operators into a case statement
or bitset; they would have to be done using if statements and !=.
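For illustration, here is a sketch of how the translation choice changes what a ~INT test accepts; the token numbers, the gap, and the three translation forms are assumptions made up for this example, not ANTLR's actual generated code:

```python
# ~INT translated three ways, assuming INT=4, STRING=5, ID=7, with a
# deliberate gap at 6 (all numbers are illustrative).
INT, STRING, ID = 4, 5, 7
VALID = {INT, STRING, ID}

def not_int_range(ttype):
    # Range form: cheap, but the gap number 6 slips through as "valid".
    return INT <= ttype <= ID and ttype != INT

def not_int_set(ttype):
    # Explicit set/bitset form: the gap number 6 is rejected.
    return ttype in (VALID - {INT})

def not_int_inequality(ttype):
    # Bare != form: matches anything that is not INT, valid or not.
    return ttype != INT

print(not_int_range(6), not_int_set(6), not_int_inequality(6))
```

The three forms agree on every valid token and disagree only on numbers outside the vocabulary, which is why the choice only matters when stale or invalid token numbers are floating around.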
I guess if I were faced with this problem, I would add 20 or so tokens at
the end of the master vocabulary named UNUSED01-20. Then as I needed a new
token I would replace UNUSEDxx with the token name I want to use. I only
have to regenerate the files that use the new token. Not and wildcard
operators would still work; they don't care about the symbolic name, only
whether a token with that number exists or not.
Only when I had used up the last one and needed more tokens would everything
have to be regenerated and recompiled to accommodate the new token numbers.
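The spare-token trick can be sketched as follows (the helper names and the in-memory vocabulary format are invented for illustration):

```python
# Reserve spare numbers up front, then claim one by renaming an UNUSEDxx
# slot; every existing number stays stable, so only files that mention
# the new name need regenerating.
def make_master(real_tokens, spares=20):
    vocab = dict(real_tokens)
    next_num = max(vocab.values()) + 1
    for i in range(1, spares + 1):
        vocab[f"UNUSED{i:02d}"] = next_num
        next_num += 1
    return vocab

def claim_spare(vocab, new_name):
    # Rename the first remaining UNUSEDxx slot, keeping its number.
    for name in sorted(vocab):
        if name.startswith("UNUSED"):
            vocab[new_name] = vocab.pop(name)
            return vocab[new_name]
    raise RuntimeError("spares exhausted: renumber and rebuild everything")

master = make_master({"INT": 4, "STRING": 5, "ID": 6})
num = claim_spare(master, "B_CALL")   # takes UNUSED01's number, 7
```

The RuntimeError branch is exactly the "used up the last one" case above, where the full regenerate-and-recompile cycle becomes unavoidable.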