The vocabulary problem

  • Steven Atkinson
    Message 1 of 2, Oct 1, 1998
      > When you get into multiple tree parsers and start using the tokdef option,
      > you will undoubtedly change your grammar which will change the token types
      > file. Then if you don't rebuild all of your tree parsers you will have
      > errors where the token type doesn't match what the parser had. All of a
      > sudden the INT from the parser is now the STRING in the tree walker, etc.
      > It may look like your parser is building wacky trees, but it's just that the
      > token types no longer match. Rebuild all your tree parsers and you'll be
      > ok.
      >
      > Monty

      OK, I have been waiting for this issue to crop up so I could ask about a
      solution. A team of people here have written a translator, and in the
      process have ~20 tree walker files, mostly to walk small constructs in
      Language A and transform them into small trees in Language B, using
      symbol-tables and other structures.

      We have had the problem of specifying the vocabulary via tokdefs as
      mentioned above, and getting number-mismatches as a result of the parser
      using one set of tokens and the tree parsers using another. As you may
      imagine, we *really* don't want to have to re-build all of the 20 tree
      parsers [sloooowwwww]. So we made a "master" vocabulary text file by
      hand, based on the vocabulary of the lexer and parser for Language A, and
      we then extended it when we knew all the tokens in language B.

      Problem: if we don't want to do the normal "chaining" style of token
      definitions, where each successive tree parser tokdefs the vocab of its
      predecessor, then we need a master vocabulary that has to respect what the
      lexer and parser define. If we even add one rule to either the lexer or
      the grammar, it involves painfully updating all the numbers in the master
      vocabulary file.
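For readers who haven't seen one: a token types file is essentially just symbolic names mapped to integers, along the lines of the fragment below (names and numbers are illustrative, not from the actual grammar under discussion). Inserting one token in the middle renumbers everything after it, which is exactly the maintenance pain described above.

```
// hand-maintained master vocabulary (illustrative)
ID=4
INT=5
STRING=6
LPAREN=7
RPAREN=8
// adding a new token after INT would shift STRING, LPAREN, RPAREN, ...
```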

      Suggestion: allow multiple files to be tokdef'ed at once (checking for
      number clashes). This would let tree-walkers tokdef in the parser vocab
      and separately maintain a file for the tokens of language B. If we used
      really high numbers for the Language B vocab, we would not clash with any
      of the numbers in the Language A parser vocab. Then we could change
      Language A's grammar at will, and also incrementally grow the Language B
      vocab at will as we invent the structure of the target trees.
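A minimal sketch of what "multiple tokdefs with clash checking" might look like, in Python. The file format and function names here are hypothetical illustrations of the suggestion, not part of ANTLR:

```python
def parse_tokdef(text):
    """Parse NAME=number lines from a (hypothetical) tokdef file."""
    vocab = {}
    for line in text.splitlines():
        line = line.split("//")[0].strip()  # drop comments and blanks
        if not line:
            continue
        name, _, num = line.rpartition("=")
        vocab[name] = int(num)
    return vocab

def merge_tokdefs(*vocabs):
    """Merge several vocabularies, refusing any name or number clash."""
    merged = {}   # name -> number
    used = {}     # number -> name
    for vocab in vocabs:
        for name, num in vocab.items():
            if name in merged and merged[name] != num:
                raise ValueError(f"name clash: {name}")
            if num in used and used[num] != name:
                raise ValueError(f"number clash: {num} ({used[num]} vs {name})")
            merged[name] = num
            used[num] = name
    return merged

# Language A vocab uses low numbers; Language B starts high, so the
# grammars can evolve independently without colliding.
lang_a = parse_tokdef("ID=4\nINT=5\nSTRING=6")
lang_b = parse_tokdef("TREE_FUNC=1000\nTREE_VAR=1001")
merged = merge_tokdefs(lang_a, lang_b)
```

With Language B's numbers starting well above anything Language A will ever reach, regenerating the Language A parser never forces the Language B vocab to be renumbered.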


      Cheers,
      Steve
    • Monty Zukowski
      Message 2 of 2, Oct 1, 1998
        -----Original Message-----
        From: Steven Atkinson <atkinson@...>
        To: antlr-interest@onelist.com <antlr-interest@onelist.com>
        Date: Thursday, October 01, 1998 10:27 AM
        Subject: [antlr-interest] The vocabulary problem


        >From: Steven Atkinson <atkinson@...>
        >
        >Suggestion: allow multiple files to be tokdef'ed at once (checking for
        >number clashes). This would let tree-walkers tokdef in the parser vocab
        >and separately maintain a file for the tokens of language B. If we used
        >really high numbers for the Language B vocab, we would not clash with any
        >of the numbers in the Language A parser vocab. Then we could change
        >Language A's grammar at will, and also incrementally grow the Language B
        >vocab at will as we invent the structure of the target trees.


        The technical problems are with the "wildcard" and "not" operators and
        how they are translated into lookahead tests, which depends on how
        ranges and sets of tokens are represented internally. Gaps in the token
        sequence might let invalid tokens match a wildcard, but I guess that's
        not likely to happen in practice; who would be generating invalid
        tokens? You would still have to recompile every module that used a
        tokdef file that changed...

        But I see what you're getting at. It might be appropriate to have a
        debug generation mode that wouldn't let any "not" operators into a case
        statement or bitset; they would have to be done using if statements
        and !=.
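To illustrate the gap problem with a toy Python model (this is not ANTLR's actual generated code, and the token numbers are made up): a "not" compiled down to a range test can accept a token number that falls in a gap of the vocabulary, while the explicit-comparison form the debug mode would emit cannot.

```python
# Known token types, with a deliberate gap at 6 (illustrative numbers).
VALID = {4: "ID", 5: "INT", 7: "STRING", 8: "LPAREN"}

def not_id_as_range(t):
    # How a generator might compile ~ID: one range test plus an exclusion.
    return 4 <= t <= 8 and t != 4

def not_id_as_compares(t):
    # The "debug mode" form: explicit tests against known tokens only.
    return t in VALID and t != 4

assert not_id_as_range(6)           # invalid gap token slips through
assert not not_id_as_compares(6)    # explicit form rejects it
```

Both forms agree on every valid token; they differ only on numbers that no lexer should ever produce, which is why the range form is normally a safe optimization.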

        I guess if I were faced with this problem, I would add 20 or so tokens
        at the end of the master vocabulary, named UNUSED01-20. Then as I
        needed a new token I would replace UNUSEDxx with the token name I
        wanted to use, and I would only have to regenerate the files that use
        the new token. Not and wildcard operators would still work; they don't
        care about the symbolic name, just whether a token with that number
        exists.

        Only when I used up the last one and needed more tokens would
        everything have to be regenerated and recompiled to accommodate the
        new token numbers.
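Concretely, the padded master vocabulary might end like this (fragment and numbers are illustrative):

```
STRING=24
RBRACE=25
UNUSED01=26
UNUSED02=27
// ...
UNUSED20=45
```

Renaming UNUSED01 to, say, TREE_FUNC keeps every other number stable, so only the grammars that actually mention TREE_FUNC need regenerating.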

        Monty