Re: Lexer - length/position as token delimiter? (message 11952)

  • angrymongoose
    May 1, 2004
      Hello Mark,

      The fields making up a tag are defined in the grammar, so I am following your suggestion
      and having some of the "lexical analysis" performed by the parser.

      I implemented a subset of the grammar to parse one message as a proof of concept,
      and I am pretty happy with the results. However, because the parser is doing much of the
      work that would ideally be done by the lexical analyzer, we are concerned about the
      performance overhead.

      I will complete the grammar for our sample message type and run a batch of messages
      through it to get an idea of the performance.

      Thanks for your help,

      Norman


      --- In antlr-interest@yahoogroups.com, Mark Lentczner <markl@g...> wrote:
      > As is often the case, the problems are with your grammar, not the
      > ability to lex or parse it.
      >
      > > :23B:CRED
      > > :32A:000612USD5443,99
      > > :33B:USD5443,99
      >
      > Does the grammar know from the tag what the format of the tag body
      > should be? Or can any tag have any tag_body format? If the latter is
      > the case, then the grammar is almost certainly inherently ambiguous and
      > you won't be able to get far. (Unless the tag_body formats are far
      > more restricted than I'm guessing from your example.)
      >
      > Here's an example:
      >
      > :33X:12040678,99
      >
      > Unless the grammar says something about tag "33X", there is no way to
      > know if this should be parsed as:
      > 1) a date, "120406" and an amount "78,99"
      > or 2) an amount "12040678,99"
      >
      > Assuming there is a way to know from the tag what to expect from the
      > tag_body, then I'd approach this by putting most of the work in the
      > parser, not the lexer.
      >
      > In the lexer I'd have:
      >
      > class ScriptLexer extends Lexer;
      > options { testLiterals = false; }
      >
      > TAG options{testLiterals=true;}: ':' DIGIT DIGIT LETTER ':';
      > DIGIT: '0'..'9';
      > COMMA: ',';
      > LETTER: 'A'..'Z';
      >
      > In the parser I'd define rules for each tag_body format:
      >
      > transaction: (LETTER)+;
      > date: DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;
      > currency: LETTER LETTER LETTER;
      > value: (DIGIT)+ (COMMA (DIGIT)+)?;
      > amount: currency value;
      > dated_amount: date amount;
      >
      > Then I'd write the rest of the parser like:
      >
      > message : headers (line)+ trailer;
      > line : (
      > ":23B:" transaction
      > | ":32A:" dated_amount
      > | ":33B:" amount
      > );
      >
      > Notice the trick of allowing the literal test in the TAG rule, and then
      > using all the tag names as literals in the parser.
      >
      > - Mark
      >
      > Mark Lentczner
      > markl@w...
      > http://www.wheatfarm.org/
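
The idea in the quoted message, letting the tag decide which body format to expect, can be illustrated outside ANTLR with a short Python sketch. This is not the grammar discussed in the thread: the regular expressions, the `BODY_FORMATS` table, and the `parse_line` and `readings` helpers are hypothetical names, and the table covers only the three sample tags. It shows why an unknown tag such as "33X" leaves the body ambiguous, while a known tag resolves it.

```python
import re

# Body-format fragments loosely mirroring the parser rules in the quoted
# message (date, currency, value). These patterns are assumptions.
DATE = r"(?P<date>\d{6})"
CURRENCY = r"(?P<currency>[A-Z]{3})"
VALUE = r"(?P<value>\d+(?:,\d+)?)"

# Hypothetical table: which body format each known tag expects.
BODY_FORMATS = {
    "23B": re.compile(r"(?P<transaction>[A-Z]+)"),  # transaction
    "32A": re.compile(DATE + CURRENCY + VALUE),     # dated_amount
    "33B": re.compile(CURRENCY + VALUE),            # amount
}

# A tagged line: ':' DIGIT DIGIT LETTER ':' followed by the body.
LINE = re.compile(r":(?P<tag>\d\d[A-Z]):(?P<body>.*)")


def parse_line(line):
    """Split a line into tag and body, then parse the body using the
    format selected by the tag."""
    m = LINE.fullmatch(line)
    if m is None:
        raise ValueError(f"not a tagged line: {line!r}")
    tag, body = m.group("tag"), m.group("body")
    fmt = BODY_FORMATS.get(tag)
    if fmt is None:
        raise ValueError(f"no body format known for tag {tag!r}")
    fields = fmt.fullmatch(body)
    if fields is None:
        raise ValueError(f"body {body!r} does not fit the format for {tag!r}")
    return tag, fields.groupdict()


def readings(body):
    """Return every way the body could be read when the tag says nothing
    about the format: as date + value, or as a single value."""
    found = []
    for name, pattern in (("date+value", DATE + VALUE), ("value", VALUE)):
        m = re.fullmatch(pattern, body)
        if m is not None:
            found.append((name, m.groupdict()))
    return found


if __name__ == "__main__":
    for sample in (":23B:CRED", ":32A:000612USD5443,99", ":33B:USD5443,99"):
        print(parse_line(sample))
    # Without a format for tag 33X, both readings of the body are valid.
    print(readings("12040678,99"))
```

With the tag table in place, each sample line has exactly one parse. Calling `readings("12040678,99")` returns both interpretations from the "33X" example, which is the ambiguity the quoted message warns about.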