
Re: Lexer - length/position as token delimiter?

  • angrymongoose
    Message 1 of 5, May 1, 2004
      Hello Mark,

      The fields making up a tag are defined in the grammar so I am following your suggestion
      and having some of the `lexical analysis' performed by the parser.

      I implemented a subset of the grammar in order to parse one message as proof of concept
      and I am pretty happy with the results. However, because the parser is doing a lot of the
      work, which ideally would be done by the lexical analyzer, we are concerned about
      performance overhead.

      I will complete the grammar for our sample message type and run a batch of messages
      through it to get an idea of the performance.

      Thanks for your help,

      Norman


      --- In antlr-interest@yahoogroups.com, Mark Lentczner <markl@g...> wrote:
      > As often is the case, the problems are with your grammar, not the
      > ability to lex or parse it.
      >
      > > :23B:CRED
      > > :32A:000612USD5443,99
      > > :33B:USD5443,99
      >
      > Does the grammar know from the tag what the format of the tag body
      > should be? Or can any tag have any tag_body format? If the latter is
      > the case, then the grammar is almost certainly inherently ambiguous and
      > you won't be able to get far. (Unless the tag_body formats are far
      > more restricted than I'm guessing from your example.)
      >
      > Here's an example:
      >
      > :33X:12040678,99
      >
      > Unless the grammar says something about tag "33X", there is no way to
      > know if this should be parsed as:
      > 1) a date, "120406" and an amount "78,99"
      > or 2) an amount "12040678,99"
      >
      > Assuming there is a way to know from the tag what to expect from the
      > tag_body, then I'd approach this by putting most of the work in the
      > parser, not the lexer.
      >
      > In the lexer I'd have:
      >
      > class ScriptLexer extends Lexer;
      > options { testLiterals = false; }
      >
      > TAG options{testLiterals=true;}: ':' DIGIT DIGIT LETTER ':';
      > DIGIT: '0'..'9';
      > COMMA: ',';
      > LETTER: 'A'..'Z';
      >
      > In the parser I'd define rules for each tag_body format:
      >
      > transaction: (LETTER)+;
      > date: DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;
      > currency: LETTER LETTER LETTER;
      > value: (DIGIT)+ (COMMA (DIGIT)+)?;
      > amount: currency value;
      > dated_amount: date amount;
      >
      > Then I'd run the rest of the parser like:
      >
      > message : headers (line)+ trailer;
      > line : (
      > ":23B:" transaction
      > | ":32A:" dated_amount
      > | ":33B:" amount
      > );
      >
      > Notice the trick of allowing the literal test in the TAG rule, and then
      > using all the tag names as literals in the parser.
      >
      > - Mark
      >
      > Mark Lentczner
      > markl@w...
      > http://www.wheatfarm.org/
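
The tag-driven idea in the quoted message can be illustrated outside ANTLR. The following is only a rough Python sketch, not the grammar discussed in the thread: the `FORMATS` table, the `parse_line` function, and the regular expressions are assumptions made for illustration, covering just the three tags shown in the example. It shows how knowing the tag removes the date-versus-amount ambiguity, and how an unknown tag such as "33X" is rejected rather than guessed at.

```python
import re

# Assumed body format for a value: digits with an optional comma fraction.
VALUE = r"\d+(?:,\d+)?"

# Hypothetical tag-to-format table, mirroring the three tags in the quoted grammar.
FORMATS = {
    "23B": re.compile(r"(?P<transaction>[A-Z]+)"),
    "32A": re.compile(rf"(?P<date>\d{{6}})(?P<currency>[A-Z]{{3}})(?P<value>{VALUE})"),
    "33B": re.compile(rf"(?P<currency>[A-Z]{{3}})(?P<value>{VALUE})"),
}

# A line is ':' DIGIT DIGIT LETTER ':' followed by the tag body.
LINE = re.compile(r":(?P<tag>\d\d[A-Z]):(?P<body>.*)")


def parse_line(line):
    """Split a line into its tag and the named fields of its body."""
    m = LINE.fullmatch(line)
    if m is None:
        raise ValueError(f"not a tagged line: {line!r}")
    tag = m.group("tag")
    fmt = FORMATS.get(tag)
    if fmt is None:
        # Without a known format the body cannot be interpreted unambiguously.
        raise ValueError(f"no format known for tag {tag!r}")
    body = fmt.fullmatch(m.group("body"))
    if body is None:
        raise ValueError(f"body of tag {tag!r} does not match its format")
    return tag, body.groupdict()
```

For example, `parse_line(":32A:000612USD5443,99")` yields the date, currency, and value as separate fields, while `parse_line(":33X:12040678,99")` raises `ValueError` because nothing says how a "33X" body is laid out.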
    • Mark Lentczner
      Message 2 of 5, May 1, 2004
        > However, because the parser is doing a lot of the work, which ideally
        > would be done by the lexical analyzer, we are concerned about
        > performance overhead.
        Never prematurely optimize, I always say. If your examples are at all
        indicative, I'd be surprised if there were any significant timing
        differences between the approaches.

        Unless, of course, you need to process some huge number of records very
        quickly (relative to the target hardware), in which case you may need
        to make more drastic changes (Java -> C++ if you haven't already, or
        getting rid of Antlr and using a hand built lexer/parser pair.)

        > I will complete the grammar for our sample message type and run a
        > batch of messages through it to get an idea of the performance.
        That's the way to go. If the timing doesn't meet some objective measure
        of "fast enough" (which was determined **before** you ran the tests,
        yes?), then be sure to use a performance tool to see where the
        bottleneck is. I wouldn't just assume that it is in the parser and
        that a more complicated scheme that moved the rules into the lexer
        would speed things up...

        - Mark

        Mark Lentczner
        markl@...
        http://www.wheatfarm.org/
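
The measure-first advice above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not anything from the thread: `parse_message` is a hypothetical stand-in for the real parser entry point, and `TARGET_PER_SECOND` is a made-up threshold that would be fixed before running the tests. `measure` times a batch, and `profile` uses the standard `cProfile` module to show where the time actually goes.

```python
import cProfile
import io
import pstats
import time

# Hypothetical target, decided before any measurements are taken.
TARGET_PER_SECOND = 1000


def parse_message(text):
    """Stand-in for the real parser: split each ':TAG:body' line into [tag, body]."""
    return [line.split(":", 2)[1:] for line in text.splitlines() if line.startswith(":")]


def measure(parse, messages):
    """Time one pass over the batch and report throughput."""
    start = time.perf_counter()
    for message in messages:
        parse(message)
    elapsed = time.perf_counter() - start
    rate = len(messages) / elapsed if elapsed > 0 else float("inf")
    return {"messages": len(messages), "seconds": elapsed, "per_second": rate}


def profile(parse, messages, top=5):
    """Return a cProfile report of the most expensive calls by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for message in messages:
        parse(message)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    sample = ":23B:CRED\n:32A:000612USD5443,99\n:33B:USD5443,99"
    batch = [sample] * 1000
    result = measure(parse_message, batch)
    if result["per_second"] < TARGET_PER_SECOND:
        # Only look for the bottleneck once the target is actually missed.
        print(profile(parse_message, batch))
```

The point of keeping `measure` and `profile` separate is that profiling adds overhead of its own, so the throughput figure should come from the unprofiled run.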