RE: [antlr-interest] Lexing problem

  • mzukowski@yci.com
    Message 1 of 6 , Jun 5 8:29 AM
      Try this:

      STRING: '"' ~('"' | '#') CODESCAPE | '"';
      CODESCAPE: '#' ~('"' | '#') STRING | '#';

      You might need to alter it to handle escape characters if it has them like
      C's \"

      Monty
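
The remark about escape characters can be illustrated with a small hand-written scanner. This is only a sketch in Python, not ANTLR output: the function name `match_escaped_string` and the choice of backslash as the escape character are assumptions for illustration (the thread does not say which escape convention ColdFusion uses), and the hash-expression handling is left out to keep the focus on the escape step.

```python
def match_escaped_string(text: str, pos: int = 0, escape: str = "\\") -> int:
    """Scan a double-quoted string starting at text[pos].

    Returns the index just past the closing quote. A character that
    follows the (assumed) escape character is skipped, so an escaped
    quote does not terminate the string.
    """
    if text[pos:pos + 1] != '"':
        raise ValueError(f"expected a double quote at index {pos}")
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == escape:
            if pos + 1 >= len(text):
                raise ValueError("dangling escape at end of input")
            pos += 2          # skip the escape and the escaped character
        elif ch == '"':
            return pos + 1    # unescaped quote closes the string
        else:
            pos += 1
    raise ValueError("unterminated string")
```

In grammar terms this corresponds to adding an escape alternative ahead of the "any other character" alternative inside the string rule, so that the escaped quote is consumed before the closing-quote check can see it.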

      -----Original Message-----
      From: Jim Irwin [mailto:jimirwin@...]
      Sent: Wednesday, June 04, 2003 4:58 PM
      To: antlr-interest@yahoogroups.com
      Subject: [antlr-interest] Lexing problem


      Hi, I'm new to Antlr, and I have a problem for which I would welcome
      suggestions. I'm trying to parse ColdFusion code, and the language
      allows strings to contain expressions. The syntax is roughly the
      following: varname = "... #expression_1# ..." where the hash marks
      enclose a ColdFusion expression that is evaluated and substituted
      into the string at runtime.

      The real problem is that the embedded expression is itself allowed
      to contain strings, so that a single source-code string may look
      like the following:

      "...#iif("a" gt "#b#", "cat", "dog")#..."

      My problem is that I cannot think of a way to define a lexical rule
      that would recognize such a complex string. In principle, the
      string should be parsed. I can conceive of the lexer returning a
      token representing the entire string to the parser, and the parser
      then recursively lexing and parsing the string value until there are
      no more embedded hash-expressions.

      I have no clue as to how I should proceed. In order to lex the
      string, I seem to need a specialized routine that looks ahead, keeps
      track of nested expressions and their strings, and terminates only
      when the matching end quote outside of all expressions is
      encountered.

      Any suggestions?




      Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
    • mzukowski@yci.com
      Message 2 of 6 , Jun 5 8:33 AM
        I forgot the closure. Make it:

        STRING: '"' (~('"' | '#'))* CODESCAPE | '"';
        CODESCAPE: '#' (~('"' | '#'))* STRING | '#';

        Monty

      • Jim Irwin
        Message 3 of 6 , Jun 5 10:15 AM
          --- In antlr-interest@yahoogroups.com, mzukowski@y... wrote:
          >
          > STRING: '"' (~('"' | '#'))* CODESCAPE | '"';
          > CODESCAPE: '#' (~('"' | '#'))* STRING | '#';
          >
          > Monty

          Thanks, it seems to work quite well. I changed the rules slightly to
          STRING: '"' (~('"' | '#') | CODESCAPE)* '"';
          CODESCAPE: '#' (~('"' | '#') | STRING)* '#';
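
To see why the two mutually recursive rules handle the nested example from the original question, here is a hand-written recursive matcher that mirrors them. It is a sketch in Python rather than ANTLR-generated code; the function names `match_string` and `match_codescape` are made up for illustration, and escape characters are deliberately not handled.

```python
def match_string(text: str, pos: int = 0) -> int:
    """STRING: '"' ( ~('"' | '#') | CODESCAPE )* '"'

    Returns the index just past the closing quote.
    """
    if text[pos:pos + 1] != '"':
        raise ValueError(f"expected a double quote at index {pos}")
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == '"':                  # closing quote of this STRING
            return pos + 1
        if ch == '#':                  # embedded hash expression
            pos = match_codescape(text, pos)
        else:                          # any other character
            pos += 1
    raise ValueError("unterminated string")


def match_codescape(text: str, pos: int) -> int:
    """CODESCAPE: '#' ( ~('"' | '#') | STRING )* '#'

    Returns the index just past the closing hash mark.
    """
    if text[pos:pos + 1] != '#':
        raise ValueError(f"expected a hash mark at index {pos}")
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == '#':                  # closing hash of this expression
            return pos + 1
        if ch == '"':                  # string nested inside the expression
            pos = match_string(text, pos)
        else:                          # any other character
            pos += 1
    raise ValueError("unterminated hash expression")


# The example from the original question, embedded in an assignment.
source = 'varname = "...#iif("a" gt "#b#", "cat", "dog")#..."'
start = source.index('"')
end = match_string(source, start)
print(source[start:end])
```

Each rule treats the other rule's opening delimiter as a recursive call and its own delimiter as the terminator, so the matcher only stops at the quote that lies outside all hash expressions. On the sample input the matched span covers the whole string literal.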
        • mzukowski@yci.com
          Message 4 of 6 , Jun 5 10:20 AM
            Ah, yes. Good catch.

            By the way, is this an open source grammar you're working on? Just curious.

            Monty

          • Jim Irwin
            Message 5 of 6 , Jun 6 5:18 AM
              --- In antlr-interest@yahoogroups.com, mzukowski@y... wrote:
              > By the way, is this an open source grammar you're working on?
              > Just curious.

              If I arrive at a moderately useful parser or set of parsers, I would
              consider making it open source. The question becomes: would enough
              people be interested in it to make it worth the effort to publish it?

              My goal is to do some automated code metrics on ColdFusion code.
              That means being able to recognize the use of session, request and
              client scope variables, measuring the level of indirection in
              controlling the flow of execution, measuring the degree of coupling
              between templates, and other measures of complexity. Because of the
              nature of the language, it means parsing ColdFusion tags, ColdFusion
              script, HTML tags, and JavaScript all entangled in the source files.

              There is also a methodology for coding applications called FuseBox
              (somewhat analogous to Java Struts) that, if done badly, results in
              horribly obfuscated code. Unfortunately, from what I've seen so far
              in our company, it seems to be done badly more often than not. I'm
              hoping to come up with a code analyzer that can point out the
              features that result in hard-to-read and hard-to-maintain code.