Re: [antlr-interest] proposal for 2.7.4: charVocabulary defaults to ascii 1..127

  • Terence Parr
    Message 1 of 26 , May 2, 2004
      On May 2, 2004, at 8:17 AM, Ric Klaren wrote:

      > On Sat, May 01, 2004 at 11:42:40AM -0700, Terence Parr wrote:
      >> Anybody object? I'm seeing this issue come up too many times. So, if
      >> you don't specify, then charVocabulary is set for you to ascii.
      >
      > ASCII or the range 3-254 (extended ascii was it?) I'm not sure how
      > many of
      > the reserved values 0-3 are used still in backends and/or the
      > analyzer....
      > Making unicode default I dunno.. personally I would not do that. It
      > increases
      > the default lexer size (not sure how much it blows up though).

      Yeah, I'm thinking LATIN (0..254) would be the right approach to start
      with (start small as they say). I will push out 2.7.4 over the next
      day or two. Sounds like I need to really think about UNICODE. Can
      easily be added gradually with some point releases.

      Ter
    • Oliver Zeigermann
      Message 2 of 26 , May 2, 2004
        I know this is leading astray. So this will be my last post on this matter.

        Mike Lischke wrote:

        >>Now you seem to mix something up. Both UTF-16 and UTF-32 are
        >>character encodings as well, just as UTF-8. All of them are
        >>converted to characters before parsing.
        >
        >
        > Sure, but how is the internal representation? Actually, it is UTF-16. So although it is a transformation format it is
        > also the actual character representation. Hence UTF-16 (as well as UTF-32) can be processed directly. UTF-8 has to be
        > converted first to one of these formats (usually, at least). This is what I meant.

        What the internal representation is, you simply do not know and there is
        also no need to know. Certainly, it is not UTF-16 as it only allows for
        64K characters which is far too little.

        Oliver
      • Brian Smith
        Message 3 of 26 , May 2, 2004
          Oliver Zeigermann wrote:

          >Mike Lischke wrote:
          >
          >
          >
          >>>Now you seem to mix something up. Both UTF-16 and UTF-32 are
          >>>character encodings as well, just as UTF-8. All of them are
          >>>converted to characters before parsing.
          >>>
          >>>
          >>Sure, but how is the internal representation? Actually, it is UTF-16. So although it is a transformation format it is
          >>also the actual character representation. Hence UTF-16 (as well as UTF-32) can be processed directly. UTF-8 has to be
          >>converted first to one of these formats (usually, at least). This is what I meant.
          >>
          >>
          >
          >What the internal representation is, you simply do not know and there is
          >also no need to know. Certainly, it is not UTF-16 as it only allows for
          >64K characters which is far to little.
          >
          >
          >
          Oliver,

          In ANTLR for Java, you do know the representation and for some
          applications it is important. It is a 16-bit integer described by the
          'char' type. For JRE 1.2-1.4, 'char' is a 16-bit Unicode code point
          (Unicode 1.x - 3.x depending on the JRE version). In JRE 1.5, 'char' is
          redefined to be a 16-bit Unicode 4.0 code unit, which may represent
          either a whole character (code point), or a partial character that needs
          to be combined with an adjacent one according to the UTF-16
          transformation rules. See http://weblogs.java.net/pub/wlg/1202 and the
          documents it references.
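
The code unit versus code point distinction can be seen in a few lines. This is only a minimal sketch: the class name CodeUnitDemo and its helper methods are made up for illustration, and it assumes the code point methods that arrived with Java 1.5 (String.codePointCount, Character.toChars), so it will not compile on JRE 1.4.

```java
final class CodeUnitDemo {
    // Number of UTF-16 code units, which is what 'char' indexing sees.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts as one character.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef));                            // 2
        System.out.println(codePoints(clef));                           // 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));  // true
        System.out.println(Integer.toHexString(clef.codePointAt(0)));   // 1d11e
    }
}
```

A lexer that matches one 'char' at a time would see the two halves of the pair as two separate characters, which is the problem described above.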

          IMO, in order to fully support Unicode 4.0, ANTLR (for Java) would need
          to replace all usages of 'char' with 'java.lang.String' or 'int'.

          - Brian
        • Mark Lentczner
          Message 4 of 26 , May 2, 2004
            Here is my take on Unicode and Antlr. I realize that parts of this
            have already been stated by other people on this list. I thought it
            would be good to pull together all those ideas and present an approach
            as a cohesive, if way too long, proposal.

            0) Philosophy
            -------------
            There are two clear separations that should guide this design: First,
            character set and character encoding are distinct concepts that must be
            cleanly handled throughout. Second, the semantics of Antlr shouldn't
            depend on the implementation of Antlr. This is especially true since
            Antlr is partially re-implemented for different target languages (Java,
            C++, C# etc...)

            1) Structure
            ------------
            I think a good case can be made for considering all parsing activity in
            Antlr to be in Unicode. A lexer parses streams of characters into
            tokens. The grammar is described in terms of characters, not encoded
            bytes. (C++ is still C++ even if encoded in EBCDIC). Since Unicode
            encompasses virtually all known characters, defining the characters
            that Antlr lexers read as Unicode covers all bases. (See notes below
            on binary.)

            Handling different character encodings can be left completely to the
            input stream class. If a grammar is to only be applied to US-ASCII or
            ISO-8859-3 characters, then the input stream can be limited to that,
            and map them into Unicode presented to the generated lexer - there is
            no need to make that distinction in the lexer grammar file. On the
            other hand, by specifying the grammar over Unicode, then by simply
            changing the input stream, one can lex the same grammar over US-ASCII,
            ISO-8859-3, UTF-8, or Shift-JIS, etc.
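
As a rough illustration of leaving the encoding to the input side, the sketch below decodes the same character from two different byte encodings into one Unicode code point. The class DecodingInput and its method are hypothetical and are not part of Antlr; it leans on java.nio.charset.StandardCharsets and CharSequence.codePoints(), which are much newer than the JREs discussed in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodingInput {
    // Decodes raw bytes with the given encoding and returns Unicode code points.
    // A lexer reading this array never needs to know which encoding was used.
    static int[] codePoints(byte[] bytes, Charset encoding) {
        return new String(bytes, encoding).codePoints().toArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };              // U+00E9 in ISO-8859-1
        byte[] utf8   = { (byte) 0xC3, (byte) 0xA9 }; // U+00E9 in UTF-8
        int a = codePoints(latin1, StandardCharsets.ISO_8859_1)[0];
        int b = codePoints(utf8, StandardCharsets.UTF_8)[0];
        System.out.println(a == b);                   // true
        System.out.println(Integer.toHexString(a));   // e9
    }
}
```

Swapping the Charset argument is the only change needed to read a different encoding; the code that consumes the code points stays the same.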

            2) Antlr Features
            -----------------
            The only semantic aspect of Antlr that actually depends on
            charVocabulary is the concept of complement (element and set). What
            started this thread was Terence's observation that it is a constant
            source of pitfalls: Currently inversion means "of all the characters
            used in the grammar, not these". Which means that if my grammar only
            mentions 'A'..'Z', and '0'..'9', then "~('0'..'9')" only means
            'A'..'Z'. What most people expect is that "~('0'..'9')" should mean
            ANY character in the input stream except '0'..'9'. Rather than fix
            this by changing the default charVocabulary, a better approach is
            to directly change the meaning of complement to mean what people
            expect it to mean. (See notes below on set inversion).

            Once complement is defined this way, then the charVocabulary option can
            be removed.
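
The difference between the two readings of complement can be modelled with plain java.util.BitSet. This is a sketch of the behaviour as described above, not Antlr's actual analysis code; ComplementDemo, range and complementWithin are names invented for the example.

```java
import java.util.BitSet;

final class ComplementDemo {
    // Builds a set holding every character value in [lo, hi].
    static BitSet range(int lo, int hi) {
        BitSet s = new BitSet();
        s.set(lo, hi + 1); // BitSet.set(from, to) excludes 'to'
        return s;
    }

    // Complement relative to a vocabulary: everything in the vocabulary
    // that is not in the given set.
    static BitSet complementWithin(BitSet vocabulary, BitSet set) {
        BitSet result = (BitSet) vocabulary.clone();
        result.andNot(set);
        return result;
    }

    public static void main(String[] args) {
        BitSet digits = range('0', '9');

        // Vocabulary inferred from a grammar that only mentions 'A'..'Z' and '0'..'9'.
        BitSet inferred = range('A', 'Z');
        inferred.or(digits);
        BitSet narrow = complementWithin(inferred, digits);
        System.out.println(narrow.get('B')); // true
        System.out.println(narrow.get('b')); // false: 'b' was never in the vocabulary

        // Vocabulary taken to be all of Unicode, the reading most people expect.
        BitSet wide = complementWithin(range(0, 0x10FFFF), digits);
        System.out.println(wide.get('b'));   // true
        System.out.println(wide.get('5'));   // false
    }
}
```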

            A large range of Unicode-based built-in character classes has been
            suggested to be added. I see nothing wrong with the proposed syntaxes,
            but I question the utility of all the proposed options. I have yet to
            see a grammar that has a need to exclude particular Unicode blocks, for
            example. On the other hand, some of the Unicode character properties
            are good candidates for inclusion. I think restraint should reign
            here, and Antlr should only implement at first what people will
            actually use.

            3) Implementation
            -----------------
            Since Unicode is no longer limited to 16 bits (and hasn't been for
            quite some time), internally, Antlr should avoid the whole morass of
            surrogate pairs, and simply do all character operations with integers.
            Furthermore, this is exactly what Java 1.5 is going to do, and it is
            really the only viable option in C++ (wchar being what it is).

            In either Java, C# or C++, as implemented on most modern processors,
            there will be no performance difference manipulating 32-bit signed
            integers vs. 8-bit unsigned chars in a lexer where they are dealt with
            one at a time. Even the string operations wouldn't be seriously
            affected since most literals in a lexer tend to be short words and will
            be about as efficient as small integer array compares. This also allows
            all of Antlr's internal state values (EOF, etc.) to be disjoint from
            all characters (by using negative values).
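
To make the point about negative state values concrete, here is a small sketch of a character source that hands out int code points and uses -1 for end of input. CodePointStream and its EOF constant are hypothetical names, not Antlr classes; the only assumption is that no valid code point is negative.

```java
final class CodePointStream {
    // Hypothetical sentinel: Unicode code points run from 0 to 0x10FFFF,
    // so a negative value can never collide with a real character.
    static final int EOF = -1;

    private final int[] data;
    private int pos = 0;

    CodePointStream(String text) {
        this.data = text.codePoints().toArray();
    }

    // Returns the next code point, or EOF once the input is exhausted.
    int next() {
        return pos < data.length ? data[pos++] : EOF;
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x1D11E));
        CodePointStream in = new CodePointStream(text);
        for (int c = in.next(); c != EOF; c = in.next()) {
            System.out.println(Integer.toHexString(c)); // 61, then 1d11e
        }
    }
}
```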

            The only major stumbling block to Antlr's use of Unicode internally is
            its bit sets and the need for complement. In the generated code, the
            use of bit sets is very regular, and a slightly more powerful
            representation could easily support Unicode with complemented sets
            without them always being O(2^20) bits in size. Antlr's use of bit
            sets during the analysis and generation, however, might need some more
            sophisticated bit set class to handle things without simply resorting
            to huge bit maps. I'd be happy to lend some coding effort to make this
            work.
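
One possible shape for such a set, offered only as a sketch and not as what Antlr does or would do, is a sorted list of inclusive ranges. Complementing then costs time proportional to the number of ranges instead of the size of the character space. IntervalSet and all of its members are invented for this example, and add() assumes the caller supplies ranges in ascending, non-overlapping order.

```java
import java.util.ArrayList;
import java.util.List;

final class IntervalSet {
    static final int MAX = 0x10FFFF; // highest Unicode code point

    // Inclusive [lo, hi] pairs, kept sorted and non-overlapping.
    private final List<int[]> ranges = new ArrayList<>();

    // Appends [lo, hi]; the caller must add ranges in ascending order.
    IntervalSet add(int lo, int hi) {
        ranges.add(new int[] { lo, hi });
        return this;
    }

    boolean contains(int c) {
        for (int[] r : ranges) {
            if (c >= r[0] && c <= r[1]) return true;
        }
        return false;
    }

    int rangeCount() {
        return ranges.size();
    }

    // Returns every code point in [0, MAX] that is not in this set.
    IntervalSet complement() {
        IntervalSet out = new IntervalSet();
        int next = 0; // first code point not yet accounted for
        for (int[] r : ranges) {
            if (r[0] > next) out.add(next, r[0] - 1);
            next = r[1] + 1;
        }
        if (next <= MAX) out.add(next, MAX);
        return out;
    }

    public static void main(String[] args) {
        IntervalSet notDigit = new IntervalSet().add('0', '9').complement();
        System.out.println(notDigit.rangeCount());      // 2
        System.out.println(notDigit.contains('a'));     // true
        System.out.println(notDigit.contains('5'));     // false
        System.out.println(notDigit.contains(0x1D11E)); // true
    }
}
```

Here the complement of '0'..'9' over all of Unicode is stored as two ranges, where a plain bit map would need one bit per code point.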

            When Antlr is used to parse binary formats, there is no real harm in
            the internal Unicode interpretation. The input source would only
            happen to supply characters less than 256. That set complements would
            include characters beyond 8 bits wouldn't matter: They'd never be
            presented by the input source. The only slight trick would be in proper
            handling of 0, which isn't a valid Unicode character. But I don't
            think this would pose much of a problem.

            - Mark


            Mark Lentczner
            markl@...
            http://www.wheatfarm.org/
          • matthew ford
            Message 5 of 26 , May 3, 2004
              I agree with all of this.
              It seems a very clear set of proposals.
              matthew

            • Oliver Zeigermann
              Message 6 of 26 , May 3, 2004
                Oooops! Again I was wrong :(

                Brian, thanks for the enlightening pointers :)

                Oliver

              • Brian Smith
                Message 7 of 26 , May 3, 2004
                  This is what I prefer as well.

                  - Brian

                  >>2) Antlr Features
                  >>-----------------
                  >>The only semantic aspect of Antlr that actually depends on
                  >>charVocabulary is the concept of compliment (element and set). What
                  >>started this thread was Terrance's observation that it is a constant
                  >>source of pitfalls: Currently inversion means "of all the characters
                  >>used in the grammar, not these". Which means that if my grammar only
                  >>mentions 'A'..'Z', and '0'..'9', then "~('0'..'9')" only means
                  >>'A'..'Z'. What most people expect is that "~('0'..'9')" should mean
                  >>ANY character in the input stream except '0'..'9'. Rather than fix
                  >>this by changing the default charVocabulary, a better approach is to
                  >>just to directly change the meaning of compliment to mean what people
                  >>expect it to mean. (See notes below on set inversion).
                  >>
                  >>
                  >>
                • Anthony Youngman
                  Message 8 of 26 , May 4, 2004
                    I think ISO-8859-1 has been obsoleted, though. About 4 years ago.

                    The new character set includes the Euro symbol and is, iirc,
                    ISO-8859-15.

                    Cheers,
                    Wol
