Re: Lexing strings (ANTLR bug?)

  • parrt@xxxxx.xxxx
    Message 1 of 24 , Jan 4, 2000
      Matthew Ford writes:
      >From: "Matthew Ford" <Matthew.Ford@...>
      >
      >
      >----- Original Message -----
      >From: <parrt@...>
      >To: <antlr-interest@onelist.com>
      >Sent: Wednesday, January 05, 2000 8:15 AM
      >Subject: RE: [antlr-interest] Lexing strings (ANTLR bug?)
      >
      >
      >> From: <parrt@...>
      >>
      >> Luke Blanshard writes:
      >> >From: "Luke Blanshard" <Luke@...>
      >> >
      >> >> What does ~'a' mean? It can only be meaningful if you specify the
      >> >> vocabulary. Is it Korean and English? Kanji? French? There is no
      >> >> way to get around a charVocabulary option I'm afraid.
      >> >
      >> > Not sure I'd agree with that. To me, ~'a' means any Unicode character
      >> >other than lowercase a. That is certainly what I'm looking for when writing
      >> >string-literal lexers, for example (though usually 'a' is not the negated
      >> >character!).
      >>
      >> So...you would allow kanji characters in your French input? ;)
      >
      >Why not? I get mixed Japanese and English email.
      >I am working on a translator for a new database language and in the Japanese
      >version the database could well have French place names in the database, so
      >I need to pass through French characters in the comparison string values
      >eg. field1 = "...."

      In your case it is what you want, but in many cases French is not
      valid in your Chinese textbook.

      Ter
    • Matthew Ford
      Message 2 of 24 , Jan 4, 2000
        ----- Original Message -----
        From: <parrt@...>
        To: <antlr-interest@onelist.com>
        Sent: Wednesday, January 05, 2000 9:09 AM
        Subject: Re: [antlr-interest] Lexing strings (ANTLR bug?)


        > From: <parrt@...>
        >
        > Matthew Ford writes:
        > >From: "Matthew Ford" <Matthew.Ford@...>
        > >
        > >
        > >----- Original Message -----
        > >From: <parrt@...>
        > >To: <antlr-interest@onelist.com>
        > >Sent: Wednesday, January 05, 2000 8:15 AM
        > >Subject: RE: [antlr-interest] Lexing strings (ANTLR bug?)
        > >
        > >
        > >> From: <parrt@...>
        > >>
        > >> Luke Blanshard writes:
        > >> >From: "Luke Blanshard" <Luke@...>
        > >> >
        > >> >> What does ~'a' mean? It can only be meaningful if you specify the
        > >> >> vocabulary. Is it Korean and English? Kanji? French? There is no
        > >> >> way to get around a charVocabulary option I'm afraid.
        > >> >
        > >> > Not sure I'd agree with that. To me, ~'a' means any Unicode character
        > >> >other than lowercase a. That is certainly what I'm looking for when writing
        > >> >string-literal lexers, for example (though usually 'a' is not the negated
        > >> >character!).
        > >>
        > >> So...you would allow kanji characters in your French input? ;)
        > >
        > >Why not? I get mixed Japanese and English email.
        > >I am working on a translator for a new database language and in the Japanese
        > >version the database could well have French place names in the database, so
        > >I need to pass through French characters in the comparison string values
        > >eg. field1 = "...."
        >
        > In your case it is what you want, but in many cases French is not
        > valid in your Chinese textbook.

        Well OK French is pushing it, but the Japanese are starting to freely mix
        English and Japanese. For example when coding they use English for the
        program and Japanese for the comments.

        In my case not only the database contents but also the field names may be
        in English or Japanese.

        I really like the way I can knock up a quick translator in Antlr and I am
        writing one to handle the differences between the user interface and what
        the database engine can actually do. (Translate a nice language to a dirty
        one).
        Then as the back end database improves I can just upgrade the translator and
        leave the user interface and the rest of the code the same.

        However, the back end database is developed in Japan and they expect to have
        access to our new user interface as well (well they are paying for it).
        This is one of the reasons I chose to develop the user interface in Java:
        because of the international support (and I like the language as well :-) )

        So how do I handle Japanese input from the user interface via the Antlr
        translator?

        Matthew Ford
      • Braden N. McDaniel
        Message 3 of 24 , Jan 5, 2000
          On Tue, 4 Jan 2000, Luke Blanshard wrote:

          > From: "Luke Blanshard" <Luke@...>
          >
          > > From: "Braden N. McDaniel" <braden@...>
          > > ...
          > > > > The point is moot, really, as this still does not solve the problem of
          > > > > ANTLR assuming that a newline should terminate a string. Is this a
          > > > > bug/deficiency in ANTLR?
          > > >
          > > > Since you haven't tried it, you can't possibly make this comment.
          > >
          > > Why did you make this assumption? Of course I tried it. As far as I
          > > can tell, my comment is accurate, and my question stands.
          >
          > Sorry, I guess I was reading your comments too literally. To my way of
          > thinking, the sentence "that doesn't strike me as something that would work"
          > is a hypothetical, not an experience report.
          >
          > It's possible you're being tripped up by one of Antlr's (IMHO) actual
          > deficiencies, namely its handling of characters. Unless you mention a
          > character explicitly, Antlr treats it as not being present in the set of
          > characters being lexed. This is analogous to its treatment of tokens, but
          > is frankly misleading. To get around this, you need the "charVocabulary"
          > option in your lexer, specifying the range of characters possible on input.

          Aha... charVocabulary was the missing element in the equation for me.
          Thanks.
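
          To illustrate why the vocabulary matters, here is a minimal Java
          sketch of what ~'a' could mean under two different vocabularies. It is
          not ANTLR's implementation; the class CharComplement and the helper
          complement() are hypothetical names made up for this example, and the
          two ranges are only sample vocabularies.

          ```java
          import java.util.BitSet;

          public class CharComplement {
              // Hypothetical helper: every character in the vocabulary [lo, hi]
              // except the negated one. The result depends entirely on lo and hi.
              public static BitSet complement(char excluded, char lo, char hi) {
                  BitSet set = new BitSet(hi + 1);
                  set.set(lo, hi + 1);   // vocabulary: lo through hi inclusive
                  set.clear(excluded);   // drop the negated character
                  return set;
              }

              public static void main(String[] args) {
                  // Sample vocabularies (assumptions): 7-bit ASCII vs. most of the BMP.
                  BitSet ascii = complement('a', '\u0003', '\u007F');
                  BitSet unicode = complement('a', '\u0003', '\uFFFE');
                  char kanji = '\u6F22';

                  System.out.println(ascii.get('b'));      // true
                  System.out.println(ascii.get(kanji));    // false: outside the vocabulary
                  System.out.println(unicode.get(kanji));  // true: inside the wider vocabulary
              }
          }
          ```

          With the narrow vocabulary a kanji character is simply not part of
          ~'a'; with the wide one it is, which is the behaviour Luke was asking for.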

          However, I think I've decided that ANTLR is not the right tool for the job
          for my lexer. I've been able to write a lexer by hand that compiles to
          around 22k. The ANTLR-generated lexer is about 100k bigger than that.

          I'm still using ANTLR for my parser... at least until I decide I need to
          add Unicode support. The language I'm parsing (VRML97) only allows
          ISO-10646 characters in identifiers, but string literals can include any
          character expressible in UTF8. Presently, it is *extremely* uncommon to
          encounter multibyte characters in a string literal. (I don't know of any
          VRML parser that will accommodate this.) But it *is* possible, and at some
          point I'd like to support it.

          --
          Braden N. McDaniel
          braden@...
          <URL:http://www.endoframe.com>
        • Luke Blanshard
          Message 4 of 24 , Jan 5, 2000
            > From: Braden N. McDaniel [mailto:braden@...]
            ...
            > I'm still using ANTLR for my parser... at least until I decide I need to
            > add Unicode support...

            Just to clarify things here, there is no relation between Unicode support
            and parsing. The Unicode issue only pertains to lexers, not parsers. If
            your hand-coded lexer can handle Unicode, you will be fine.

            Luke
          • Braden N. McDaniel
            Message 5 of 24 , Jan 6, 2000
              On Wed, 5 Jan 2000, Luke Blanshard wrote:

              > > From: Braden N. McDaniel [mailto:braden@...]
              > ...
              > > I'm still using ANTLR for my parser... at least until I decide I need to
              > > add Unicode support...
              >
              > Just to clarify things here, there is no relation between Unicode support
              > and parsing. The Unicode issue only pertains to lexers, not parsers. If
              > your hand-coded lexer can handle Unicode, you will be fine.

              Well, theoretically.

              In practice, on the C++ side of ANTLR, a Token's text is stored as a
              std::string, which is a char string. chars are 1 (byte).

              I suppose in my case I could wait until after parsing to convert the
              variable-byte character strings to wide character strings, though it
              strikes me as tidier to do this when scanning.
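
              The "convert after parsing" idea can be sketched as follows. This is
              a Java illustration only (the thread's C++ code is not shown), and
              the class Utf8Widen and the helper widen() are hypothetical names. It
              assumes the token text arrives as raw UTF-8 bytes and is widened
              into 16-bit chars once lexing and parsing are done.

              ```java
              import java.nio.charset.StandardCharsets;

              public class Utf8Widen {
                  // Hypothetical helper: decode variable-byte UTF-8 token text
                  // into a String of 16-bit chars.
                  public static String widen(byte[] utf8) {
                      return new String(utf8, StandardCharsets.UTF_8);
                  }

                  public static void main(String[] args) {
                      // Two kanji (U+6F22, U+5B57), each encoded as three UTF-8 bytes.
                      byte[] raw = {
                          (byte) 0xE6, (byte) 0xBC, (byte) 0xA2,
                          (byte) 0xE5, (byte) 0xAD, (byte) 0x97
                      };
                      String text = widen(raw);

                      System.out.println(raw.length);     // 6
                      System.out.println(text.length());  // 2
                  }
              }
              ```

              The byte count and the character count differ, which is why a
              1-byte char string can carry the text through the parser but
              cannot index it by character.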

              --
              Braden N. McDaniel
              braden@...
              <URL:http://www.endoframe.com>
            • Luke Blanshard
              Message 6 of 24 , Jan 6, 2000
                > From: Braden N. McDaniel [mailto:braden@...]
                ...
                > > The Unicode issue only pertains to lexers, not
                > > parsers. If
                > > your hand-coded lexer can handle Unicode, you will be fine.
                >
                > Well, theoretically.
                >
                > In practice, on the C++ side of ANTLR, a Token's text is stored as a
                > std::string, which is a char string. chars are 1 (byte).

                Whoops, I assumed you were working in Java. I take it all back!