Time for another question about Unicode support

  • David Ewing
    Message 1 of 3 , Oct 23 10:29 PM
      I've been using ANTLR for a while now, and I need to get it to handle
      Unicode input. We use ANTLR to parse Java source code for indexing
      information in Project Builder, Apple's IDE for Mac OS X.

      It's not obvious to me how much work has gone on in this area for 2.7.2.
      Scanning the list archives, it looks like some work has been done to support
      it in Java parsers, but not for C++ parsers. Of course, we're generating a
      C++ parser. (Yes, we use a C++ parser, called from Objective C, to parse
      Java code!)

      So, in my search for what to do along these lines, I ran into ICU
      (International Components for Unicode), an open source library from IBM
      <http://oss.software.ibm.com/icu>. Older versions of it are the basis of the
      i18n classes in the JDK. There are both Java and C++ versions. It seems to
      contain appropriate character set classes, which might solve that issue on
      the C++ side. So, has using ICU been considered for ANTLR?

      I may be able to help out in this effort, though for me that would mean
      starting work on it soon. My guess is that my time pressures will mean
      writing a custom lexer to deal with Unicode. Something that would return IDs
      with UTF-8 strings. But I'd rather not do it that way. I'd rather help out
      adding the support "the right way".

      Anyhow, any info or recommendations would be greatly appreciated.

      Dave
      --
      David Ewing, Mac OS X Development Apps, Apple Computer
      --
    • Ric Klaren
      Message 2 of 3 , Oct 24 2:24 AM
        Hi,

        On Tue, Oct 23, 2001 at 11:29:17PM -0600, David Ewing wrote:
        > the C++ side. So, has using ICU been considered for ANTLR?

        I've looked at it and at a few others (ICU looked quite nice, maybe the
        nicest). Then again, I don't know if I want a dependency on some external
        library (without support in antlr's C++ codegen to switch between
        support libraries, etc.).

        So far I've ditched any attempts at Unicode for C++. (I have no personal
        interest in it, no interest from the project I'm working on (so my boss
        won't sponsor it), and the subject is way too hairy (and uninteresting) to
        spend my free time on.)

        > I may be able to help out in this effort, though for me that would mean
        > starting work on it soon. My guess is that my time pressures will mean
        > writing a custom lexer to deal with Unicode. Something that would return IDs
        > with UTF-8 strings. But I'd rather not do it that way. I'd rather help out
        > adding the support "the right way".

        If you are willing to really look into this, then I can only cheer you on =)
        and help.

        > Anyhow, any info or recommendations would be greatly appreciated.

        See this post/thread for some thoughts I spewed out on this in the past:

        http://groups.yahoo.com/group/antlr-interest/message/3973

        Ric
        --
        -----+++++*****************************************************+++++++++-------
        ---- Ric Klaren ----- klaren@... ----- +31 53 4893722 ----
        -----+++++*****************************************************+++++++++-------
        Wit is cultured insolence. - Aristotle
      • David Ewing
        Message 3 of 3 , Oct 24 7:53 AM
          Ric,

          That's pretty much where I thought things were. I had read your other
          message already, but had overlooked your reference to ICU. I actually found
          out about ICU by looking at the jikes sources - I also need to get it to
          support different encodings on Mac OS X.

          Personally, I'd say if you're going to depend on an external library for
          Unicode, ICU is the way to go. Unfortunately, I don't know the internals of
          the antlr library well enough to take this on alone. At least not
          considering my time constraints. Writing a lexer that handles Unicode is
          probably less than a week's worth of work, since my task of parsing Java is
          so narrow in scope (compared to generic Unicode support in antlr). Adding
          generic support is probably an order of magnitude more work. If there had
          been enough work done to give me a head start, I might have been able to
          take it on. But that isn't the case. Oh well.

          Thanks,
          Dave
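
For readers wondering what the narrow, Java-only approach discussed above might involve, here is a minimal sketch of a hand-written scanner that reads UTF-8 input and returns an identifier as a UTF-8 string. It is not code from this thread or from ANTLR. The names `decodeUtf8`, `isIdStart`, `isIdPart`, and `scanIdentifier` are hypothetical, and the identifier test is a crude assumption (ASCII letters, `_`, `$`, and every non-ASCII code point); a real lexer would consult proper Unicode character properties, for example through a library such as ICU.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one UTF-8 sequence starting at s[pos]. On success, store the code
// point in cp and return the number of bytes consumed. Return 0 at end of
// input or on a malformed/truncated sequence. (Sketch only: overlong
// encodings and surrogate code points are not rejected.)
std::size_t decodeUtf8(const std::string& s, std::size_t pos, std::uint32_t& cp) {
    if (pos >= s.size()) return 0;
    unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (pos + len > s.size()) return 0;
    for (std::size_t i = 1; i < len; ++i) {
        unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return len;
}

// Hypothetical stand-in for a real Unicode property lookup: accepts ASCII
// letters, '_', '$', and (as a rough assumption) any non-ASCII code point.
bool isIdStart(std::uint32_t cp) {
    return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') ||
           cp == '_' || cp == '$' || cp >= 0x80;
}

bool isIdPart(std::uint32_t cp) {
    return isIdStart(cp) || (cp >= '0' && cp <= '9');
}

// Scan an identifier beginning at byte offset pos. Returns the identifier's
// UTF-8 bytes, or an empty string if no identifier starts there.
std::string scanIdentifier(const std::string& src, std::size_t pos) {
    const std::size_t start = pos;
    std::uint32_t cp = 0;
    std::size_t n = decodeUtf8(src, pos, cp);
    if (n == 0 || !isIdStart(cp)) return std::string();
    pos += n;
    while ((n = decodeUtf8(src, pos, cp)) != 0 && isIdPart(cp)) pos += n;
    return src.substr(start, pos - start);
}
```

Because the token text is just a byte range of the original input, the scanner never has to re-encode anything; only the classification step needs decoded code points. That is part of why a task-specific lexer is so much smaller than generic Unicode support in a lexer generator.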
