[eiffel-nice-library] Re: A few ARRAY/STRING issues

  • Durchholz, Joachim
    Message 1 of 8 , Oct 8, 1999
      > -----Original Message-----
      > From: franck.arnaud@... [mailto:franck.arnaud@...]
      >
      > > (Windows NT being the important exception. I don't know how
      > > Eiffel is supposed to open files with Chinese names while
      > > sticking to the current ELKS vintage.)
      >
      > You just need a compiler that implements current ELKS with wide
      > characters. It is a limitation of the compilers, not the
      > standard. Could you tell me where in ELKS'95 it is written that
      > a CHARACTER cannot have say 300 bits?

      It's not explicit. It's basically in to_c (well, not strictly - it would
      just become horribly inefficient to implement to_c without it).
      As you can see, I still see to_c as an important part of STRING. Dropping
      that, I agree you can use any encoding... but I see no point in it. STRING
      is for character I/O and for file names, which pretty much defines what a
      STRING can be.

      > > BTW the ELKS 95 idea of a CHARACTER is highly dependent on
      > > 7-bit ASCII,
      >
      > Where?

      OK <blush> - you're right, I should have checked the original before saying
      this: The ELKS 95 definition of CHARACTER has just a single reference to
      ASCII - in its header comment.
      What I meant were features like "min_lowercase_letter" and
      "max_lowercase_letter" that assumed that every character between these must
      be a letter (valid only for 7-bit ASCII, not valid for any flavor of 8-bit
      ASCII, EBCDIC, or Unicode). I'm not sure where I have seen these, maybe in
      OOSC?
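
A minimal sketch of the range assumption discussed above, in Python rather than Eiffel. The helper name `chars_between` is hypothetical, and Python's `cp037` codec stands in for EBCDIC. It lists every character whose code lies between 'a' and 'z' in a single-byte encoding, which is what a min_lowercase_letter/max_lowercase_letter pair would implicitly cover.

```python
import string


def chars_between(first, last, encoding):
    """Return every character whose code lies between first and last
    (inclusive) in the given single-byte encoding."""
    lo = first.encode(encoding)[0]
    hi = last.encode(encoding)[0]
    return [bytes([code]).decode(encoding) for code in range(lo, hi + 1)]


# In ASCII the codes from 'a' to 'z' are contiguous.
ascii_range = chars_between("a", "z", "ascii")
# In EBCDIC (code page 037) the lowercase letters sit in three separate runs.
ebcdic_range = chars_between("a", "z", "cp037")

ascii_ok = all(c in string.ascii_lowercase for c in ascii_range)
ebcdic_ok = all(c in string.ascii_lowercase for c in ebcdic_range)

print(len(ascii_range), ascii_ok)
print(len(ebcdic_range), ebcdic_ok)
```

Under ASCII the range holds exactly the 26 letters. Under code page 037 it holds more than 26 characters, so a test of the form "min <= c <= max" would accept non-letters there.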

      > > Worse, STRING assumes that a character has constant size,
      > > which doesn't hold for UTF8-encoded strings
      >
      > No, it does not.

      You're right again.
      It would even work if STRING were an ARRAY [CHARACTER].
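
A small sketch of why a character-indexed interface need not fix the size of a character in storage. It uses Python's `str` as a stand-in for STRING, which is an assumption for illustration only. The same text is encoded two ways, while indexing by character is unaffected.

```python
text = "na\u00efve"  # five characters; the third is U+00EF

utf8_bytes = text.encode("utf-8")      # U+00EF takes two bytes here
latin1_bytes = text.encode("latin-1")  # every character takes one byte here

# The character-level view is the same whatever the storage encoding.
char_count = len(text)
third_char = text[2]

print(char_count, len(utf8_bytes), len(latin1_bytes))
print(third_char == "\u00ef")
```

The character count and the result of indexing stay the same in both cases. Only the byte length of the external representation changes.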

      > > I agree there's an issue here. I don't agree that we'll
      > > even get closer to a solution by keeping the door open for
      > > multibyte CHARACTER types: that would just break existing
      > > code without giving any real advantage.
      >
      > How would that break existing code? First, byte=character
      > implementations would still be valid,

      Of course. The problem is in the code that uses STRING, not in the code that
      defines STRING.

      > and besides I think it breaks very few things after you've
      > checked the external interfaces.

      That's exactly the point. External interfaces tend to be large. I wouldn't
      want to check the wrappers for the MS Windows API, even if it were just a
      change from STRING to MEMORY_BLOCK. I suppose it isn't different for XLib,
      GTK, or other wrappers.
      Of course, we could say that these issues go away with SWIG or another
      automated tool for wrapper generation.


      Well, you have convinced me that character set issues are not a real
      problem. I still have two reservations:
      1) There must be a better design available before marking to_c and
      from_c/from_memory as obsolete.
      2) Switching from STRING representation to a new low-level representation
      should be reasonably easy. You won't get many adherents to a new standard if
      there's serious work involved on the application programmer's side.

      Regards,
      Joachim
      --
      This is a private communication, not a statement from my employer.
    • franck.arnaud@omgroup.com
      Message 2 of 8 , Oct 8, 1999
        > It's not explicit. It's basically in to_c (well, not strictly - it would

        Agreed. That's why I think to_c should be moved to a mixin
        class (among other reasons). Still, even this to_c is not incompatible
        with wide characters: you can generate a new copy of a (char*)
        string and simply truncate what is >255. to_c is not necessarily
        a pointer to the STRING storage, it can be a function that creates
        a new block on demand.
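
A hedged sketch of the "new block on demand" idea, in Python rather than Eiffel. It assumes that "truncate what is >255" means keeping only the low 8 bits of each character code. Dropping or replacing such characters would be an equally plausible reading.

```python
def to_c(text):
    """Build a fresh NUL-terminated byte block from text.

    Assumption: codes above 255 are truncated to their low 8 bits.
    The result is a new object, not a view of the string's own storage.
    """
    return bytes(ord(ch) & 0xFF for ch in text) + b"\x00"


plain = to_c("abc")       # all codes fit in one byte
wide = to_c("A\u0141")    # U+0141 truncates to 0x41

print(plain)
print(wide)
```

Because the block is generated on each call, the internal representation of the string is free to use characters of any width.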

        > STRING is for character I/O and for file names, which pretty
        > much defines what a STRING can be.

        Is it? And quite a few OSes have Unicode filenames (Windows,
        even Linux to some extent).

        > The ELKS 95 definition of CHARACTER has just a single reference to
        > ASCII - in its header comment.

        Yes, and it's wrong, as it tends to indicate that characters >127 are not
        CHARACTERs, which is not what was intended (and implemented).

        >> and besides I think it breaks very few things after you've
        >> checked the external interfaces.

        > That's exactly the point. External interfaces tend to be large. I wouldn't
        > want to check the wrappers for the MS Windows API,

        It does not seem that difficult to me, and I'd expect compilers
        to support switches (in Ace files or similar) to compile characters
        to one or two bytes for a given system, so you could still use
        the byte=character assumptions when compiling systems that
        use old wrappers.

        > 1) There must be a better design available before marking to_c and
        > from_c/from_memory as obsolete.

        I think the first step is to (a) move them to a mixin class (b) document
        them as being based on byte=character. It involves no semantic
        change, just better documentation, and place them in a mixin class
        where it is easy to add more external features for other formats
        (you don't really want to have 'to_c', 'to_wchar_c', 'to_java',
        'to_utf8_c', 'to_whatever' in STRING in a few years time).
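
One way to picture the mixin arrangement, sketched in Python rather than Eiffel. The class names `CInterop` and `InteropString` are hypothetical. The feature names come from the list above, and the choice of Latin-1 for `to_c` and UTF-16LE for `to_wchar_c` is an assumption.

```python
class CInterop:
    """Hypothetical mixin holding the external-format conversions,
    so that the string class itself stays free of them."""

    def to_c(self):
        # Documented as byte=character: raises UnicodeEncodeError
        # if any character code is above 255.
        return self.encode("latin-1") + b"\x00"

    def to_utf8_c(self):
        # NUL-terminated UTF-8 block.
        return self.encode("utf-8") + b"\x00"

    def to_wchar_c(self):
        # Assumes 16-bit little-endian wide characters.
        return self.encode("utf-16-le") + b"\x00\x00"


class InteropString(CInterop, str):
    """A string type that opts in to the external conversions."""


name = InteropString("caf\u00e9")
print(name.to_c())
print(name.to_utf8_c())
print(name.to_wchar_c())
```

New external formats can then be added to the mixin without touching the string class or the code that only uses its character-level interface.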
      • Durchholz, Joachim
        Message 3 of 8 , Oct 8, 1999
          > -----Original Message-----
          > From: franck.arnaud@... [mailto:franck.arnaud@...]
          >
          > >> and besides I think it breaks very few things after you've
          > >> checked the external interfaces.
          >
          > > That's exactly the point. External interfaces tend to be
          > > large. I wouldn't want to check the wrappers for the MS Windows
          > > API,
          >
          > It does not seem that difficult to me, and I'd expect compilers
          > to support switches (in Ace files or similar) to compile characters
          > to one or two bytes for a given system, so you could still use
          > the byte=character assumptions when compiling systems that
          > use old wrappers.

          Usually, systems will migrate slowly from old to new. I.e. you have a
          mixture of single-byte and double-byte character sets. (If the system is
          small enough to allow switching everything in a single pass, the system is
          so small that converting is no problem regardless of how much work per call is
          involved). A compiler switch will probably not work too well.
          Besides, I don't think compiler implementers will be too grateful if you
          propose this. It's a new feature that can break existing code, which means
          additional effort to get it right and make it work.

          > > 1) There must be a better design available before marking to_c and
          > > from_c/from_memory as obsolete.
          >
          > I think the first step is to (a) move them to a mixin class
          > (b) document them as being based on byte=character. It
          > involves no semantic change, just better documentation, and
          > place them in a mixin class where it is easy to add more
          > external features for other formats

          Step (b) is no problem and should be done.
          I don't see much gain in step (a). We're currently working on STRING and
          ARRAY, not on low-level access classes, so it's quite likely that any design
          that we come up with right now will be changed with the next iteration. So
          I'd prefer to leave the thing as it is right now and postpone the issue
          until we have a better idea of How Low-Level Interfacing Should Work.

          > (you don't really want to have 'to_c', 'to_wchar_c', 'to_java',
          > 'to_utf8_c', 'to_whatever' in STRING in a few years time).

          Agreed.

          I think we disagree on the schedule, not on the objectives.

          Regards,
          Joachim
          --
          This is a private communication, not a statement from my employer.