Better interfaces to Unicode<==>UTF8 conversion routines in FriBidi?

  • Omer Zak
    Message 1 of 1 , Oct 1, 2000
      Dov Grobgeld's FriBidi package has two procedures for conversion between
      UTF8 and 4-byte Unicode forms of character strings.

      The prototypes of the procedures are:

      -=-=-=->
      void fribidi_unicode_to_utf8 (FriBidiChar *us,
      int length,
      /* Output */
      guchar *s);
      /* warning: the length of output string may exceed the length of the input
      */

      int fribidi_utf8_to_unicode (guchar *s,
      /* Output */
      FriBidiChar *us);
      /* the length of the string is returned */
      -=-=-=->

      (declared in fribidi_char_sets.h).

      These interfaces suffer from a few problems:
      1. Neither procedure takes the size of its output buffer, so unless you
      allocate a worst-case buffer, nothing protects you against accidentally
      overwriting memory beyond your buffers.
      The worst case of converting from Unicode into UTF8 is that a single
      Unicode character (4 octets long) is converted into a 6-octet-long
      UTF8 character sequence.

      Therefore, for safe operation of fribidi_unicode_to_utf8(), your UTF8
      buffer (guchar *s) must be at least 1.5 times longer than your input
      Unicode buffer (FriBidiChar *us), in octets.
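
The 1.5 factor follows from 6 output octets for every 4-octet input character. A minimal sketch of that worst-case sizing, assuming the original UTF8 definition with sequences of up to 6 octets. The helper name utf8_worst_case_size is hypothetical and not part of FriBidi, and the extra octet for a terminating NUL is an assumption that was not checked against the library:

```c
#include <stddef.h>
#include <stdint.h>

/* Worst-case UTF8 octets per character under the original 31-bit scheme. */
#define UTF8_MAX_OCTETS_PER_CHAR 6

/*
 * Returns the number of octets a UTF8 buffer needs so that converting
 * n_chars Unicode characters can never overrun it, or 0 if that size
 * would overflow size_t.  The extra octet leaves room for a terminating
 * NUL in case the conversion routine writes one (an assumption).
 */
size_t utf8_worst_case_size(size_t n_chars)
{
    if (n_chars > (SIZE_MAX - 1) / UTF8_MAX_OCTETS_PER_CHAR)
        return 0;
    return n_chars * UTF8_MAX_OCTETS_PER_CHAR + 1;
}
```

For example, 100 Unicode characters occupy 400 octets on input and may need 600 octets of UTF8 output, plus the optional terminator.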

      2. They are unfriendly to applications that must convert very long
      strings from one format to the other but have little memory to spare:
      you have to convert the entire string at once, which means allocating
      a long buffer and waiting for as long as the whole operation takes
      to complete.

      Also, the buffer areas have to be contiguous.

      I designed better interfaces, as follows:

      -=-=-=->
      gboolean /* Returns TRUE if the outputs are valid, even if the entire
      ** Unicode string was not converted.
      */
      fribidi_unicode_to_utf8_p(FriBidiChar *in_unicode_str,
      /* Unicode string */
      guint in_unicode_length, /* Unicode string length in
      ** Unicode characters
      */
      guchar *utf8_buffer, /* Buffer for UTF8 translation */
      guint utf8_buffer_length, /* Length of UTF8 buffer */
      /* Outputs */
      guint *out_uni_consumed_length_p,
      /* Actual number of Unicode
      ** characters translated
      */
      guint *out_actual_utf8_buffer_length_p);
      /* Actual number of bytes
      ** used in the UTF8 buffer.
      */

      gboolean /* Returns TRUE if the UTF8 string was converted without
      ** errors, and the outputs are valid - even if the entire
      ** UTF8 string was not converted.
      */
      fribidi_utf8_to_unicode_p(guchar *in_utf8_str, /* UTF8 string */
      guint in_utf8_length, /* Length of UTF8 string in octets
      */
      FriBidiChar *unicode_buffer, /* Buffer for Unicode translation */
      guint unicode_buffer_length, /* Length of Unicode buffer in
      ** Unicode characters
      */
      /* Outputs */
      guint *out_utf8_consumed_length_p, /* Actual number of UTF8
      ** octets translated
      */
      guint *out_actual_unicode_buffer_length_p);
      /* Actual number of Unicode
      ** characters used in the
      ** Unicode buffer.
      */
      -=-=-=->

      With the new interfaces, you can convert the input string in pieces,
      each just long enough to fill the buffer allocated for the output string.
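
As an illustration of the piecewise behaviour in the UTF8-to-Unicode direction, here is a minimal sketch and not FriBidi's actual code. It assumes uint32_t as a stand-in for FriBidiChar and plain C types instead of the glib ones; the names utf8_seq_len and utf8_to_unicode_bounded are hypothetical. It stops cleanly when a multi-octet sequence is cut off at the end of the input piece, and it does not reject overlong encodings:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t unichar_t;   /* stand-in for FriBidiChar (assumed 32-bit) */

/* Length of the sequence announced by lead octet b, or 0 if b cannot start one. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;     /* continuation octet, not a lead octet */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 0;                   /* 0xFE and 0xFF never appear in UTF8 */
}

/*
 * Decodes whole characters from in[0..in_len) into out[0..out_len).
 * Stops when the output is full, the input is exhausted, or the last
 * sequence is incomplete.  Returns 1 on success and 0 if it stopped at a
 * malformed octet; *consumed and *produced are valid in both cases.
 */
int utf8_to_unicode_bounded(const unsigned char *in, size_t in_len,
                            unichar_t *out, size_t out_len,
                            size_t *consumed, size_t *produced)
{
    size_t i = 0, o = 0;
    int ok = 1;

    while (i < in_len && o < out_len) {
        size_t n = utf8_seq_len(in[i]);
        size_t k;
        unichar_t c;

        if (n == 0) { ok = 0; break; }
        if (n > in_len - i) break;   /* sequence continues in the next piece */

        /* Payload bits of the lead octet: 7 for n == 1, else 7 - n. */
        c = (n == 1) ? in[i] : (unichar_t)(in[i] & (0xFFu >> (n + 1)));
        for (k = 1; k < n; k++) {
            if ((in[i + k] & 0xC0u) != 0x80u) break;
            c = (c << 6) | (in[i + k] & 0x3Fu);
        }
        if (k < n) { ok = 0; break; }   /* bad continuation octet */

        out[o++] = c;
        i += n;
    }
    *consumed = i;
    *produced = o;
    return ok;
}
```

After each call, the caller would move any unconsumed octets to the front of the next input piece before calling again.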

      An example of using such procedures is to retrieve a long text string
      from a text widget, convert it in pieces of about 4096 bytes each, and
      write each converted piece into a file (or to a TCP socket).
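
The use case above could look roughly like the following sketch. It is not FriBidi code: uint32_t stands in for FriBidiChar, unicode_to_utf8_bounded is a hypothetical stand-in for the proposed fribidi_unicode_to_utf8_p, and the helper names utf8_len and write_as_utf8 as well as the CHUNK size are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t unichar_t;   /* stand-in for FriBidiChar (assumed 32-bit) */
#define CHUNK 4096            /* size of the fixed output buffer, in octets */

/* UTF8 octets needed for c (original 31-bit scheme), or 0 if c is too large. */
static size_t utf8_len(unichar_t c)
{
    if (c < 0x80u)       return 1;
    if (c < 0x800u)      return 2;
    if (c < 0x10000u)    return 3;
    if (c < 0x200000u)   return 4;
    if (c < 0x4000000u)  return 5;
    if (c < 0x80000000u) return 6;
    return 0;
}

/*
 * Converts as many whole characters as fit into out[0..out_size) and
 * reports how many characters were consumed and octets written.
 * Returns 0 only if it stopped at a value that cannot be encoded.
 */
int unicode_to_utf8_bounded(const unichar_t *in, size_t in_len,
                            unsigned char *out, size_t out_size,
                            size_t *consumed, size_t *written)
{
    static const unsigned char lead[7] = {0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};
    size_t i = 0, o = 0;
    int ok = 1;

    while (i < in_len) {
        unichar_t c = in[i];
        size_t n = utf8_len(c);
        size_t k;

        if (n == 0) { ok = 0; break; }
        if (n > out_size - o) break;   /* next character does not fit */

        /* Fill continuation octets from the last one backwards. */
        for (k = n - 1; k > 0; k--) {
            out[o + k] = (unsigned char)(0x80u | (c & 0x3Fu));
            c >>= 6;
        }
        out[o] = (unsigned char)(lead[n] | c);
        o += n;
        i++;
    }
    *consumed = i;
    *written = o;
    return ok;
}

/* Converts text piece by piece and writes each piece to fp.
 * Returns the total number of octets written, or -1 on error. */
long write_as_utf8(const unichar_t *text, size_t len, FILE *fp)
{
    unsigned char buf[CHUNK];
    long total = 0;

    while (len > 0) {
        size_t consumed, written;
        if (!unicode_to_utf8_bounded(text, len, buf, sizeof buf,
                                     &consumed, &written))
            return -1;
        if (consumed == 0)
            return -1;                 /* buffer too small for one character */
        if (fwrite(buf, 1, written, fp) != written)
            return -1;
        text += consumed;
        len -= consumed;
        total += (long)written;
    }
    return total;
}
```

Only the fixed CHUNK-octet buffer is needed regardless of the length of the text, which is the point of the bounded interface.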

      While memory is not much of a problem on PCs that run Linux, it is a
      problem on the embedded device to which I ported FriBidi.

      I would appreciate being told if I overlooked anything obvious in the
      above design that may cause problems for future users of the conversion
      procedures.
      --- Omer
      WARNING TO SPAMMERS: see at http://www.zak.co.il/spamwarning.html