
Re: [xml-doc] Can't Get UTF Characters to Work

  • Michael(tm) Smith
    Message 1 of 9 , Mar 27, 2006
      Adam Ophir Shapira <red_angel@...> writes:

      > David Sewell wrote:
      > >
      > > (2) what editor or editing software are you using to create the file?
      > I use both -vi- and -BBEdit-.
      > > (3) what are you using to display output?
      > After exporting from XML to HTML, I looked at it both with
      > Mozilla and with Safari.
      > >
      > Instead of getting "[eacute]" I got "[Atilde+copy]".
      > It appeared that way both on Firefox and on Safari.

      I know you already got an answer to your question, but note that
      rather than trying to figure out what the character is by checking
      how the file contents are displayed in a browser or whatever, you
      can use a hex-dump utility or hex editor to determine exactly what
      the character is -- the hexedit or xxd commands if you're working
      in a command-line environment, or whatever equivalent is built
      into your editing app.

      I would guess BBEdit has some kind of hex mode. In Emacs, you can
      do "M-x hexl-mode". In Vim, you can do ":%! xxd".

      Regardless of what you use, what you'll see is something like this:

      00002a0: 6164 3f0a 0a49 6e73 7465 6164 206f 6620 ad?..Instead of
      00002b0: 6765 7474 696e 6720 22e9 2220 4920 676f getting "." I go
      00002c0: 7420 22c3 a922 2e0a 0a49 7420 6170 7065 t ".."...It appe

      That's a fragment of your file as seen by xxd. It shows the file
      using one line for every sixteen bytes. It shows the hexadecimal
      value for every byte in the file, along with an ASCII
      representation of the contents (at the far right). Bytes that
      can't be displayed with an ASCII character are shown with a dot.
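
      If you don't have xxd handy, the same view is easy to
      approximate. Here is a minimal Python sketch (the hexdump
      function and the sample bytes are my own illustration, not part
      of xxd) that prints an offset, the hex values, and a dotted ASCII
      column:

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes roughly the way xxd does: offset, hex pairs, ASCII column."""
    # Each pair of bytes takes 4 hex digits plus a separating space.
    hex_width = (width // 2) * 5 - 1
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Group the hex digits two bytes at a time, as xxd does by default.
        digits = chunk.hex()
        groups = " ".join(digits[i:i + 4] for i in range(0, len(digits), 4))
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:07x}: {groups.ljust(hex_width)}  {text}")
    return "\n".join(lines)


# Sample bytes (an assumption for illustration): an ISO-8859-1 eacute (e9)
# followed by a UTF-8 eacute (c3 a9).
sample = b'getting "\xe9" I got "\xc3\xa9".\n'
print(hexdump(sample))
```

      For these sample bytes, the hex and ASCII columns of the first
      output line match those of the "getting" line in the dump above.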

      To figure out what a particular dot corresponds to, you count
      over. So for the dot in the "getting" line -- which is where the
      acute e character shows up in your original message -- you can see
      that corresponds to the single hex value "e9". And in the next
      line down, you'll see that the borked stuff showing up when you
      display it in a browser is two bytes, "c3a9".

      Which glyphs are used to display those bytes when you view them
      in some app depends on what encoding the application thinks your
      file is in. In the case of your mail message, your mail client
      sent it with the following header:

      Content-Type: text/plain; charset=ISO-8859-1

      So when I view it in my mail client, that e9 is displayed with an
      "e with acute accent" glyph -- as expected, because in ISO-8859-1
      encoding, a single e9 = eacute -- and the c3a9 pair shows up
      borked, because in ISO-8859-1, c3+a9 = Atilde+copy (capital A with
      a tilde, followed by the copyright symbol).
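
      You can reproduce that ISO-8859-1 reading yourself. A small
      Python sketch (the variable names are mine, chosen for
      illustration) that decodes both byte sequences the way an
      ISO-8859-1 viewer would:

```python
# The two byte sequences from the hex dump.
single_byte = bytes.fromhex("e9")
byte_pair = bytes.fromhex("c3a9")

# Decode both as ISO-8859-1, where every byte maps to exactly one character.
as_latin1_single = single_byte.decode("iso-8859-1")
as_latin1_pair = byte_pair.decode("iso-8859-1")

print(repr(as_latin1_single))  # one character: e with acute accent
print(repr(as_latin1_pair))    # two characters: A with tilde, copyright sign
```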

      But if the charset part of your message's Content-Type header had
      "charset=UTF-8" instead, the c3a9 would actually be displayed with
      an "e with acute accent" glyph, and the e9 would show up as some
      placeholder for an undecodable byte -- a black or white box, a
      black diamond with a question mark, or maybe even just a question
      mark. The reason is that in UTF-8, a lone e9 byte is not a valid
      character: e9 can only be the first byte of a three-byte
      sequence.
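
      The UTF-8 side can be checked the same way. This is only a
      sketch, with variable names of my own choosing: it decodes the
      pair as UTF-8, then shows that the lone e9 is rejected by a
      strict decoder and turned into a replacement character by a
      lenient one.

```python
byte_pair = bytes.fromhex("c3a9")
single_byte = bytes.fromhex("e9")

# The two-byte sequence is a well-formed UTF-8 encoding of U+00E9.
as_utf8_pair = byte_pair.decode("utf-8")

# A lone e9 is an incomplete sequence, so strict decoding fails.
try:
    single_byte.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# Lenient decoding substitutes U+FFFD, which many apps draw as a
# black diamond with a question mark.
as_utf8_single = single_byte.decode("utf-8", errors="replace")

print(repr(as_utf8_pair), strict_decode_ok, repr(as_utf8_single))
```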

      If you look up the character "e9" in a Unicode character database
      of some kind, like the one at the Zvon site, you might be led to
      conclude that e9 in "Unicode" should be displayed as an eacute,
      just as it is in ISO-8859-1.


      If you look at that page, it'll tell you that e9 corresponds to
      the Unicode character "LATIN SMALL LETTER E WITH ACUTE". But what
      it doesn't tell you anything at all about is what that character
      corresponds to in a particular Unicode encoding. Most of the
      time, what you'd probably want to know is its byte sequence in
      UTF-8, which isn't the same as its Unicode code point value. The
      reason is that in UTF-8, unlike ISO-8859-1, every character
      outside the ASCII range is represented by two or more bytes.

      There's a very good online reference that will tell you what the
      hex values are for UTF-8-encoded versions of Unicode code points --
      the "letter database" at the Institute of the Estonian Language:


      If you look at the bottom of the left-hand column, you can see
      that it says Unicode 00e9 corresponds to c3a9 in UTF-8. Another
      part of the page tells you what it corresponds to in other
      charsets (for example, e9 in ISO-8859-1).
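
      If you'd rather not depend on an online table, the same lookup
      can be sketched in a few lines of Python using only the standard
      library (the variable names are illustrative):

```python
import unicodedata

ch = "\u00e9"  # the code point in question, U+00E9

char_name = unicodedata.name(ch)             # official Unicode character name
utf8_hex = ch.encode("utf-8").hex()          # bytes used in a UTF-8 file
latin1_hex = ch.encode("iso-8859-1").hex()   # byte used in an ISO-8859-1 file

print(f"U+{ord(ch):04X} {char_name}")
print(f"UTF-8: {utf8_hex}  ISO-8859-1: {latin1_hex}")
```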

      So in the case of your content, as others on the list have pointed
      out, you just need to tell your application that the contents are
      UTF-8 encoded instead of ISO-8859-1 encoded. If that application
      happens to be a web browser and your content is being served up to
      the browser from a Web server, one common problem is that many
      Apache web servers are configured to serve up pages with a
      particular charset setting in the HTTP headers, and the default
      value for that setting is "ISO-8859-1".
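
      To see which charset a given Content-Type header actually
      declares, you can parse it rather than eyeball it. A rough
      sketch using Python's standard email package; the header value
      here is a made-up example, not one captured from a real server:

```python
from email.message import Message

# Hypothetical header value, standing in for whatever the server sends.
header_value = "text/html; charset=ISO-8859-1"

msg = Message()
msg["Content-Type"] = header_value

# get_content_charset() returns the charset parameter, lowercased.
declared_charset = msg.get_content_charset()
print(declared_charset)
```

      If the declared charset is not the one your file is really
      saved in, you get exactly the kind of mismatch described above.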
