Re: [xml-doc] Can't Get UTF Characters to Work
- Adam Ophir Shapira <red_angel@...> writes:
> David Sewell wrote:[...]
> > (2) what editor or editing software are you using to create the file?
> I use both -vi- and -BBEdit-.
> > (3) what are you using to display output?
> After exporting from XML to HTML, I looked at it both with
> Mozilla and with Safari.
> Instead of getting "[eacute]" I got "[Atilde+copy]".
> It appeared that way both on Firefox and on Safari.
I know you already got an answer to your question, but note that
rather than trying to figure out what the character is by checking
how the file contents are displayed in a browser or whatever, you
can use a hex-dump utility or hex editor to determine exactly what
the character is -- the hexedit or xxd commands if you're working
in a command-line environment, or whatever equivalent is built
into your editing app.
I would guess BBEdit has some kind of hex mode. In Emacs, you can
do "M-x hexl-mode". In Vim, you can do ":%! xxd".
Regardless of what you use, what you'll see is something like this:
  00002a0: 6164 3f0a 0a49 6e73 7465 6164 206f 6620  ad?..Instead of 
  00002b0: 6765 7474 696e 6720 22e9 2220 4920 676f  getting "." I go
  00002c0: 7420 22c3 a922 2e0a 0a49 7420 6170 7065  t ".."...It appe
That's a fragment of your file as seen by xxd. It shows the file
using one line for every sixteen bytes. It shows the hexadecimal
value for every byte in the file, along with an ASCII
representation of the contents (at the far right). Bytes that
can't be displayed with an ASCII character are shown with a dot.
To figure out what a particular dot corresponds to, you count
over. So for the dot in the "getting" line -- which is where the
acute e character shows up in your original message -- you can see
that corresponds to the single hex value "e9". And in the next
line down, you'll see that the borked stuff showing up when you
display it in a browser is two bytes, "c3a9".
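If you'd rather not count dots by hand, the same check can be scripted. What follows is only a rough sketch in Python, not something from the original thread: the helper name non_ascii_bytes and the sample bytes are made up for illustration (the sample mimics the fragment above), and it simply lists the offset and hex value of every byte above 7f.

```python
def non_ascii_bytes(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, two-digit hex) for every byte outside the ASCII range."""
    return [(i, f"{b:02x}") for i, b in enumerate(data) if b > 0x7F]


# Hypothetical sample mimicking the fragment of the message shown above.
sample = b'getting "\xe9" I got "\xc3\xa9".'

found = non_ascii_bytes(sample)
for offset, hex_value in found:
    print(f"offset {offset}: {hex_value}")
```

For a real file you would read it in binary mode (open(path, "rb")) so that no decoding happens before you look at the bytes.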
Selection of the glyphs that are used to display those bytes when
you view them in some app depends on what encoding the application
thinks your file is in. In the case of your mail message, your mail
client sent it with the following header:
  Content-Type: text/plain; charset=ISO-8859-1
So when I view it in my mail client, that e9 is displayed with an
"e with acute accent" glyph -- as expected, because in ISO-8859-1
encoding, a single e9 = eacute -- and the c3a9 pair shows up
borked. Because in ISO-8859-1, c3+a9 = Atilde+copy (capital A with
a tilde, followed by the copyright symbol).
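To see that reading for yourself, here is a small Python sketch (my own illustration, assuming Python's built-in "iso-8859-1" codec stands in for what the mail client does). It decodes the same bytes as ISO-8859-1 and prints the Unicode names of the resulting characters, so nothing depends on your terminal's own encoding.

```python
import unicodedata

single = bytes.fromhex("e9")    # the byte that displayed correctly
pair = bytes.fromhex("c3a9")    # the two bytes that showed up borked

# In ISO-8859-1 every byte maps to exactly one character.
single_names = [unicodedata.name(ch) for ch in single.decode("iso-8859-1")]
pair_names = [unicodedata.name(ch) for ch in pair.decode("iso-8859-1")]

print(single_names)
print(pair_names)
```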
But if the charset part of your message's Content-Type header had
"charset=UTF-8" instead, the c3a9 would actually be displayed with
an "e with acute accent" glyph, and the e9 would show up with some
(undefined) strange character -- a black or white box, or a
black diamond with a question mark, or maybe even just a question
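mark. The reason being that in UTF-8, a lone e9 byte is never a
complete character: e9 can only be the first byte of a three-byte
sequence, and the byte that follows it here (22, the quotation
mark) is not a valid continuation byte.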
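The UTF-8 reading can be sketched the same way. This is an illustration only, assuming Python's "utf-8" codec behaves like a typical UTF-8 consumer; the sample bytes are made up to contain both forms. A strict decode rejects the lone e9, while a lenient decode substitutes U+FFFD (the replacement character, which many apps draw as that black diamond with a question mark).

```python
# Hypothetical sample holding both the lone e9 and the c3a9 pair.
raw = b'"\xe9" and "\xc3\xa9"'

try:
    raw.decode("utf-8")
    strict_ok = True
    bad_offset = None
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_offset = exc.start      # offset of the byte that could not be decoded

# errors="replace" swaps each undecodable byte for U+FFFD instead of failing.
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok, bad_offset)
print(ascii(lenient))           # ascii() keeps the output terminal-safe
```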
If you look up the character "e9" in a Unicode character database
of some kind, like the one at the Zvon site, you might be led to
conclude that e9 in "Unicode" should be displayed as an eacute,
just as it is in ISO-8859-1.
If you look at that page, it'll tell you that the e9 corresponds
to the Unicode character "LATIN SMALL LETTER E WITH ACUTE". But
what it doesn't tell you anything at all about is what that
character corresponds to in a particular Unicode encoding. Most of
the time, what you'd probably want to know is what it is in UTF-8,
which isn't the same as its Unicode code point value. The reason
is that in UTF-8, unlike ISO-8859-1, every character outside the
ASCII range is represented by two or more bytes; eacute and the
other accented Latin-1 letters take exactly two.
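For the curious, the two-byte form is easy to work out by hand. The sketch below is my own illustration (the function name utf8_two_byte is made up); it applies the UTF-8 bit layout for code points from 0080 through 07ff, where the top five bits go into a lead byte of the form 110xxxxx and the low six bits go into a continuation byte of the form 10xxxxxx.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 0x80..0x7FF use the two-byte form")
    lead = 0xC0 | (code_point >> 6)            # 110xxxxx: top five bits
    continuation = 0x80 | (code_point & 0x3F)  # 10xxxxxx: low six bits
    return bytes([lead, continuation])


encoded = utf8_two_byte(0xE9)
print(encoded.hex())
```

So the code point 00e9 splits into the bits 00011 and 101001, which become c3 and a9.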
There's a very good online reference that will tell you what the
hex values are for UTF-8-encoded versions of Unicode code points --
the "letter database" at the Institute of the Estonian Language:
If you look at the bottom of the left-hand column, you can see
that it says Unicode 00e9 corresponds to c3a9 in UTF-8. Another
part of the page tells you what it corresponds to in other
charsets (for example, e9 in ISO-8859-1).
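If you have Python handy, you can produce a similar lookup yourself. This is just a sketch using Python's built-in codecs rather than the letter database itself, and the choice of three charsets is mine.

```python
E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Hex value of the same character in a few encodings.
encoded = {
    name: E_ACUTE.encode(name).hex()
    for name in ("utf-8", "iso-8859-1", "utf-16-be")
}

for name, hex_value in encoded.items():
    print(f"{name}: {hex_value}")
```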
So in the case of your content, as others on the list have pointed
out, you just need to tell your application that the contents are
UTF-8 encoded instead of ISO-8859-1 encoded. If that application
happens to be a web browser and your content is being served up to
the browser from a Web server, one common problem is that many
Apache web servers are configured to serve up pages with a
particular charset setting in the HTTP headers. And the default
value for that setting is "ISO-8859-1".
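When you are checking what a server or mail client actually declared, it can help to pull the charset out of the Content-Type value mechanically. Here is a minimal sketch using Python's standard email package; the function name charset_from_content_type is made up, and the "iso-8859-1" fallback is only an assumption that mirrors the default described above, not a statement about what any particular browser does.

```python
from email.message import Message


def charset_from_content_type(value: str, default: str = "iso-8859-1") -> str:
    """Return the lowercased charset parameter of a Content-Type value.

    The default is an assumption for illustration; it is returned when the
    header carries no charset parameter at all.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/plain; charset=ISO-8859-1"))
print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```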