Re: [xenu-usergroup] non-ASCII and non-Latin characters in URLs
- On Thu, 19 Apr 2007 09:11:39 -0000, frank visser wrote:
>I have a question about the use of non-ASCII or non-Latin characters
>in URLs and how Xenu handles this.
>For example, on http://www.pickwicktea.com/RU/Home.htm there's a top
>header link (the third from the left) that goes to this page:
>which uses non-Latin (Russian) characters in
The page is in Unicode (at least it uses the character set UTF-8). The
proper method to deal with such characters (as far as I am aware) is to
percent-encode them in hexadecimal, since the URL then contains none of
=, ? or &, all of which have special meaning in URLs that pass
arguments to scripts on the server.
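As a rough sketch of the encoding described above, here is how Python's standard library percent-encodes a non-Latin path segment as UTF-8 bytes. The Russian word and the helper name `encode_segment` are made up for illustration; they are not taken from the page in question.

```python
from urllib.parse import quote, unquote


def encode_segment(segment: str) -> str:
    """Percent-encode one URL path segment as UTF-8 (hypothetical helper).

    safe="" means reserved characters such as =, ? and & are escaped too.
    """
    return quote(segment, safe="", encoding="utf-8")


encoded = encode_segment("чай")        # illustrative Russian word, not the real link
print(encoded)                         # %D1%87%D0%B0%D0%B9
print(unquote(encoded))                # чай
print(encode_segment("a=b&c?d"))       # a%3Db%26c%3Fd
```

Each Cyrillic letter becomes two UTF-8 bytes, so it shows up as two %XX groups, and the result is pure ASCII.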
>IE displays this correctly;
Firefox is doing what I said above. Internet Explorer is NOT being
correct. It is simply showing the HTML entities for those characters in
the correct character set in the URL bar (wrong). It will almost
certainly be sending the correctly encoded form to the server.
If we take a simple English example of (excluding double quotes): "This
is the way I walk my dog!?"
The proper way to encode that as a URI is:
If you really wanted to, the spaces can be converted to %20, giving
(again excluding double quotes)
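For illustration, both encodings of that sentence can be reproduced with Python's standard library. This is only a sketch of what the two forms look like, assuming the first form uses + for spaces (the usual form-style encoding) and the second uses %20.

```python
from urllib.parse import quote, quote_plus

text = "This is the way I walk my dog!?"

# Form-style encoding: spaces become +, other unsafe characters become %XX.
plus_form = quote_plus(text)

# Strict percent-encoding: spaces become %20 as well.
percent_form = quote(text, safe="")

print(plus_form)     # This+is+the+way+I+walk+my+dog%21%3F
print(percent_form)  # This%20is%20the%20way%20I%20walk%20my%20dog%21%3F
```

In both forms the ! and ? end up as %21 and %3F, so the ? cannot be mistaken for the start of a query string.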
>In Firefox, this URL gets displayed in the browser address field as:
>but the page is still displayed correctly.
As I said above, this is correct and good behaviour.
>so this leads to the following questions:
>1. is it illegal to use non-ASCII characters in URLs?
That's a matter of debate with international domain names and all,
because it's hard to implement properly and can lead to flaws. See
you would encode all URLs to be hexadecimal.
>2. is it illegal to use non-Latin characters in URLs?
Not sure on that one.
>3. why can IE parse this URL, but Firefox or Xenu can't?
IE and Firefox both parse the URL; Firefox displays the correct URL in the
address bar, while IE does not. Since Xenu runs off the Trident
rendering engine built into Internet Explorer, it uses what IE gives it,
which may be wrong. Please note I have no experience embedding the
Trident engine at all.
>4. why does Xenu report an OK first, but a 400 next?
That's for Tilman to answer. It should be related to what he gets back
from the Trident engine.