Re: [xenu-usergroup] non-ASCII and non-Latin characters in URLs

  • Stephen Gazard
    Message 1 of 3 , Apr 20, 2007
      On Thu, 19 Apr 2007 09:11:39 -0000, frank visser wrote:

      >
      >I have a question about the use of non-ASCII or non-Latin characters
      >in URLs and how Xenu handles this.
      >
      >For example, on http://www.pickwicktea.com/RU/Home.htm there's a top
      >header link (the third from the left) that goes to this page:
      >http://www.pickwicktea.com/RU/Ограничение+ответственности.htm
      >
      >which, uses non-Latin (Russian) characters in
      >the URL.

      The page is in Unicode (at least, it uses the UTF-8 character set). The
      proper method to deal with such characters (as far as I am aware) is to
      percent-encode them, i.e. write each byte as % followed by two
      hexadecimal digits. The encoded text then contains none of =, ? or &,
      all of which have a special meaning in URLs that pass arguments to
      scripts on the server.
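
      To illustrate why =, ? and & need encoding, here is a minimal sketch
      using Python's standard urllib.parse. The parameter name "q" and the
      value are made up for the example; they are not taken from the site in
      question.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical value that happens to contain =, & and ?
value = "a=b&c?"

# Encoded: the reserved characters become %3D, %26 and %3F
query = urlencode({"q": value})
print(query)

# The server side recovers the original value intact
decoded = parse_qs(query)
print(decoded)

# Unencoded: the & is read as a parameter separator, so the value is cut short
naive = parse_qs("q=" + value)
print(naive)
```

      The unencoded form no longer round-trips, which is the problem the
      encoding avoids.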

      >
      >IE displays this correctly;

      Firefox is doing what I described above. Internet Explorer is NOT
      correct here: it is simply showing those characters, decoded in the
      correct character set, in the URL bar (wrong). It will almost
      certainly still be passing the correctly encoded form on to the server.

      If we take a simple English example of (excluding double quotes): "This
      is the way I walk my dog!?"
      The proper way to encode that as a URI is:
      "This+is+the+way+I+walk+my+dog%21%3F"

      If you really wanted to, the spaces can be converted to %20, giving
      (again excluding double quotes)
      "This%20is%20the%20way%20I%20walk%20my%20dog%21%3F"
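
      Both forms can be reproduced with Python's standard urllib.parse. This
      is only a sketch of the encoding itself, not of what any particular
      browser does; the variable names are my own.

```python
from urllib.parse import quote, quote_plus

sentence = "This is the way I walk my dog!?"

# Spaces become "+", other reserved characters become %XX
plus_form = quote_plus(sentence)
print(plus_form)

# Spaces become %20; safe="" leaves no reserved character unescaped
percent_form = quote(sentence, safe="")
print(percent_form)
```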

      >
      >In Firefox, this URL gets displayed in the browser address field as:
      >http://www.pickwicktea.com/RU/%D0%9E%D0%B3%D1%80%D0%B0%D0%BD%D0%B8%D1%
      >87%D0%B5%D0%BD%D0%B8%D0%B5+%D0%BE%D1%82%D0%B2%D0%B5%D1%82%D1%81%D1%82%
      >D0%B2%D0%B5%D0%BD%D0%BD%D0%BE%D1%81%D1%82%D0%B8.htm
      >
      >but the page is still displayed correctly.

      As I said above, this is correct and good behaviour.
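
      For what it's worth, the form Firefox shows is what you get when the
      file name is converted to UTF-8 bytes and each non-ASCII byte is
      percent-encoded. A minimal sketch with Python's urllib.parse follows;
      keeping "+" unescaped is my assumption, made so that the result matches
      the URL quoted above.

```python
from urllib.parse import quote, unquote

# File name from the link in question
page = "Ограничение+ответственности.htm"

# Percent-encode the UTF-8 bytes; "+" is kept as-is (assumption, see above)
encoded = quote(page, safe="+", encoding="utf-8")
url = "http://www.pickwicktea.com/RU/" + encoded
print(url)

# Decoding as UTF-8 gives back the original Russian file name
restored = unquote(encoded, encoding="utf-8")
print(restored)
```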

      >so this leads to the following questions:
      >
      >1. is it illegal to use non-ASCII characters in URLs?

      That's a matter of debate, what with international domain names and
      all, because it's hard to implement properly and can lead to flaws. See
      http://www.theregister.co.uk/2005/02/07/browsers_idn_spoofing/ Ideally
      you would percent-encode all URLs.

      >2. is it illegal to use non-Latin characters in URLs?

      Not sure on that one.

      >3. why can IE parse this URL, but Firefox or Xenu can't?

      IE and Firefox parse the URL; Firefox displays the correct URL in the
      address bar, while IE does not. Since Xenu runs off the Trident
      rendering engine built into Internet Explorer, it uses what IE gives it,
      which may be wrong. Please note I have no experience embedding the
      Trident engine at all.

      >4. why does Xenu report an OK first, but a 400 next?

      That's for Tilman to answer. It should be related to what he gets back
      from the Trident engine.