Re: [xenu-usergroup] non-ASCII and non-Latin characters in URLs

  • Tilman Hausherr
    Message 1 of 3 , Apr 19, 2007
      The page itself is in unicode or MBCS or whatever. Xenu itself doesn't
      support this. I'm clueless about it, because I never needed it at work
      :(

      If Firefox supports such weird URLs, then it's probably not illegal :-(

      Tilman

      On Thu, 19 Apr 2007 09:11:39 -0000, frank visser wrote:

      >
      >I have a question about the use of non-ASCII or non-Latin characters
      >in URLs and how Xenu handles this.
      >
      >For example, on http://www.pickwicktea.com/RU/Home.htm there's a top
      >header link (the third from the left) that goes to this page:
      >http://www.pickwicktea.com/RU/Ограничение+ответственности.htm
      >
      >which, as you can see (hopefully this gets across when I post this in
      >the Xenu Yahoo group, which I doubt, because in the Preview of the
      >message they got converted!) uses non-Latin (Russian) characters in
      >the URL.
      >
      >IE displays this correctly; and the URL appears in the browser
      >address field as:
      >http://www.pickwicktea.com/RU/Ограничение+ответственности.htm
      >
      >In Firefox, this URL gets displayed in the browser address field as:
      >http://www.pickwicktea.com/RU/%D0%9E%D0%B3%D1%80%D0%B0%D0%BD%D0%B8%D1%
      >87%D0%B5%D0%BD%D0%B8%D0%B5+%D0%BE%D1%82%D0%B2%D0%B5%D1%82%D1%81%D1%82%
      >D0%B2%D0%B5%D0%BD%D0%BD%D0%BE%D1%81%D1%82%D0%B8.htm
      >
      >but the page is still displayed correctly.
      >
      >
      >However Xenu converts it to http://www.pickwicktea.com/RU/???????????
      >+???????????????.htm and reports a 404.
      >
      >When I scan this URL directly with Xenu however, I see initially that
      >the link gets a OK, but then, at level 1, the URL appears as:
      >http://www.pickwicktea.com/RU/%D0ž%D0%B3%D1€%D0%B0%D0%BD%D0%B8%D1‡%D0%
      >B5%D0%BD%D0%B8%D0%B5+%D0%BE%D1‚%D0%B2%D0%B5%D1‚%D1%81%D1‚%D0%B2%D0%B5%
      >D0%BD%D0%BD%D0%BE%D1%81%D1‚%D0%B8.htm
      >error code: 400 (no object data),
      >
      >
      >so this leads to the following questions:
      >
      >1. is it illegal to use non-ascii characters in URLs?
      >2. is it illegal to use non-Latin characters in URLs?
      >3. why can IE parse this URL, but Firefox or Xenu can't?
      >4. why does Xenu report an OK first, but a 400 next?
      >
      >This Russian site is built with MCMS, a Microsoft content management
      >system.
      >
      >Users can paste the names of channels (subfolders) directly into the
      >system, and they often use non-ASCII characters. I want to understand
      >if this is illegal or not.
    • Stephen Gazard
      Message 2 of 3 , Apr 20, 2007
        On Thu, 19 Apr 2007 09:11:39 -0000, frank visser wrote:

        >
        >I have a question about the use of non-ASCII or non-Latin characters
        >in URLs and how Xenu handles this.
        >
        >For example, on http://www.pickwicktea.com/RU/Home.htm there's a top
        >header link (the third from the left) that goes to this page:
        >http://www.pickwicktea.com/RU/Ограничение+ответственности.htm
        >
        >which uses non-Latin (Russian) characters in
        >the URL.

        The page is in Unicode (at least it uses the character set UTF-8). The
        proper method to deal with such characters (as far as I was aware) is to
        encode the items in hexadecimal since it then contains none of =?&, all
        of which are part of urls that are passing arguments to scripts on the
        server.

        >
        >IE displays this correctly;

        Firefox is doing what I said above. Internet Explorer is NOT being
        correct. It is simply showing the HTML entities for those characters in
        the correct character set in the url bar (wrong). It will almost
        certainly be passing on the correct translation to the server.

        If we take a simple English example of (excluding double quotes): "This
        is the way I walk my dog!?"
        The proper way to encode that as a URI is:
        "This+is+the+way+I+walk+my+dog%21%3F"

        If you really wanted to, the spaces can be converted to %20, giving
        (again excluding double quotes)
        "This%20is%20the%20way%20I%20walk%20my%20dog%21%3F"
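
For anyone who wants to reproduce the two encodings above, here is a minimal sketch using Python's standard `urllib.parse` module. The variable names are only illustrative. Note that `+` for a space is the form-style convention normally used in query strings, while `%20` is the generic percent-encoding.

```python
from urllib.parse import quote, quote_plus

text = "This is the way I walk my dog!?"

# Form-style encoding: spaces become "+", other reserved characters become %XX.
form_style = quote_plus(text)

# Generic percent-encoding: spaces become %20 instead of "+".
generic = quote(text)

print(form_style)  # This+is+the+way+I+walk+my+dog%21%3F
print(generic)     # This%20is%20the%20way%20I%20walk%20my%20dog%21%3F
```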

        >
        >In Firefox, this URL gets displayed in the browser address field as:
        >http://www.pickwicktea.com/RU/%D0%9E%D0%B3%D1%80%D0%B0%D0%BD%D0%B8%D1%
        >87%D0%B5%D0%BD%D0%B8%D0%B5+%D0%BE%D1%82%D0%B2%D0%B5%D1%82%D1%81%D1%82%
        >D0%B2%D0%B5%D0%BD%D0%BD%D0%BE%D1%81%D1%82%D0%B8.htm
        >
        >but the page is still displayed correctly.

        As I said above, this is correct and good behaviour.
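
As a rough check of what Firefox shows, the sketch below percent-encodes the Russian path as UTF-8 with Python's `urllib.parse.quote`. Keeping `/` and `+` unescaped via the `safe` argument is my assumption, chosen to match the address-bar string quoted above; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

path = "/RU/Ограничение+ответственности.htm"

# quote() encodes non-ASCII characters as UTF-8 bytes, each written as %XX.
# "/" and "+" are listed as safe so they stay literal, as in the quoted URL.
encoded = quote(path, safe="/+")
print(encoded)

# Decoding the escapes as UTF-8 gives back the original Cyrillic path.
print(unquote(encoded) == path)
```

Each Cyrillic letter takes two bytes in UTF-8, which is why every letter turns into a pair of `%XX` escapes.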

        >so this leads to the following questions:
        >
        >1. is it illegal to use non-ASCII characters in URLs?

        That's a matter of debate with international domain names and all
        because it's hard to implement properly and can lead to flaws. See
        http://www.theregister.co.uk/2005/02/07/browsers_idn_spoofing/ Ideally
        you would encode all urls to be hexadecimal.

        >2. is it illegal to use non-Latin characters in URLs?

        Not sure on that one

        >3. why can IE parse this URL, but Firefox or Xenu can't?

        IE and Firefox parse the URL; Firefox displays the correct URL in the
        address bar, while IE does not. Since Xenu runs off the Trident
        rendering engine built into Internet Explorer, it uses what IE gives it,
        which may be wrong. Please note I have no experience embedding the
        Trident engine at all.
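
One possible explanation for the question marks, offered purely as an assumption since I have not seen Xenu's source: if the link text is converted to a single-byte Western code page such as Windows-1252 somewhere along the way, every Cyrillic letter is unrepresentable and gets replaced by `?`. The sketch below simulates that conversion in Python; the choice of `cp1252` is a guess.

```python
name = "Ограничение+ответственности.htm"

# Assumed lossy step: force the text into Windows-1252, replacing every
# character that has no mapping in that code page with "?".
lossy = name.encode("cp1252", errors="replace").decode("cp1252")

print(lossy)  # ???????????+???????????????.htm
```

The result has 11 and 15 question marks around the `+`, the same shape as the URL that Xenu reported as a 404.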

        >4. why does Xenu report an OK first, but a 400 next?

        That's for Tilman to answer. It should be related to what he gets back
        from the Trident engine.
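
A hedged observation on the garbled level 1 URL: it differs from the Firefox form only where the escapes `%9E`, `%80`, `%87` and `%82` would be, and the characters in their place are what those bytes look like in Windows-1252. The sketch below shows that mapping. That Xenu actually decodes some escapes this way is my assumption, not something confirmed in this thread.

```python
from urllib.parse import unquote_to_bytes

# The four escapes that are missing from the garbled URL.
stray = unquote_to_bytes("%9E%80%87%82")

# Read as Windows-1252 they become the odd characters seen in the Xenu output.
garbled = stray.decode("cp1252")
print(garbled)

# Byte 0x81 has no Windows-1252 mapping in Python's codec, which would fit
# the fact that %81 is still escaped in the garbled URL.
try:
    bytes([0x81]).decode("cp1252")
    has_mapping = True
except UnicodeDecodeError:
    has_mapping = False
print(has_mapping)
```

If that is what happens, the request sent at level 1 no longer contains valid UTF-8 escapes, which might explain why the server answers with a 400.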