
RE: [xenu-usergroup] checking links in javascript - different regex results

  • Frank Visser
    Message 1 of 35 , Oct 7, 2004
      Hi Eugeny,

      Thanks a million, I will try it out today.

      I want to understand how the regex fits into the Xenu code. Doesn't Tilman's
      regex require three parts?

      1. the first part of the URL, minus the extension
      2. the extension
      3. the ? with parameters (optional)

      Perhaps his code combines these three parts into the full URL, which is then
      spidered.

      A new regex might require an update of the Xenu code. Tilman, is that
      correct? (Or am I seeing things?)

      frank



      _____

      From: Eugeny.Sattler@... [mailto:Eugeny.Sattler@...]
      Sent: Thursday, 7 October 2004 12:13
      To: xenu-usergroup@yahoogroups.com
      Subject: [xenu-usergroup] checking links in javascript - different regex results



      > using your regex from an earlier mail:

      >
      > javascript:\w+\s*\(\s*['"]((?:ftp|https?)://[^'"]+?)['"](?:\s*,[^,]+?\s*)*\s*\);
      > xenu crashes after a while.
      There are different regex libraries, each with its own peculiarities in
      regex syntax.
      The PowerGREP author claims that his regex flavour is fully Perl compatible,
      and Perl is considered a kind of industry standard for regex syntax.
      I tested my regex in PowerGREP and it catches all variants correctly: URLs
      with or without file extensions, URLs ending with a top-level domain, and
      URLs ending with GET parameters like
      http://something.com?param_name=param_value

      I suspect the regex library Tilman uses does not support all Perl features.

      So let's simplify our regex. It must use only basic regex syntax, so that we
      can be sure Xenu can process it.

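      1) What is \w ?
      It matches either a letter, a digit, or an underscore.
      So it is just a shorthand for [_a-zA-Z0-9], or for [_a-z0-9] if matching is
      case-insensitive.
      So you can replace "\w" with "[_a-z0-9]" (without quotes) inside my regex,
      provided Xenu matches case-insensitively; otherwise use "[_a-zA-Z0-9]".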
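
To illustrate the substitution, here is a minimal sketch in Python. Python's `re` module is only a stand-in: which regex library Xenu uses is not known here, and the sample function names are made up. It shows that the lowercase-only class is equivalent to \w only when matching ignores case.

```python
import re

# Python's re module stands in for the regex engine; Xenu's engine may behave differently.
# re.ASCII restricts \w to [A-Za-z0-9_], the classic meaning discussed above.
shorthand = re.compile(r"^\w+$", re.ASCII)
lowercase_class = re.compile(r"^[_a-z0-9]+$")
ignorecase_class = re.compile(r"^[_a-z0-9]+$", re.IGNORECASE)
explicit_class = re.compile(r"^[_A-Za-z0-9]+$")

# Hypothetical JavaScript function names, as they might follow "javascript:".
samples = ["popup", "open_win2", "openWin"]

for name in samples:
    print(
        name,
        bool(shorthand.match(name)),
        bool(lowercase_class.match(name)),
        bool(ignorecase_class.match(name)),
        bool(explicit_class.match(name)),
    )
```

Only the mixed-case name separates the variants: the lowercase-only class rejects it unless the match is case-insensitive.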

      2) As for \s ...
      While \s matches not only spaces but also tabs and line breaks, I meant it
      here to match spaces only.
      So you can replace all occurrences of "\s" with " " (without quotes) inside
      my regex.
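
What is given up by this replacement can be sketched in Python (again only a stand-in for whatever engine Xenu uses; the `popup` call is an invented sample). A call formatted with plain spaces matches either way, while one formatted with a tab or a line break matches only the \s version.

```python
import re

# \s accepts spaces, tabs, and line breaks; a literal space accepts spaces only.
with_s = re.compile(r"popup\s*\(\s*'")
with_space = re.compile(r"popup *\( *'")

spaced_call = "popup ( 'a.html')"       # plain spaces only
tabbed_call = "popup\t(\n'a.html')"     # tab and line break

for call in (spaced_call, tabbed_call):
    print(repr(call), bool(with_s.search(call)), bool(with_space.search(call)))
```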

      3) Remove "(?:ftp|https?)://" part - we want relative URLs to be matched,
      too.

      4) The regex library Tilman uses might not support non-capturing parentheses.
      So we have to remove the "(?:\s*,[^,]+?\s*)*" part.
      This part matches ,'something' repeated zero or more times.
      So it matches: ",'something','something','something','something'"
      And if there are no height and width parameters in the javascript call, that
      is also OK for the regex, because this part of the match is optional: the
      asterisk after the closing parenthesis allows zero repetitions.
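
A small Python sketch of how that optional group behaves (Python's `re` is a stand-in engine, and the `open(...)` calls are invented samples). The group is non-capturing, so the URL stays in group 1, and the `*` after the group lets it match zero times.

```python
import re

# The parameter part from the original regex: a comma plus one argument, repeated.
EXTRA_ARGS = r"""(?:\s*,[^,]+?\s*)*"""

pattern = re.compile(r"""open\(['"]([^'"]+?)['"]""" + EXTRA_ARGS + r"""\s*\);""")

no_extra = "open('a.html');"
two_extra = "open('a.html', 400, 300);"

for call in (no_extra, two_extra):
    m = pattern.search(call)
    print(call, "->", m.group(1) if m else None)

# (?: ... ) does not create a numbered group, so there is only one group: the URL.
print("capturing groups:", pattern.groups)
```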

      So we decided to remove this part. Having said "A" we have to say "B": we
      have to remove the "\s*\);" part as well. I hope you understand why: with no
      regex part left to match the extra parameters, the ");" tail cannot match
      whenever a call has them, making the whole regex fail. We do not want that.
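
The effect of keeping the tail can be checked with a short Python sketch (a stand-in engine; the `popup` calls are made-up samples, and case-insensitive matching is an assumption). With the parameter part gone, a pattern that still ends in ");" finds only the one-argument call.

```python
import re

# Simplified pattern with the ");" tail kept, and with it removed.
with_tail = re.compile(
    r"""javascript:[_a-z0-9]+ *\( *['"]([^'"]+)['"] *\);""", re.IGNORECASE
)
without_tail = re.compile(
    r"""javascript:[_a-z0-9]+ *\( *['"]([^'"]+)['"]""", re.IGNORECASE
)

one_arg = "javascript:popup('a.html');"
three_args = "javascript:popup('a.html', 400, 300);"

for call in (one_arg, three_args):
    kept = with_tail.search(call)
    dropped = without_tail.search(call)
    print(call, "| tail kept:", bool(kept), "| tail dropped:", dropped.group(1))
```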

      5) The regex library Tilman uses might not support lazy repetition. So we will
      make our repetitions greedy. You know where to read about the meaning of these
      words.
      So we repeat regex tokens not with "+?" but with "+".
      Luckily this does not change what the regex matches here, because the
      character class [^'"] cannot run past the closing quote.
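
Why greedy and lazy give the same result here can be sketched in Python (a stand-in engine; the sample call is invented). A negated class stops at the first quote either way, whereas with a dot the two modes would differ.

```python
import re

text = "popup('a.html', 'b.html')"

# Negated character class: greedy and lazy find the same strings.
lazy_class = re.compile(r"""['"]([^'"]+?)['"]""")
greedy_class = re.compile(r"""['"]([^'"]+)['"]""")
print(lazy_class.findall(text))
print(greedy_class.findall(text))

# With a dot instead, greedy repetition runs on to the last quote.
lazy_dot = re.search(r"'(.+?)'", text).group(1)
greedy_dot = re.search(r"'(.+)'", text).group(1)
print(lazy_dot)
print(greedy_dot)
```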

      Finally we get:

      javascript:[_a-z0-9]+ *\( *['"]([^'"]+)['"]

      In plain English it is:
      the word "javascript", followed by a colon, followed by a word consisting of
      letters from A to Z and/or digits and/or underscores, followed by
      zero or more spaces, followed by an opening parenthesis, followed by zero or
      more spaces, followed by a single or a double quote, followed by a sequence of
      one or more characters that are neither single nor double quotes, followed by
      a single or a double quote.
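
As a rough check of the simplified regex, here is a minimal Python sketch. It assumes case-insensitive matching, uses Python's `re` rather than whatever library Xenu uses, and the HTML snippet with its function names and URLs is invented for illustration.

```python
import re

# The simplified regex from above; IGNORECASE is an assumption so that
# mixed-case function names such as openWin are accepted by [_a-z0-9].
JS_LINK = re.compile(
    r"""javascript:[_a-z0-9]+ *\( *['"]([^'"]+)['"]""", re.IGNORECASE
)

# Hypothetical page fragment with absolute, relative, and query-string URLs.
html = """
<a href="javascript:openWin('http://example.com/page.html', 400, 300);">absolute</a>
<a href="javascript:popup( 'docs/help.htm' )">relative</a>
<a href="javascript:show('http://something.com?param_name=param_value')">query</a>
<a href="page2.html">plain link, not a javascript: call</a>
"""

urls = JS_LINK.findall(html)
for url in urls:
    print(url)
```

The plain link on the last line is not matched by this pattern, since the pattern only looks at javascript: calls.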

      We have trimmed the regex, and so we have made it less accurate. But if you
      are sure you feed only valid HTML to Xenu, this is no problem.

      > when i validate your regex with the regex tester found at
      > www.forta.com, it nicely captured a URL such as
      Nice to hear. It proves again that I have written my initial regex in a
      standard way.

      > Let me cut this discussion short: did you test your regex with xenu 1.2g
      > beta?
      No. I temporarily have no live internet access, but only email.
      As soon as it is fixed, I will test and post my results.

      Regards,
      Eugeny





    • Tilman Hausherr
      Message 35 of 35 , Sep 18, 2010
        Did anyone try to run Xenu with Sandboxie? Does it work? I'm thinking
        about mentioning it on my web page so that paranoid people can use Xenu
        too :)

        Tilman