Accommodate for (J|j)ava(s|S)cript in regex

  • frank visser
    Message 1 of 4 , Jan 16, 2005
      hi all,

      i am trying to upgrade the regex
      javascript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]

      for all cases of "javascript", "Javascript", "javaScript"
      and "JavaScript".

      as follows:

      (J|j)ava(s|S)cript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]

      but this causes "broken links" in Xenu to show up of the type:

      http://www.site.com/j
      http://www.site.com/subfolder/s

      what am i doing wrong?

      frank
    • Josh Goldman
      Message 2 of 4 , Jan 16, 2005
        Shouldn't that be square brackets, not parentheses? That is:

        [J|j]ava[s|S]cript: *[_a-zA-Z0-9]+ *\( *['"](/|ftp://|https?://)[^'"]+)['"]

        unquoted parentheses ( ) indicate the section of the string that you will be
        referencing with \1 or \2, where a square bracket is being used to group
        characters for | or.

        In the correct string, the first unquoted ( should be after the initial
        ['|"]. If you have an unquoted () before it, in this case "(J|j)", then Xenu
        will try to find the link using "J" rather than the actual http string since
        it is probably taking the result of the regular expression and getting the
        value of \1.
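
To see the group-numbering effect concretely, here is a minimal sketch using Python's `re` module. It assumes, as guessed above, that Xenu reads the URL from the first capture group; Xenu's own regex engine may behave differently. The sample link text and the `popup` function name are made up for illustration, and the URL part keeps the balanced parentheses from the first message.

```python
import re

# Hypothetical link text; "popup" is an invented function name.
sample = """JavaScript:popup('http://www.site.com/page.html')"""

# URL part of the pattern, with the parentheses from the first message.
URL_PART = r""" *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)[^'"]+)['"]"""

# Parentheses around J|j and s|S create groups 1 and 2,
# pushing the URL from group 1 to group 3.
with_parens = re.compile(r"(J|j)ava(s|S)cript:" + URL_PART)

# A character class matches one character without creating a group.
with_class = re.compile(r"[Jj]ava[Ss]cript:" + URL_PART)

m1 = with_parens.search(sample)
m2 = with_class.search(sample)

print(m1.group(1))  # J
print(m1.group(3))  # http://www.site.com/page.html
print(m2.group(1))  # http://www.site.com/page.html
```

If a tool takes group 1 as a relative link, a one-letter value like "J" would resolve to something like http://www.site.com/J, which resembles the broken links reported in the first message.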

        You also seem to have an extra parenthesis before the ftp. ['"]((/|ftp

        It's been a while since I've worked with regexp so it is possible that I am
        wrong, but here's my explanation of the regexp

        [J|j]ava[s|S]cript: *[_a-zA-Z0-9]+ *\( *['"](/|ftp://|https?://)[^'"]+)['"]

        match a string
        that starts with either J or j
        followed by ava
        then either s or S
        followed by cript:
        then 0 or more space characters
        then a function name consisting of 1 or more characters from the set _, a-z,
        A-Z, and 0-9
        then 0 or more space characters
        then the literal ( left parenthesis
        then 0 or more space characters
        then either ' or "
        the following string will be returned as \1
        either / or ftp:// or https:// or http:// (s? means 0 or 1 s)
        followed by one or more characters that can be anything except ' or "
        End of \1 string
        Followed by ' or "

        This regexp won't catch local file references, such as
        Javascript:Open("foo.html")
        You could possibly fix that by putting a ? after (/|ftp://|https?://)
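
As a rough check of the "?" suggestion, the sketch below makes the scheme group optional and tries it on a local and a remote reference. It uses Python's `re` module as a stand-in (Xenu's engine may differ), and both sample strings are invented for illustration.

```python
import re

# Scheme / leading-slash group made optional with "?", as suggested above.
pattern = re.compile(
    r"""[Jj]ava[Ss]cript: *[_a-zA-Z0-9]+ *\( *['"]((/|ftp://|https?://)?[^'"]+)['"]"""
)

# Invented sample strings: one local file, one absolute URL.
local = pattern.search('Javascript:Open("foo.html")')
remote = pattern.search("javascript:Open('https://www.site.com/a.html')")

print(local.group(1))   # foo.html
print(remote.group(1))  # https://www.site.com/a.html
```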

      • frank visser
        Message 3 of 4 , Jan 16, 2005
          hi josh,

          wish you were right, but no:
          http://www.regular-expressions.info/alternation.html

          never heard of pipe symbol used with [...].

          will try out your suggestion though.

          the reason i wanted to exclude ('foo.htm' as a match is that i wanted
          to avoid matching ('benchmarks', etc. but you are right, excluding it
          means i might miss items that do refer to a URL.

          will dig into that as well and let u know.

          frank

        • Joshua Goldman
          Message 4 of 4 , Jan 17, 2005
            Eugeny got what I was thinking of

            > do it like this
            > [Jj]ava[Ss]cript

            Square brackets mean "one of the set of characters" (or if it starts
            with ^ "any character but one of the set"). [J|j] means "either J or j
            or |", which would probably work in 99% of the cases since the string
            "|ava|cript:" is not going to occur too often.
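
A quick way to check this, sketched with Python's `re` module (character-class rules are the same in most regex flavors, though Xenu's engine is not tested here):

```python
import re

with_pipe = re.compile(r"[J|j]ava[s|S]cript:")
without_pipe = re.compile(r"[Jj]ava[Ss]cript:")

variants = ["javascript:", "Javascript:", "javaScript:", "JavaScript:"]

# Both forms accept all four capitalizations.
print(all(with_pipe.fullmatch(v) for v in variants))     # True
print(all(without_pipe.fullmatch(v) for v in variants))  # True

# Inside [...] the pipe is a literal character, so this also matches.
print(bool(with_pipe.fullmatch("|ava|cript:")))     # True
print(bool(without_pipe.fullmatch("|ava|cript:")))  # False
```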

            Thinking about trying to catch javascript:open("foo.html"). Problem
            is that anything that catches this will also catch any other
            Javascript that has a string for the first parameter. In my use of
            xenu, I didn't have this problem because I knew exactly what the
            javascript functions were.
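
One way to act on "knowing exactly what the functions are" is to list the function names in the pattern instead of accepting any identifier. The sketch below is only an illustration in Python's `re` module: the names `open` and `popup` and the `extract` helper are hypothetical, and the `(?:...)` non-capturing group is Python syntax that Xenu may or may not support.

```python
import re

# Hypothetical list of functions known to take a URL as first argument.
KNOWN_FUNCTIONS = ["open", "popup"]
names = "|".join(re.escape(n) for n in KNOWN_FUNCTIONS)

# (?:...) groups the names without capturing, so the URL stays in group 1.
pattern = re.compile(
    r"[Jj]ava[Ss]cript: *(?:" + names + r""") *\( *['"]([^'"]+)['"]"""
)

def extract(text):
    """Return the first string argument of a known function, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(extract('javascript:open("foo.html")'))     # foo.html
print(extract("javascript:track('benchmarks')"))  # None
```

The trade-off is that any link-opening function missing from the list is silently skipped.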

            Thanks for pointing out the link to the regexp page.
            http://www.regular-expressions.info/alternation.html

            fixed thanks to Eugeny
            [Jj]ava[sS]cript: *[_a-zA-Z0-9]+ *\( *['"](/|ftp://|https?://)[^'"]+)['"]