
Re: Need help figuring out special RegExpr rule

  • flo.gehrke
    Message 1 of 8, May 28, 2012
      --- In ntb-clips@yahoogroups.com, "Wizcrafts" <wizcrafts@...> wrote:
      >
      > I am stumped trying to create a regular expression filter that
      > will match all of the following conditions. It is part of a larger
      > spam filter.
      >
      > Conditions:
      >
      > 1: There is a continuous group consisting of exactly 8
      > alphanumeric characters, all standard ASCII. Inside this group
      > are the following conditions:
      >
      > 2: There are at least 2 uppercase letters
      >
      > 3: There are at least 2 lowercase letters
      >
      > 4: there is at least one number between 0-9 (usually 1 or 2)
      >
      > 5: There are no other characters of any kind (no spaces,
      > punctuation, etc)
      >
      > Here are 3 examples of the group I am trying to match:
      >
      > tR4hGGUK
      > 2UJeiy9m
      > WbNDSk9e
      >

      I doubt that a RegEx fulfilling all your five spam-conditions could reliably distinguish between valid (non-spam) and invalid (spam) strings here in ANY case.

      As you wrote, 'Mytrip12' is a valid string. But as soon as it is written 'MyTrip12' it will be regarded as spam according to your rule #2 (at least two uppercase letters).

      Nevertheless, you might want to give the following RegEx a try:

      \b(?=.*([[:upper:]][[:alnum:]]*[[:upper:]])+.*)(?=.*([[:lower:]][[:alnum:]]*[[:lower:]])+.*)(?=.*\d(\D*\d)+.*)(?!.*[\x20\pP].*).{8}\b

      Tested against...

      tR4hGGUK
      2UJeiy9m
      WbNDSk9e
      Mytrip12
      My.tip12

      it will match lines #1, #2, and #3. Lines #4 and #5 won't be matched because each of them fails at least one of the five conditions.

      Regards,
      Flo
    • Axel Berger
      Message 2 of 8, May 28, 2012
        "flo.gehrke" wrote:
        > Nevertheless, you might want to give the following RegEx a try:

        You're a true artist. I would have looped a ^!Find and tested the
        ^$GetSelection$ in several steps.

        Axel
      • flo.gehrke
        Message 3 of 8, May 28, 2012
          --- In ntb-clips@yahoogroups.com, Axel Berger <Axel-Berger@...> wrote:
          >
          > I would have looped a ^!Find and tested the
          > ^$GetSelection$ in several steps.
          >

          Hi Axel,

          IMHO, this wouldn't help. The point is to match all five criteria at the same position in the sense of a Boolean AND, not an OR (as achieved with several ^!Find). So the only way is to concatenate a series of assertions that test the same string.
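
To see the AND-versus-OR point outside NoteTab, here is a minimal sketch in Python. It is not the pattern posted earlier in the thread: Python's built-in re module has no POSIX bracket classes or \pP, so plain ASCII ranges are assumed instead, and the lookaheads are restricted to alphanumerics so they cannot scan past the group. The names `all_of` and `any_of` are made up for illustration.

```python
import re

A = "[A-Za-z0-9]"  # assumption: ASCII alphanumerics stand in for [[:alnum:]]

# One assertion per condition; each one looks ahead from the same start position.
two_upper = f"(?={A}*[A-Z]{A}*[A-Z])"
two_lower = f"(?={A}*[a-z]{A}*[a-z])"
one_digit = f"(?={A}*[0-9])"

# AND: all three assertions must succeed before the 8 characters are consumed.
all_of = re.compile(rf"\b{two_upper}{two_lower}{one_digit}{A}{{8}}\b")

# OR: alternation accepts a candidate as soon as any single assertion succeeds.
any_of = re.compile(rf"\b(?:{two_upper}|{two_lower}|{one_digit}){A}{{8}}\b")

samples = ["tR4hGGUK", "2UJeiy9m", "WbNDSk9e", "Mytrip12", "My.tip12"]
and_hits = [s for s in samples if all_of.search(s)]
or_hits = [s for s in samples if any_of.search(s)]
print(and_hits)
print(or_hits)
```

With the concatenated assertions only the three spam samples are kept, while the alternation also accepts 'Mytrip12' because two lowercase letters alone are enough for it.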

          However, I think this whole approach is probably on the wrong track. Detecting spam needs much more than analyzing URLs...

          Regards,
          Flo
        • Axel Berger
          Message 4 of 8, May 28, 2012
            "flo.gehrke" wrote:
            > The point is to match all five criteria at the same position in
            > the sense of a Boolean AND, not an OR (as achieved with several
            > ^!Find).

            Quite, but that's not what I said.

            ^!Find the URL and highlight only the eight character group with () and
            R1.
            Does ^$GetSelection$ contain two capitals?
            Does it contain 2 lower case letters?
            Does it contain a digit?
            (Other characters were dealt with in the find)

            Three times yes: it's spam
            No to any of the above: No spam, ^!Goto loop

            As I said before, I'm a simple man and like primitive solutions, but
            this step by step would fulfil the conditions. And of course being so
            primitive makes it very easy to come back to later and modify, if
            needed.
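
The step-by-step idea can be sketched in Python rather than as a NoteTab clip. This is only an illustration under assumptions: the URL shape is taken from the filter quoted in this thread (with the hyphen moved to the end of the character class), and the names `URL_RE`, `looks_random` and `find_spam_groups` are hypothetical.

```python
import re

# Step 1: find the URL and capture only the eight-character group.
# The capture group already rules out any non-alphanumeric characters.
URL_RE = re.compile(r"http://[a-z0-9_.-]+\.[a-z]{2,4}/([A-Za-z0-9]{8})/index\.html")

def looks_random(group):
    """Steps 2-4: three separate yes/no questions about the captured group."""
    two_capitals = sum(c.isupper() for c in group) >= 2
    two_lowercase = sum(c.islower() for c in group) >= 2
    has_digit = any(c.isdigit() for c in group)
    return two_capitals and two_lowercase and has_digit

def find_spam_groups(text):
    """Loop over every URL hit and keep the groups that answer yes three times."""
    return [m.group(1) for m in URL_RE.finditer(text) if looks_random(m.group(1))]

text = (
    "http://somedomain.comx/tR4hGGUK/index.html\n"
    "http://somedomain.comx/Mytrip12/index.html\n"
)
hits = find_spam_groups(text)
print(hits)
```

Each check sits on its own line, so a threshold can be changed later without touching the search pattern.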

            Axel
          • Don Daugherty
            Message 5 of 8, May 7, 2013
              On 5/26/2012 12:43 AM, Wizcrafts wrote:
              > I am stumped trying to create a regular expression filter that will match all of the following conditions. It is part of a larger spam filter.
              >
              > Conditions:
              >
              > 1: There is a continuous group consisting of exactly 8 alphanumeric characters, all standard ASCII. Inside this group are the following conditions:
              >
              > 2: There are at least 2 uppercase letters
              >
              > 3: There are at least 2 lowercase letters
              >
              > 4: there is at least one number between 0-9 (usually 1 or 2)
              >
              > 5: There are no other characters of any kind (no spaces, punctuation, etc)
              >
              > Here are 3 examples of the group I am trying to match:
              >
              > tR4hGGUK
              > 2UJeiy9m
              > WbNDSk9e
              >
              > These groups of 8 are all different from one another and are used in spam runs leading to the BlackHole Exploit Kit. I need a specific Regular Expression that will detect this type of mixed case alphanumeric directory name and trigger my filter.
              >
              > Right now I am using the simplistic condition: (?i)[a-zA-Z0-9]{8}
              >
              > The group rests inside the following filter:
              >
              > "http://[a-z0-9-_.]+\.[a-z]{2,4}/(?i)[a-zA-Z0-9]{8}/index\.html"
              >
              > The only switch I can use is the case switch: (?i)
              >
              > The trailing letter switches appended in NoteTab REs will not work in the program I write the filters for (no RAWS, etc).
              >
              > Unfortunately, my current rule would also match Mytrip12, which is nothing like the gibberish characters shown above. In all of the URLs I have analyzed, they always have a mix of upper and lower case letters and one or two numbers, probably formed by a random character generator. I have not seen a recognizable word yet.
              >
              > Thanks in advance for any help.
              >
              >
              I recently came across your message from almost a year ago. Did you get
              a solution?
            • Alec Burgess
              Message 6 of 8, May 7, 2013
                This turns out to require a neat use of positive lookaheads.
                Sometime in the last year I gained the "AHA" insight (from Flo? or
                perhaps the RegexBuddy forum?) that a positive lookahead can precede
                the desired match rather than follow it, and can be used to discard
                potential matches that a less restrictive pattern would accept.

                Here's the regexp to identify matching 8 character groups (on multiple
                lines for readability):
                \b
                (?=.{0,8}[a-z].{0,8}[a-z]) # within 2 runs of up to 8 characters insist on 2 lowercase else NOMATCH
                (?=.{0,8}[A-Z].{0,8}[A-Z]) # ditto for uppercase else NOMATCH
                (?=.{0,8}[0-9]) # within 1 run of 8 characters insist on 1 digit (more than 1 is still allowed but not mandatory) else NOMATCH
                ([a-zA-Z0-9]{8}) # finally match the 8 upper/lower/digit combo subject to preceding checks.
                \b

                Astute readers will notice this is not quite perfect, e.g. the second
                lowercase letter could actually be matched in a second word.

                On one line suitable for use in a clip:
                \b(?=.{0,8}[a-z].{0,8}[a-z])(?=.{0,8}[A-Z].{0,8}[A-Z])(?=.{0,8}[0-9])([a-zA-Z0-9]{8})\b
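
The imperfection mentioned above can be reproduced with a short Python sketch. The one-line pattern is copied as posted, since it only uses syntax that Python's re module accepts. The input string and the `bounded` variant are made up for illustration; the variant simply assumes that letting the lookaheads scan alphanumerics only is an acceptable tightening.

```python
import re

# The one-line pattern, copied as posted.
posted = r"\b(?=.{0,8}[a-z].{0,8}[a-z])(?=.{0,8}[A-Z].{0,8}[A-Z])(?=.{0,8}[0-9])([a-zA-Z0-9]{8})\b"

# Hypothetical tightening: the lookaheads may only cross alphanumerics,
# so they cannot reach into a following word.
bounded = posted.replace(".{0,8}", "[a-zA-Z0-9]{0,8}")

# Made-up input: the 8-character group has only ONE lowercase letter,
# but a second lowercase letter appears in the next word.
text = "aAB12345 xyz"

posted_hit = re.search(posted, text)
bounded_hit = re.search(bounded, text)
print(posted_hit.group(1) if posted_hit else None)
print(bounded_hit.group(1) if bounded_hit else None)
```

The posted pattern reports 'aAB12345' because the 'x' of the second word satisfies the lowercase lookahead, while the bounded variant rejects it and still accepts a sample such as 'tR4hGGUK'.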

                And the same thing within the wrapping that requires the match to be
                within a URL, as Wizcrafts requested (probably a long line):
                "http://[a-z0-9-_.]+\.[a-z]{2,4}/(?=.{0,8}[a-z].{0,8}[a-z])(?=.{0,8}[A-Z].{0,8}[A-Z])(?=.{0,8}[0-9])([a-zA-Z0-9]{8})/index\.html"

                This matches: --> "http://somedomain.comx/aaAA112b/index.html"
                but not ------------> "http://somedomain.comx/Mytrip12/index.html"

                Note \.[a-z]{2,4} to match .com, .uk, .net etc isn't quite correct since
                it misses .on.ca etc but could be tweaked if needed.

              • flo.gehrke
                Message 7 of 8, May 10, 2013
                  --- In ntb-clips@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                  >
                  > On one line suitable for use in a clip:
                  > \b(?=.{0,8}[a-z].{0,8}[a-z])(?=.{0,8}[A-Z].{0,8}[A-Z])(?=.{0,8}[0-9])([a-zA-Z0-9]{8})\b

                  Alec,

                  That's an interesting alternative to the pattern I posted with...

                  http://tech.groups.yahoo.com/group/ntb-clips/message/22759

                  on May 28, 2012.

                  However, it causes some backtracking that you may want to avoid. Tested with Regex Coach against 'tR4HGGUK', your pattern needs 120 steps to achieve a match.

                  My proposal of May 2012 -- without Posix character classes and changed a bit...

                  \b(?=.*?[a-z].*?[a-z])(?=.*?[A-Z].*?[A-Z])(?=[^\d]*\d)[a-zA-Z0-9]{8}\b

                  needs only 45 steps in this case. This could also be written as...

                  \b(?=.*?([[:lower:]]).*?(?1))(?=.*?([[:upper:]]).*?(?2))(?=[^\d]*\d.*$)[[:alnum:]]{8}\b

                  For checking...

                  > This matches: --> "http://somedomain.comx/aaAA112b/index.html"
                  > but not ------------> "http://somedomain.comx/Mytrip12/index.html"

                  we would have to use...

                  http://somedomain.comx/(?=.*?([[:lower:]]).*?(?1))(?=.*?([[:upper:]]).*?(?2))(?=[^\d]*\d.*$)[[:alnum:]]{8}/index.html

                  (All patterns in one long line -- no line breaks!)
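
For anyone checking the URL form outside NoteTab, here is a rough Python port. Python's built-in re module does not accept [[:lower:]] or (?1), so the sketch assumes the ASCII-range version of the pattern and escapes the dots in the literal URL parts. The third URL is a made-up case added for illustration, and the names `checks` and `url_re` are hypothetical.

```python
import re

# ASCII-range version of the three assertions (no POSIX classes, no subroutine calls).
checks = r"(?=.*?[a-z].*?[a-z])(?=.*?[A-Z].*?[A-Z])(?=[^\d]*\d)"

# Literal URL parts from the example, with the dots escaped.
url_re = re.compile(
    r"http://somedomain\.comx/" + checks + r"[a-zA-Z0-9]{8}/index\.html"
)

urls = [
    "http://somedomain.comx/aaAA112b/index.html",
    "http://somedomain.comx/Mytrip12/index.html",
    "http://somedomain.comx/ABCDEF12/index.html",  # made-up: no lowercase in the group
]
results = {u: bool(url_re.search(u)) for u in urls}
for u, matched in results.items():
    print(matched, u)
```

The first two URLs behave as described in the thread. The third is also accepted in this port, because `.*?` can run past the slash and find its lowercase letters in "index.html"; replacing `.` with `[a-zA-Z0-9]` inside the lookaheads would probably confine the checks to the eight-character group.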

                  Regards,
                  Flo