Loading ...
Sorry, an error occurred while loading the content.

Stripping Query Strings from URLs

Expand Messages
  • Cory Reed
    Does anyone know an easy way to strip query strings from URLs for analysis? We append query strings like ?campaign= and kw= to all of our campaign pages,
    Message 1 of 6 , Dec 4, 2007
    • 0 Attachment
      Does anyone know an easy way to strip query strings from URLs for
      analysis? We append query strings like "?campaign=" and "kw=" to all
      of our campaign pages, which is great for tracking the source of a
      lead. However, it gets hairy when trying to analyze an individual URL
      because there may be hundreds of variations on the same URL. Ideally,
      I'm looking for a quick SQL or MS Access command to parse this out.

      Thanks,

      Cory Reed
    • Anil Batra
      Something like this should work in SQL RIGHT ( character_expression , integer_expression ) LEFT ( character_expression , integer_expression ) SUBSTRING (
      Message 2 of 6 , Dec 5, 2007
      • 0 Attachment
        Something like this should work in SQL

        RIGHT ( character_expression , integer_expression )

        LEFT ( character_expression , integer_expression )

        SUBSTRING ( expression , start , length )

        PATINDEX ( '%pattern%' , expression )
        Contact me offline and I can help



        On Dec 4, 2007 12:10 PM, Cory Reed <creed@...> wrote:

        > Does anyone know an easy way to strip query strings from URLs for
        > analysis? We append query strings like "?campaign=" and "kw=" to all
        > of our campaign pages, which is great for tracking the source of a
        > lead. However, it gets hairy when trying to analyze an individual URL
        > because there may be hundreds of variations on the same URL. Ideally,
        > I'm looking for a quick SQL or MS Access command to parse this out.
        >
        > Thanks,
        >
        > Cory Reed
        >
        >
        >



        --
        Anil Batra
        http://webanalysis.blogspot.com
        Web Analytics, Behavioral Targeting and Online Advertising Blog


        [Non-text portions of this message have been removed]
      • Stephen Turner
        ... Sorry, I don t have a command for you, only a comment. What you really want to do is to remove some of the parameters but not others. Some parameters lead
        Message 3 of 6 , Dec 5, 2007
        • 0 Attachment
          --- In webanalytics@yahoogroups.com, "Cory Reed" <creed@...> wrote:
          >
          > Does anyone know an easy way to strip query strings from URLs for
          > analysis? We append query strings like "?campaign=" and "kw=" to all
          > of our campaign pages, which is great for tracking the source of a
          > lead. However, it gets hairy when trying to analyze an individual URL
          > because there may be hundreds of variations on the same URL. Ideally,
          > I'm looking for a quick SQL or MS Access command to parse this out.
          >

          Sorry, I don't have a command for you, only a comment. What you really
          want to do is to remove some of the parameters but not others. Some
          parameters lead to genuinely different pages, and you probably want to
          distinguish between them (think of /product.php?sku=14753 vs
          /product.php?sku=20144). Others, like your example, are just for
          tracking purposes and you want to ignore them for most types of analysis.

          Which analytics program are you using? You may find that it does this
          for you, or can be configured to do it.

          --
          Stephen Turner
          CTO, ClickTracks http://www.clicktracks.com/
        • Steve
          Ah. That was a nice bit of first thing in the morning fun/challenge. Thanks for that Cory! :-) This solution[1] uses an old command-line unix tool called
          Message 4 of 6 , Dec 5, 2007
          • 0 Attachment
            Ah. That was a nice bit of "first thing in the morning" fun/challenge.
            Thanks for that Cory! :-)

            This solution[1] uses an old command-line unix tool called 'sed': Stream EDitor.
            For those on Windows, you can get the CYGWIN package and run sed from
            there. I'm unsure of the capability of the Microsoft Unix Tools for
            Windows - the sed in there may not be as capable - if only by the more
            complex RegEx's used. Original sed has a very limited RegEx toolkit.


            $ sed -re 's/([?&])(PARAMNAME=[^& ]+([ &]|$))/\1/' LOGFILE

            Where 'PARAMNAME' is the name of the parameter you're trying to zot.
            ie. campaign= or kw=


            To break it apart for learning purposes:
            We're doing a search & replace operation over every line in a give LOGFILE.

            The search:
            ([?&])(PARAMNAME=[^& ]+([ &]|$))

            Looking for something starting with a ? or & - we want to hang on to
            this hence the brackets around it.
            ([?&])

            Next part, could probably remove it's enclosing brackets. No matter.
            Want the Parameter itself. It is follow by an equals sign '='
            Next we want a NOT & or Space: [^& ]+ One or more of these. Q; Is it
            possible to have a zero length value? Is so, replace the plus-sign
            with a star '*'.

            The next part finds either a space an Ampersand '&' or an end of line
            marker. Depends on how you're pulling the URL apart, this is the bit
            that may need additional work.


            The replace is a little tricky to the uninitiated. Assumptions are rife.
            \1

            Yeah right. Huh! :-)

            Recall we kept the initial leading ? or & ???
            This \1 states - insert our first grouping ie via (). \2 would be the
            2nd, \3 the third, and so on.

            What happened to the rest? This is the tricky part. As we didn't
            specify what it gets replaced with, it gets replaced with nothing. ie
            zotted as per request.

            BTW. Do be careful. I've only done ultra quick testing. It looks good
            to me, but I may have missed something. I accidentally clobbered one
            of the brackets when typing this email. So misteaks are possible. :-)

            HTH!

            Cheers!
            - Steve

            [1] Bound to be other solutions. This is the style of fast and easy
            thing I use all the time. YMMV.


            On Dec 5, 2007 7:10 AM, Cory Reed <creed@...> wrote:
            > Does anyone know an easy way to strip query strings from URLs for
            > analysis? We append query strings like "?campaign=" and "kw=" to all
            > of our campaign pages, which is great for tracking the source of a
            > lead. However, it gets hairy when trying to analyze an individual URL
            > because there may be hundreds of variations on the same URL. Ideally,
            > I'm looking for a quick SQL or MS Access command to parse this out.
          • Scott Zakrajsek
            Hi Cory- In SQL i would recommend something like: select LEFT(starting_url, INSTR(starting_url, ? )-1) as stripped_url; the INSTR function may not work w/ all
            Message 5 of 6 , Dec 5, 2007
            • 0 Attachment
              Hi Cory-

              In SQL i would recommend something like:

              select LEFT(starting_url, INSTR(starting_url, '?')-1) as stripped_url;

              the INSTR function may not work w/ all platforms. If that doesnt work,
              I'd try LOCATE, POSITION, or CHARINDEX instead.

              you can read up on them here:
              http://dev.mysql.com/doc/refman/5.0/en/string-functions.html
              <http://dev.mysql.com/doc/refman/5.0/en/string-functions.html>

              Scott


              --- In webanalytics@yahoogroups.com, "Cory Reed" <creed@...> wrote:
              >
              > Does anyone know an easy way to strip query strings from URLs for
              > analysis? We append query strings like "?campaign=" and "kw=" to all
              > of our campaign pages, which is great for tracking the source of a
              > lead. However, it gets hairy when trying to analyze an individual URL
              > because there may be hundreds of variations on the same URL. Ideally,
              > I'm looking for a quick SQL or MS Access command to parse this out.
              >
              > Thanks,
              >
              > Cory Reed
              >




              [Non-text portions of this message have been removed]
            • ss
              Cory, I have some advanced SQL data parsing code. feel free to email me directly and I can help. what you need is a combination of the following: a)
              Message 6 of 6 , Dec 5, 2007
              • 0 Attachment
                Cory,

                I have some advanced SQL data parsing code. feel free
                to email me directly and I can help.

                what you need is a combination of the following:

                a) charindex: to find the position of the character
                you are using as a delimiter e.g. I want to find the
                position of the first '/' and use it. In your case,
                there would be two, one being the "?" to denote the
                start of the query and "&" to denote the begin/end of
                a name-value pair.

                b) substring: to capture the value of the name and the
                value.

                Using the two above, you can really capture all the
                name value pairs. In fact, using charindex makes the
                query really dynamic (i.e. re-usable!)

                +Satnam


                ____________________________________________________________________________________
                Be a better friend, newshound, and
                know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
              Your message has been successfully submitted and would be delivered to recipients shortly.