Re: Wanted: a tool that can scan a site to report on integrity of tracking code

  • Debbie Pascoe
    Message 1 of 12, Sep 21, 2006
      I checked with one of our Unix experts, who says that you would be
      able to use wget and grep to download a page and look for page tags
      and links and check the page tagging integrity, assuming that:

      - Your page tags are always in the same format (i.e. the code is not
      split across lines differently)
      - Your site only uses plain links, no JavaScript menus or onClick links
      - You can navigate your site without using flash
      - You can confidently assume that all of the other JavaScript in your
      pages will not cause the page tagging JavaScript to fail
      - Your site doesn't use <base> html elements
      - Your grep command handles <a> elements that are split over multiple
      lines, and multiple <a> elements per line
      - Your grep command handles other navigation elements, such as <frame>
      or <iframe>
      - Your site doesn't need a login or session cookies to be viewed.

      These are some conditions - not a complete list - that would keep this
      from being an optimal approach. The best alternative is to utilize a
      product that can handle all those conditions, regardless of the site's
      operating system, web server, content creation solution or
      methodology, or web analytics vendor.

      Debbie Pascoe
      MAXAMINE, Inc.


      >
      > If you are UNIX-literate (or have such folks available to you), a simple
      > 'wget/grep' command should be enough.
      >
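
      For reference, the sort of single-page check being debated would look
      roughly like this in Perl - a minimal sketch only, with LWP::Simple
      standing in for wget, a placeholder URL, and urchinTracker used purely
      as an example tag name (substitute whatever your vendor's tag looks
      like):

          use strict;
          use warnings;
          use LWP::Simple qw(get);

          my $url  = 'http://www.example.com/';     # placeholder URL
          my $html = get($url) or die "could not fetch $url\n";

          # crude presence check for a page tag; real tag patterns vary by vendor
          if ($html =~ /urchinTracker\s*\(/) {
              print "tag found on $url\n";
          } else {
              print "tag MISSING on $url\n";
          }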
    • Steve
      Message 2 of 12, Sep 21, 2006
        On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
        > I checked with one of our Unix experts, who says that you would be
        > able to use wget and grep to download a page and look for page tags
        > and links and check the page tagging integrity, assuming that:

        Well not quite... :-)


        > - Your page tags are always in the same format (i.e. the code is not
        > split across lines differently)

        Do it in perl, and slurp the entire file in as a "single line".
        Regexes to ignore line splits are trivial to write. Common even. A
        "while loop" to cycle thru the matches is a fairly common construct.
        :-)
        Even multi-megabyte HTML files would be easy this way. And perl is
        *designed* to do this.

        Or parse thru tidy first.
        Or something similar. Plenty of cool libraries on CPAN to check out.
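
        A minimal sketch of that slurp-and-match idea, assuming a page saved
        locally as page.html and using urchinTracker purely as an example tag
        name:

            use strict;
            use warnings;

            local $/;                          # slurp mode: no input record separator
            open my $fh, '<', 'page.html' or die "cannot open page.html: $!";
            my $html = <$fh>;                  # the whole file arrives as one string
            close $fh;

            # /s lets . cross newlines, so a tag split over several lines still matches
            while ($html =~ m{<script[^>]*>(.*?)</script>}sig) {
                my $script = $1;
                print "tracking call found\n" if $script =~ /urchinTracker\s*\(/;
            }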


        > - Your site only uses plain links, no JavaScript menus or onClick links

        Should be very easy to handle those. *Something* can interpret them
        programmatically (i.e. a web browser), so reversing that wouldn't be hard at
        all. It's not like these are undocumented standards. Look for the
        pattern(s). Match them.

        If you wanted to get super sophisticated, you could grab the
        javascript code libraries from the firefox codebase and use those to
        handle any and all javascript and html translation issues.
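
        As a rough illustration of "look for the pattern, match it" - this
        sketch only covers the common location-style onclick idiom, so treat
        it as a starting point rather than a general solution:

            use strict;
            use warnings;

            my $html = do { local $/; <STDIN> };   # slurp the page from stdin

            # catches onclick="window.location='...'" and onclick="location.href='...'"
            while ($html =~ /onclick\s*=\s*"[^"]*?location(?:\.href)?\s*=\s*'([^']+)'/sig) {
                print "javascript link: $1\n";
            }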


        > - You can navigate your site without using flash

        Never looked inside flash myself, so you could be very correct. :-)
        Tho the GNU flash program (Gnash) may have appropriate libs that could
        be borrowed or interfaced to, to do this. It's a lot harder, but still
        achievable. The joy of Open Source: you don't have to reinvent the
        wheel.


        > - You can confidently assume that all of the other JavaScript in your
        > pages will not cause the page tagging JavaScript to fail

        Well that's just normal debugging. But point taken and agreed.
        Additional: Your acceptance testing should be picking this up.
        Key word: "Should". :-)


        > - Your site doesn't use <base> html elements

        If a spidering tool can't handle the base tag, or any legitimate HTML
        tag, then the tool is broken. Submit bug report, get fix overnight.
        Problem solved. :-)
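
        (For what it's worth, resolving links against a <base> element is only
        a few lines with the URI module from CPAN; the URLs below are
        placeholders:)

            use strict;
            use warnings;
            use URI;

            my $page_url = 'http://www.example.com/section/index.html';
            my $html     = '<base href="http://www.example.com/other/"><a href="page.html">x</a>';

            my ($base) = $html =~ /<base\b[^>]*\bhref\s*=\s*["']([^"']+)["']/i;
            $base ||= $page_url;               # no <base>? fall back to the page's own URL
            print URI->new_abs('page.html', $base), "\n";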


        > - Your grep command handles <a> elements that are split over multiple
        > lines, and multiple <a> elements per line

        See above. Being able to split up multiple things from a single chunk
        of data is a common task. Perhaps grep is not the best tool, but awk,
        sed and perl certainly are more than capable.
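
        A rough sketch of that, slurping the page from stdin and deliberately
        tolerant of <a> tags split across lines or packed several to a line:

            use strict;
            use warnings;

            my $html = do { local $/; <STDIN> };   # one big string, line breaks and all

            # a single global match walks every href, however the <a> markup is wrapped
            while ($html =~ /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/sig) {
                print "$1\n";
            }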


        > - Your grep command handles other navigation elements, such as <frame>
        > or <iframe>

        Two-step it. Slurp the site, and glob all files in the resulting tree
        with find or some such.
        Or: I'm sure there are some simple spidering libraries on CPAN for
        libwww. I seem to recall I came across some when I wrote our internal
        Perl-based link checker a few years ago.
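
        A sketch of that spidering route using WWW::Mechanize from CPAN, which
        follows <a>, <frame> and <iframe> alike. Note it has no same-host or
        politeness limits, and the start URL and tag pattern are placeholders:

            use strict;
            use warnings;
            use WWW::Mechanize;

            my $mech  = WWW::Mechanize->new( autocheck => 0 );
            my @queue = ('http://www.example.com/');
            my %seen;

            while ( my $url = shift @queue ) {
                next if $seen{$url}++;
                $mech->get($url);
                next unless $mech->success and $mech->is_html;

                print "tag missing: $url\n"
                    unless $mech->content =~ /urchinTracker\s*\(/;

                # links() covers <a>, <area>, <frame> and <iframe>
                push @queue, map { $_->url_abs->as_string } $mech->links;
            }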


        > - Your site doesn't need a login or session cookies to be viewed.

        The advice is incorrect: wget can handle both. Or use curl, which is
        scarily sophisticated and powerful.
        From the wget man page:
        --user=user
        --password=password
        --keep-session-cookies

        All of which do pretty much what they say. :-)
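
        The same goes if you'd rather stay inside one perl program: LWP can
        carry a cookie jar and basic-auth credentials too. A sketch, with
        placeholder host, realm, login and URL:

            use strict;
            use warnings;
            use LWP::UserAgent;
            use HTTP::Cookies;

            my $ua = LWP::UserAgent->new;
            $ua->cookie_jar( HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ) );
            $ua->credentials( 'www.example.com:80', 'Members Only', 'someuser', 'somepass' );

            my $resp = $ua->get('http://www.example.com/private/');
            print $resp->is_success ? "fetched OK\n" : $resp->status_line . "\n";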


        > These are some conditions - not a complete list - that would keep this
        > from being an optimal approach. The best alternative is to utilize a
        > product that can handle all those conditions, regardless of the site's
        > operating system, web server, content creation solution or
        > methodology, or web analytics vendor.

        Sure. No real disagreement. But there are also heaps of trivial
        solutions that could be used to make life easier and avoid many of
        these issues.


        Aside: The big plus that Unix (as a collective) has over many other
        systems is the most amazing array of simple but highly powerful tools
        that can be easily glued together to do tasks that would be several
        days of effort in any programming language.


        Maybe wget and grep won't do it. But a combo of wget, find, sed, tidy,
        egrep, uniq, sort and wc may. It may not be perfect, but it may get
        you easily 70% of the way there. And 70% is a huge improvement on 0%.


        The flip side is that if the problems raised don't exist for a
        simple site, then they are not problems, and Tim's suggestion still
        stands. And if you ain't got the money, you ain't got the money. :-)


        I'm really quite tempted to accept the thrown gauntlet just to truly
        satisfy my own burning curiosity as to how hard or easy the problem
        actually is. Vs making an educated guess. Sounds like an interesting
        challenge to burn a few hours or so. And it's been a few months since
        I've done any serious Perl hacking. Hmmmm. And I could finally have a
        decent excuse to try out the perl pthread libraries. Have had lots of
        fun with those in C programs I've written. Hmmmmmmmmmmm....


        It's not too unreasonable to assume that javascript page tagging will
        become more sophisticated in Open Source analysis packages in the
        future. Awstats has a very simple one already. And with that
        increasing sophistication, the need for a matching solution to verify
        same becomes necessary too. A solution will follow as night surely
        follows day.


        I would argue not so much about how hard or easy the problem is, but
        rather about the additional value that Maxamine brings to solving
        the problems. Technical solutions are achievable; value add is harder.
        Marketing and sustaining that value add is something else again.


        >
        > Debbie Pascoe
        > MAXAMINE, Inc.
        >
        > >
        > > If you are UNIX-literate (or have such folks available to you), a simple
        > > 'wget/grep' command should be enough.
        > >


        Cheers!


        - Steve, Unix Guru.

        Tho I believe my actual position title, for what little meaning or
        even relevance a position title holds, is: "Senior Unix Systems
        Administrator".

        Guru sums it up nicely and clears away the clutter. :-)
      • Stephane Hamel
        Message 3 of 12, Sep 22, 2006
          Hi Lesley,
          you might want to check my post at
          http://shamel.blogspot.com/2006/09/web-analytics-solution-profiler-wasp.html
          where I list a couple of available solutions, but more interestingly,
          I expose my Web Analytics Solution Profiler (WASP) idea.

          Comments and suggestions from this group are welcome!

          Stephane

          --- In webanalytics@yahoogroups.com, "hunter_analytics"
          <hunter_analytics@...> wrote:
          >
          > Hi
          >
          > Does anyone know of or use a tool that is able to scan a site to
          > report on the integrity of the tracking page tags? Whether the code
          > is missing, rendered incorrectly etc.
          >
          > Any hints or pointers in the right direction would be much appreciated.
          >
          > Regards
          >
          > Lesley
          >
        • Debbie Pascoe
          Message 4 of 12, Sep 25, 2006
            Steve,

            It was great fun reading your point-by-point response, and I can tell
            that you enjoyed the exercise :-)

            Your last observation is the crucial thing. In this exchange, we have
            discussed solving one particular site structure problem (determining
            whether the tag implementation is correct and complete).

            Site owners have to deal with many other issues, like checking for
            site defects, evaluating their privacy implementations, determining if
            blind people can interact with their site (ref. the current lawsuit
            against Target by the National Federation of the Blind), being able to
            respond at a moment's notice if they drop a product or terminate an
            exec and need to remove all references, and more recently, monitoring
            their employee-written blogs to be sure authors and respondents are
            staying within bounds. These are just a few issues that we see every
            day - there are lots of others I didn't mention.

            Websites are growing increasingly large, complex and dynamic,
            compounded by an increasing number of user-contributors. Both the
            challenge and the importance of maintaining the quality of website
            implementations are on the rise. We are seeing this clearly within
            companies that seek us out.

            The added value that you so correctly point out is in treating these
            issues as multiple facets of the same problem, and addressing them all
            with one high performance site analytics solution.

            Debbie Pascoe
            MAXAMINE, Inc.


            --- In webanalytics@yahoogroups.com, Steve <nuilvows@...> wrote:
            >
            > On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
            > > I checked with one of our Unix experts, who says that you would be
            > > able to use wget and grep to download a page and look for page tags
            > > and links and check the page tagging integrity, assuming that:
            >
            > Well not quite... :-)
            .
            .
            .
            .
            .
            > I would argue not so much about how hard or easy the problem is, but
            > rather about the additional value that Maxamine brings to solving
            > the problems. Technical solutions are achievable; value add is harder.
            > Marketing and sustaining that value add is something else again.

            > - Steve, Unix Guru.
            >
            > Tho I believe my actual position title, for what little meaning or
            > even relevance a position title holds, is: "Senior Unix Systems
            > Administrator".
            >
            > Guru sums it up nicely and clears away the clutter. :-)
            >