Re: [webanalytics] Re: Wanted: a tool that can scan a site to report on integrity of tracking code

  • Steve
    Message 1 of 12, Sep 21, 2006
      On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
      > I checked with one of our Unix experts, who says that you would be
      > able to use wget and grep to download a page and look for page tags
      > and links and check the page tagging integrity, assuming that:

      Well not quite... :-)


      > - Your page tags are always in the same format (i.e. the code is not
      > split across lines differently)

      Do it in Perl, and slurp the entire file in as a "single line".
      Regexes that ignore line splits are trivial to write. Common, even. A
      "while loop" to cycle thru the matches is a fairly common construct.
      :-)
      Even multi-megabyte HTML files would be easy this way. And Perl is
      *designed* to do this.
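      For the curious, a minimal sketch of the slurp-and-match idea. The
      "page_tag.js" pattern is just a placeholder for whatever tracking
      script you're actually hunting for:

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Slurp: read the whole file as one string, so tags split
          # across lines still match.
          open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
          my $html = do { local $/; <$fh> };
          close $fh;

          # /s lets . span newlines; /g cycles thru every match.
          while ($html =~ m{<script[^>]*src\s*=\s*["']([^"']*page_tag[^"']*)["']}gis) {
              print "found tracking tag: $1\n";
          }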

      Or parse thru tidy first.
      Or something similar. Plenty of cool libraries on CPAN to checkout.


      > - Your site only uses plain links, no JavaScript menus or onClick links

      Should be very easy to handle those. *Something* can interpret them
      programmatically (i.e. a web browser), so reversing that wouldn't be
      hard at all. It's not like these are undocumented standards. Look for
      the pattern(s). Match them.
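      A rough sketch of that kind of matching, reusing the slurped $html
      from above. The onclick idiom here is just one common example; a
      real site would need a few more patterns:

          # Grab the first single-quoted string inside each onclick
          # attribute, e.g. onclick="window.location='/foo.html'".
          while ($html =~ /onclick\s*=\s*"[^"]*?'([^']+)'/gis) {
              print "possible JS link target: $1\n";
          }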

      If you wanted to get super sophisticated, you could grab the
      JavaScript libraries from the Firefox codebase and use those to
      handle any and all JavaScript and HTML translation issues.


      > - You can navigate your site without using flash

      Never looked inside Flash myself, so you could be very correct. :-)
      Tho the GNU Flash player (Gnash) may have appropriate libs that could
      be borrowed or interfaced with to do this. It's a lot harder, but
      still achievable. The joy of Open Source: you don't have to reinvent
      the wheel.


      > - You can confidently assume that all of the other JavaScript in your
      > pages will not cause the page tagging JavaScript to fail

      Well that's just normal debugging. But point taken and agreed.
      Additional: Your acceptance testing should be picking this up.
      Key word: "Should". :-)


      > - Your site doesn't use <base> html elements

      If a spidering tool can't handle the base tag, or any legitimate HTML
      tag, then the tool is broken. Submit a bug report, get a fix
      overnight. Problem solved. :-)
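      For what it's worth, CPAN's URI module makes honouring <base> a
      two-liner. A sketch, with example.com standing in for the real site:

          use URI;

          # Pick up a <base href="..."> if the page declares one,
          # otherwise fall back to the URL the page was fetched from.
          my ($base) = $html =~ /<base[^>]+href\s*=\s*["']([^"']+)["']/is;
          $base ||= 'http://www.example.com/section/';

          # Resolve a relative link against it.
          my $abs = URI->new_abs('../products/index.html', $base);
          print "$abs\n";   # http://www.example.com/products/index.html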


      > - Your grep command handles <a> elements that are split over multiple
      > lines, and multiple <a> elements per line

      See above. Being able to split up multiple things from a single chunk
      of data is a common task. Perhaps grep is not the best tool, but awk,
      sed and Perl certainly are more than capable.
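      For instance, a throwaway one-liner in that spirit: -0777 slurps the
      whole file so split <a> tags don't matter, and the while/g pair pulls
      out every href however many are packed onto one line (\x27 is just a
      single quote, escaped for the shell):

          perl -0777 -ne 'print "$1\n" while /<a\b[^>]*href\s*=\s*["\x27]([^"\x27]+)/gis' page.html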


      > - Your grep command handles other navigation elements, such as <frame>
      > or <iframe>

      Two-step it. Slurp the site, and glob all files in the resulting tree
      with find or some such.
      Or: I'm sure there are some simple spidering libraries on CPAN for
      libwww. I seem to recall I came across some when I wrote our internal
      Perl-based link checker a few years ago.
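      Something like this sketch, say. The URL and the tag name are
      placeholders:

          # Step 1: mirror the site into ./site
          wget --mirror --no-parent -P site http://www.example.com/

          # Step 2: list every fetched page that does NOT contain the tag
          find site -name '*.htm*' | xargs grep -L 'page_tag.js' > missing_tag.txt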


      > - Your site doesn't need a login or session cookies to be viewed.

      The advice is incorrect: wget can handle both. Or use curl, which is
      scarily sophisticated and powerful.
      From the wget man page:
      --user=user
      --password=password
      --keep-session-cookies

      All of which do pretty much what they say. :-)
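      The curl equivalent is similarly short. A sketch, with made-up URLs
      and form field names:

          # Log in and save the session cookie...
          curl -c cookies.txt -d 'user=me&pass=secret' http://www.example.com/login

          # ...then replay it for the protected pages.
          curl -b cookies.txt http://www.example.com/members/index.html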


      > These are some conditions - not a complete list - that would keep this
      > from being an optimal approach. The best alternative is to utilize a
      > product that can handle all those conditions, regardless of the site's
      > operating system, web server, content creation solution or
      > methodology, or web analytics vendor.

      Sure. No real disagreement. But there are also heaps of trivial
      solutions that could be used to make life easier and avoid many of
      these issues.


      Aside: The big plus that Unix (as a collective) has over many other
      systems is the most amazing array of simple but highly powerful tools
      that can be easily glued together to do tasks that would be several
      days of effort in any programming language.


      Maybe wget and grep won't do it. But a combo of wget, find, sed, tidy,
      egrep, uniq, sort and wc may. It may not be perfect, but it may get
      you easily 70% of the way there. And 70% is a huge improvement on 0%.
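      By way of illustration, a quick tally along those lines against a
      mirrored ./site tree (the tag name is again a placeholder):

          find site -name '*.htm*' | sort > all_pages.txt
          grep -rl 'page_tag.js' site | sort > tagged_pages.txt
          comm -23 all_pages.txt tagged_pages.txt    # pages missing the tag
          wc -l all_pages.txt tagged_pages.txt       # rough coverage numbers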


      The flip side is that if the problems raised don't exist for a
      simple site, then they are not problems, and Tim's suggestion still
      stands. And if you ain't got the money, you ain't got the money. :-)


      I'm really quite tempted to accept the thrown gauntlet just to truly
      satisfy my own burning curiosity as to how hard or easy the problem
      actually is, versus making an educated guess. Sounds like an
      interesting challenge to burn a few hours or so. And it's been a few
      months since I've done any serious Perl hacking. Hmmmm. And I could
      finally have a decent excuse to try out the perl pthread libraries.
      Have had lots of fun with those in C programs I've written.
      Hmmmmmmmmmmm....
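      (If anyone's wondering what that might look like: a hedged sketch
      using Perl's ithreads, with check_page left as a stub.)

          #!/usr/bin/perl
          use strict;
          use warnings;
          use threads;

          sub check_page {
              my ($url) = @_;
              # real code would fetch $url and grep it for the tag
              return "$url: checked";
          }

          my @urls    = map { "http://www.example.com/page$_.html" } 1 .. 4;
          my @workers = map { threads->create(\&check_page, $_) } @urls;
          print $_->join(), "\n" for @workers;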


      It's not too unreasonable to assume that JavaScript page tagging will
      become more sophisticated in Open Source analysis packages in the
      future. AWStats has a very simple one already. And with that
      increasing sophistication, a matching solution to verify the tags
      becomes necessary too. A solution will follow as surely as night
      follows day.


      I would argue not so much about how hard or easy the problem is, but
      rather about the additional value that Maxamine brings to solving the
      problems. Technical solutions are achievable; value-add is harder.
      Marketing and sustaining that value-add is something else again.


      >
      > Debbie Pascoe
      > MAXAMINE, Inc.
      >
      > >
      > > If you are UNIX-literate (or have such folks available to you), a simple
      > > 'wget/grep' command should be enough.
      > >


      Cheers!


      - Steve, Unix Guru.

      Tho I believe my actual position title, for what little meaning or
      even relevance a position title holds, is: "Senior Unix Systems
      Administrator".

      Guru sums it up nicely and clears away the clutter. :-)
    • Stephane Hamel
      Message 2 of 12, Sep 22, 2006
        Hi Lesley,
        you might want to check my post at
        http://shamel.blogspot.com/2006/09/web-analytics-solution-profiler-wasp.html
        where I list a couple of available solutions, but more interestingly,
        I expose my Web Analytics Solution Profiler (WASP) idea.

        Comments and suggestions from this group are welcome!

        Stephane

        --- In webanalytics@yahoogroups.com, "hunter_analytics"
        <hunter_analytics@...> wrote:
        >
        > Hi
        >
        > Does anyone know of or use a tool that is able to scan a site to
        > report on the integrity of the tracking page tags? Whether the code
        > is missing, rendered incorrectly etc.
        >
        > Any hints or pointers in right direction would be much appreciated.
        >
        > Regards
        >
        > Lesley
        >
      • Debbie Pascoe
        Message 3 of 12, Sep 25, 2006
          Steve,

          It was great fun reading your point-by-point response, and I can tell
          that you enjoyed the exercise :-)

          Your last observation is the crucial thing. In this exchange, we have
          discussed solving one particular site structure problem (determining
          whether the tag implementation is correct and complete).

          Site owners have to deal with many other issues, like checking for
          site defects, evaluating their privacy implementations, determining if
          blind people can interact with their site (ref. the current lawsuit
          against Target by the National Federation of the Blind), being able to
          respond at a moment's notice if they drop a product or terminate an
          exec and need to remove all references, and more recently, monitoring
          their employee-written blogs to be sure authors and respondents are
          staying within bounds. These are just a few issues that we see every
          day - there are lots of others I didn't mention.

          Websites are growing increasingly large, complex and dynamic,
          compounded by an increasing number of user-contributors. Both the
          challenge and the importance of maintaining the quality of website
          implementations are on the rise. We are seeing this clearly within
          companies that seek us out.

          The added value that you so correctly point out is in treating these
          issues as multiple facets of the same problem, and addressing them all
          with one high performance site analytics solution.

          Debbie Pascoe
          MAXAMINE, Inc.


          --- In webanalytics@yahoogroups.com, Steve <nuilvows@...> wrote:
          >
          > On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
          > > I checked with one of our Unix experts, who says that you would be
          > > able to use wget and grep to download a page and look for page tags
          > > and links and check the page tagging integrity, assuming that:
          >
          > Well not quite... :-)
          [...]
          > I would argue not so much about how hard or easy the problem is, but
          > rather about the additional value that Maxamine brings to solving the
          > problems. Technical solutions are achievable; value-add is harder.
          > Marketing and sustaining that value-add is something else again.

          > - Steve, Unix Guru.
          >
          > Tho I believe my actual position title, for what little meaning or
          > even relevance a position title holds, is: "Senior Unix Systems
          > Administrator".
          >
          > Guru sums it up nicely and clears away the clutter. :-)
          >