Re: [webanalytics] Re: Wanted: a tool that can scan a site to report on integrity of tracking code
On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
> I checked with one of our Unix experts, who says that you would be
> able to use wget and grep to download a page and look for page tags
> and links and check the page tagging integrity, assuming that:

Well not quite... :-)
> - Your page tags are always in the same format (i.e. the code is not
>   split across lines differently)

Do it in perl, and slurp the entire file in as a "single line".
RegEx's to ignore line splits are trivial to write. Common even. A
"while loop" to cycle thru the matches is a fairly common construct.
Even multi-megabyte html files would be easy this way. And perl is
*designed* to do this.
Or parse thru tidy first.
Or something similar. Plenty of cool libraries on CPAN to check out.
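For instance, a rough sketch of the slurp-and-match idea (urchin.js is
just an illustrative pattern; substitute whatever your vendor's page
tag looks like):

    #!/usr/bin/perl
    use strict;
    use warnings;

    local $/;                       # undef $/ = slurp mode
    open my $fh, '<', $ARGV[0] or die "open $ARGV[0]: $!";
    my $html = <$fh>;               # the whole file as one "line"

    # because the whole file is in $html, a tag split across lines
    # still matches: [^>] happily spans the line breaks
    while ( $html =~ m{<script[^>]*\bsrc\s*=\s*["']([^"']*urchin\.js)["']}gi ) {
        print "found page tag: $1\n";
    }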
...programmatically (ie a web browser), reversing that wouldn't be hard
at all. It's not like these are undocumented standards. Look for the
pattern(s). Match them.

If you wanted to get super sophisticated, you could grab the ...
> - You can navigate your site without using flash

Never looked inside flash myself, so you could be very correct. :-)
Tho the GNU flash program (Gnash) may have appropriate libs that could
be borrowed or interfaced to, to do this. It's a lot harder, but still
achievable. The joy of Open Source: you don't have to reinvent the
wheel.
Additional: Your acceptance testing should be picking this up.
Key word: "Should". :-)
> - Your site doesn't use <base> html elements

If a spidering tool can't handle the base tag, or any legitimate HTML
tag, then the tool is broken. Submit bug report, get fix overnight.
Problem solved. :-)
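(And handling <base> yourself is nearly free anyway; a quick sketch
with the CPAN URI module and a made-up URL:

    use URI;
    # resolve a relative href against the page's <base href="..."> value:
    my $abs = URI->new_abs('promo/index.html', 'http://www.example.com/shop/');
    print "$abs\n";    # http://www.example.com/shop/promo/index.html

One line per link, job done.)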
> - Your grep command handles <a> elements that are split over multiple
>   lines, and multiple <a> elements per line

See above. Being able to split up multiple things from a single chunk
of data is a common task. Perhaps grep is not the best tool, but awk,
sed and perl certainly are more than capable.
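A throwaway sketch, for the sceptical, that prints every href in a
file no matter how the <a> elements are wrapped:

    perl -0777 -ne 'print "$1\n" while m{<a\b[^>]*href\s*=\s*["\x27]([^"\x27]+)}gi' page.html

-0777 slurps the whole file, so split elements and several per line
both fall out of the same match.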
> - Your grep command handles other navigation elements, such as <frame>
>   or <iframe>

Two-step it. Slurp the site, and glob all files in the resulting tree
with find or some such.
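Something like this, say, with tag_check.pl standing in for the slurp
script sketched above (the name is made up):

    wget -r -l inf -p -np http://www.example.com/   # step 1: slurp the site
    find www.example.com -name '*.html' \
        -exec perl tag_check.pl {} \;               # step 2: check every page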
Or: I'm sure there are some simple spidering libraries on CPAN for
libwww. I seem to recall coming across some when I wrote our internal
Perl-based link checker a few years ago.
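WWW::Mechanize, for one, gives you spidering with cookie handling and
link extraction nearly for free (a minimal sketch):

    use WWW::Mechanize;         # CPAN; built on top of libwww-perl
    my $mech = WWW::Mechanize->new;
    $mech->get('http://www.example.com/');
    print $_->url_abs, "\n" for $mech->links;   # every link, absolutised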
> - Your site doesn't need a login or session cookies to be viewed.

The advice is incorrect: wget can handle both. Or use curl, which is
scarily sophisticated and powerful. The wget man page documents the
relevant options, all of which do pretty much what they say. :-)
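For example (the login URL and form field names are made up; the
options are standard wget):

    # log in once and save the session cookie:
    wget --save-cookies cookies.txt --keep-session-cookies \
         --post-data 'user=me&password=secret' \
         http://www.example.com/login

    # then crawl with that session:
    wget --load-cookies cookies.txt -r -l inf -p http://www.example.com/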
> These are some conditions - not a complete list - that would keep this
> from being an optimal approach. The best alternative is to utilize a
> product that can handle all those conditions, regardless of the site's
> operating system, web server, content creation solution or
> methodology, or web analytics vendor.

Sure. No real disagreement. But there are also heaps of trivial
solutions that could be used to make life easier too, and avoid many of
those conditions.
Aside: The big plus that Unix (as a collective) has over many other
systems is the most amazing array of simple but highly powerful tools
that can be easily glued together to do tasks that would be several
days of effort in any programming language.
Maybe wget and grep won't do it. But a combo of wget, find, sed, tidy,
egrep, uniq, sort and wc may. It may not be perfect, but it may get
you easily 70% of the way there. And 70% is a huge improvement on 0%.
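To make that concrete, a rough sketch that lists and counts every
mirrored page with no tag in it (urchin.js again standing in for your
vendor's tag):

    find www.example.com -name '*.html' | sort \
        | xargs egrep -iL 'urchin\.js' \
        | tee untagged-pages.txt | wc -l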
The flip side is that if the problems raised don't exist for a
simple site, then they are not problems, and Tim's suggestion still
stands. And if you ain't got the money, you ain't got the money. :-)
I'm really quite tempted to accept the thrown gauntlet just to truly
satisfy my own burning curiosity as to how hard or easy the problem
actually is. Vs making an educated guess. Sounds like an interesting
challenge to burn a few hours or so. And it's been a few months since
I've done any serious Perl hacking. Hmmmm. And I could finally have a
decent excuse to try out the perl pthread libraries. Have had lots of
fun with those in C programs I've written. Hmmmmmmmmmmm....
Page tagging will likely become more sophisticated in Open Source
analysis packages in the future. Awstats has a very simple one already.
And with that increasing sophistication, the need for a matching
solution to verify same becomes necessary too. A solution will follow
as night surely follows day.
I would argue not so much how hard or easy the problem is, but rather
argue on the additional value that Maxamine adds and brings to solving
the problems. Technical solutions are achievable, value add is harder.
Marketing and sustaining that value add is something else again.
> Debbie Pascoe
> MAXAMINE, Inc.
> > If you are UNIX-literate (or have such folks available to you), a simple
> > 'wget/grep' command should be enough.
- Steve, Unix Guru.
Tho I believe my actual position title, for what little meaning or
even relevance a position title holds, is: "Senior Unix Systems
Guru". Which sums it up nicely and clears away the clutter. :-)
Hi Lesley,
you might want to check my post at
where I list a couple of available solutions, but more interestingly,
I expose my Web Analytics Solution Profiler (WASP) idea.
Comments and suggestions from this group are welcome!
--- In email@example.com, "hunter_analytics" wrote:
> Does anyone know of or use a tool that is able to scan a site to
> report on the integrity of the tracking page tags? Whether the code
> is missing, rendered incorrectly etc.
> Any hints or pointers in right direction would be much appreciated.
It was great fun reading your point-by-point response, and I can tell
that you enjoyed the exercise :-)
Your last observation is the crucial thing. In this exchange, we have
discussed solving one particular site structure problem (determining
whether the tag implementation is correct and complete).
Site owners have to deal with many other issues, like checking for
site defects, evaluating their privacy implementations, determining if
blind people can interact with their site (ref. the current lawsuit
against Target by the National Federation of the Blind), being able to
respond at a moment's notice if they drop a product or terminate an
exec and need to remove all references, and more recently, monitoring
their employee-written blogs to be sure authors and respondents are
staying within bounds. These are just a few issues that we see every
day - there are lots of others I didn't mention.
Websites are growing increasingly large, complex and dynamic,
compounded by an increasing number of user-contributors. Both the
challenge and the importance of maintaining the quality of website
implementations are on the rise. We are seeing this clearly within
companies that seek us out.
The added value that you so correctly point out is in treating these
issues as multiple facets of the same problem, and addressing them all
with one high-performance site analytics solution.
--- In firstname.lastname@example.org, Steve <nuilvows@...> wrote:
> On 9/22/06, Debbie Pascoe <dpascoe@...> wrote:
> > I checked with one of our Unix experts, who says that you would be
> > able to use wget and grep to download a page and look for page tags
> > and links and check the page tagging integrity, assuming that:
> Well not quite... :-)
> I would argue not so much how hard or easy the problem is, but rather
> argue on the additional value that Maxamine adds and brings to solving
> the problems. Technical solutions are achievable, value add is harder.
> Marketing and sustaining that value add is something else again.
> - Steve, Unix Guru.
> Tho I believe my actual position title, for what little meaning or
> even relevance a position title holds, is: "Senior Unix Systems
> Guru". Which sums it up nicely and clears away the clutter. :-)