Overcoming limitations of Web stats
- All Website access statistics, whether produced by Webalizer or some other package, suffer from at least one important limitation - they don't distinguish between access requests from humans and those from Web crawlers (good or bad). That's not a problem if you're just looking at capacity management issues, but it's a huge problem if you want to learn about which parts of your Website are valuable to your human visitors.
I think I've figured out a way to overcome that limitation (mostly), so I'm writing to y'all to lay out my ideas, report progress, and ask for comments. And the process of writing this will help to clarify what I'm thinking.
The raw data contains no explicit indication of what caused each access attempt, but it does contain clues from which useful information can be deduced. After looking at raw data several different ways, I've come up with what I hope is a useful approach, based on agent-id.
The basic approach is to use those clues to make a best guess as to which agents are used by humans and which are bots. Then use that distinction to split each access log into two or three parts (human, bot, and maybe indeterminate). Finally, use Webalizer to re-analyze whichever part(s) I/we find interesting.
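The split itself is simple enough to sketch in a few lines of Python. This is a sketch, not a finished tool: it assumes the log is in Apache "combined" format (the usual Webalizer input, with the user-agent as the last quoted field), and it takes the human/bot agent sets as given - building those sets is what the clues below are for.

```python
import re

# In Apache "combined" format the user-agent is the last quoted field.
AGENT_RE = re.compile(r'"([^"]*)"\s*$')

def split_log(log_path, bot_agents, human_agents):
    """Split one access log into human/bot/indeterminate parts, each
    still in combined format so Webalizer can re-analyze it."""
    outs = {name: open(log_path + "." + name, "w")
            for name in ("human", "bot", "indet")}
    with open(log_path) as log:
        for line in log:
            m = AGENT_RE.search(line)
            agent = m.group(1) if m else ""
            if agent in bot_agents:
                outs["bot"].write(line)
            elif agent in human_agents:
                outs["human"].write(line)
            else:
                outs["indet"].write(line)
    for f in outs.values():
        f.close()
```

Each output file stays in combined format, so pointing Webalizer at, say, access.log.human needs no configuration changes beyond the filename.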
So what are the clues?
1. Anything that requests "favicon.ico" is most probably a browser used by a human, while anything that requests "robots.txt" is almost certainly a bot. A tiny number of agents ask for both, and a large number ask for neither, so those are indeterminate at this point. I've written an analysis program, which provided the following figures from our first full month of operation:
- nearly 93000 access requests
- from just over 1700 agents
- of which 540 occurred only once
- leaving 1170 reasonably analyzable
  - of which over 700 requested favicon.ico
  - and over 100 requested robots.txt
  - (though 10 requested both)
  - leaving over 300 requesting neither
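The per-agent tally behind those figures can be reproduced with one pass over the log. Again this assumes combined format, where the request line, referrer, and user-agent are the three quoted fields:

```python
import re
from collections import defaultdict

# Combined format has three quoted fields: request, referrer, user-agent.
QUOTED = re.compile(r'"([^"]*)"')

def tally_agents(log_path):
    """Per agent: total hits, and whether it ever asked for
    favicon.ico or robots.txt."""
    hits = defaultdict(lambda: {"count": 0, "favicon": False, "robots": False})
    with open(log_path) as log:
        for line in log:
            fields = QUOTED.findall(line)
            if len(fields) < 3:
                continue  # malformed or truncated line
            request, agent = fields[0], fields[2]
            rec = hits[agent]
            rec["count"] += 1
            if "favicon.ico" in request:
                rec["favicon"] = True
            if "robots.txt" in request:
                rec["robots"] = True
    return hits
```

From the returned table, favicon-only agents go in the "human" set, robots-only agents in the "bot" set, and agents matching both or neither (or with count == 1, per clue 2 below) stay indeterminate.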
2. An agent which is only seen once in a month is most probably a bot, especially if its ID looks like a random string of characters. (See above for counts.)
3. Some access requests identify a referrer [typically misspelled "referer" in server documentation], but most do not. Those which do probably come from humans, though this clue requires further investigation. So far all I can say is that about 30% of our first month's requests came via a referral, and about 2/3 of those referrals were from our own Website.
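Those referral percentages come out of the same kind of pass over the log. The hostname below is a placeholder for whatever your own site is called; a missing referrer is logged as "-" in combined format:

```python
import re

# Combined format: request, referrer, user-agent are the quoted fields.
QUOTED = re.compile(r'"([^"]*)"')

def referrer_stats(log_path, own_host="www.example.org"):
    """Return (total, with_referrer, self_referrals) counts."""
    total = with_ref = self_ref = 0
    with open(log_path) as log:
        for line in log:
            fields = QUOTED.findall(line)
            if len(fields) < 2:
                continue
            total += 1
            referrer = fields[1]
            if referrer not in ("-", ""):
                with_ref += 1
                if own_host in referrer:
                    self_ref += 1
    return total, with_ref, self_ref
```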
4. Successful requests for items contained in a password-protected directory must come from humans.
5. Requests for items which were never on our Website (e.g., PHP files - YMMV) must be intrusion attempts, and those are almost certainly carried out by bots.
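Clues 4 and 5 need only the request path, so they can be checked per line. The protected-directory name and the .php test below are placeholders for whatever applies on your own site, and a fuller version would also check the HTTP status code, since clue 4 counts only successful requests:

```python
import re

# Capture the path from the quoted request line.
REQUEST = re.compile(r'"(?:GET|POST|HEAD) ([^ "]+)')

def classify_path(line, protected_prefix="/members/"):
    """Rough verdict from the request path alone."""
    m = REQUEST.search(line)
    if not m:
        return "indeterminate"
    path = m.group(1)
    if path.startswith(protected_prefix):
        return "human"       # clue 4: password-protected area
    if path.endswith(".php"):
        return "bot"         # clue 5: we never served PHP
    return "indeterminate"
```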
Further programming is required to investigate these clues more deeply and to make use of the results. Meanwhile ...
Questions? Comments? Suggestions?
(and does anyone have a Webalizer binary for Mac OS X - any version?)
Carl Scott Zimmerman