Online v. Log Analytics
I was reading Scott’s post on WASP (gotta install that, I’m a busybody, too!), and he talks a little about how some of the big sites don’t use online analytics (services like Google Analytics, Omniture and CoreMetrics) - presumably they do log file analysis instead. It reminded me of a talk I recently had with some guys at another major website who also have their own proprietary log analysis tools. So I wondered a bit about the pros and cons.
Here’s the thing: for dead-on accuracy, it is most likely not possible to beat log file analysis. If the website served it, it’s going in the log. A javascript or image based beacon on your site can fail for a wide variety of reasons - not enough time spent on the page to make the call, javascript disabled, javascript incompatible, some weird firewall/adware type blocker - a lot of things could potentially interfere with beacons on the page. So the log files give you the opportunity to be more accurate - but the trick is actually getting the level of accuracy up.
The obvious problem with log files is that you get a lot of traffic from robots and spiders - these shouldn’t be counted toward your analytics numbers, but it is increasingly tough to weed them all out. There are a lot of spiders now and more springing up every day. Online beacons skirt this problem by taking advantage of the fact that spiders don’t execute the javascript at all - so the beacon services don’t even have to worry about them. There are various tricks one can do using cookies, but a cookie won’t appear in your log until after that user’s first hit - they actually have to hit your page before you can cookie them, so that first page view will not carry their information. In this day and age, first visits are also often last visits. So you need to be clever about such things.
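To make the weeding-out concrete, here is a minimal sketch of filtering likely robots out of an access log by User-Agent. It assumes the Apache "combined" log format, and the `BOT_HINTS` list, the helper names and the sample lines are all made up for illustration. A real filter needs a far longer list and constant upkeep, and it still misses any spider that pretends to be a browser.

```python
import re

# Assumes Apache "combined" log format; the last quoted field is the User-Agent.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Illustrative substrings only (hypothetical list, not a complete one).
BOT_HINTS = ("bot", "crawler", "spider", "slurp")


def is_probable_bot(agent: str) -> bool:
    """Guess from the User-Agent string whether a hit came from a robot."""
    agent = agent.lower()
    # A missing User-Agent is treated as suspicious too.
    return agent in ("", "-") or any(hint in agent for hint in BOT_HINTS)


def human_hits(lines):
    """Yield parsed hits whose User-Agent does not look like a robot."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and not is_probable_bot(match.group("agent")):
            yield match.groupdict()


# Made-up sample lines using documentation-only IP addresses.
sample = [
    '192.0.2.10 - - [19/Jun/2007:06:32:00 -0700] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows; U) Firefox/2.0"',
    '198.51.100.7 - - [19/Jun/2007:06:32:01 -0700] "GET / HTTP/1.1" 200 5120 '
    '"-" "ExampleBot/1.0 (+http://example.com/bot.html)"',
]
hits = list(human_hits(sample))
```

Only the first sample line survives the filter. The weakness is exactly the one described above: the filter only knows the spiders you have already heard of.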
Firefox’s precaching is another example of log file difficulty. Firefox will grab a page ahead of time because the page you are currently viewing links to it and hints that you might go there next - if you click that link, it comes up right away; if not, no big deal for you - but that request went right into the logs. That’s a really hard line to know to disregard, since it looks and feels just like a real Firefox page view. Again, javascript neatly skirts the problem.
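One possible way out, sketched here under some assumptions: Firefox marks its link-prefetch requests with an `X-moz: prefetch` request header. A default access log doesn’t record that header, so the sketch assumes you have changed your log format to append it as one extra quoted field at the end of each line (in Apache that would be something like adding `"%{X-moz}i"` to the `LogFormat`). The function name and the sample lines are hypothetical.

```python
def is_prefetch(line: str) -> bool:
    """True if the final quoted field of the log line is 'prefetch'.

    Assumes the server was configured to log the X-moz request header
    as the last quoted field on every line (not the default).
    """
    line = line.rstrip()
    if not line.endswith('"'):
        return False
    # Drop the closing quote, then take everything after the previous quote.
    last_field = line[:-1].rsplit('"', 1)[-1]
    return last_field.strip().lower() == "prefetch"


# Made-up sample lines in "combined format plus one extra field".
sample = [
    '192.0.2.10 - - [19/Jun/2007:06:32:00 -0700] "GET /a HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 Firefox/2.0" "-"',
    '192.0.2.10 - - [19/Jun/2007:06:32:01 -0700] "GET /b HTTP/1.1" 200 512 '
    '"http://example.com/a" "Mozilla/5.0 Firefox/2.0" "prefetch"',
]
real_views = [line for line in sample if not is_prefetch(line)]
```

This only helps for logs written after the format change, and it does nothing for any other precaching tool that doesn’t announce itself.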
The other advantage that online analytics have is that they tend to be run by folks with lots of computing power and the stats you get are close to real time (if delayed by a few hours for some services). This is a real boon as you can see how things are doing during the course of the day and react to that. Log analysis doesn’t inherently prevent you from knowing this information, but most big analysis I’ve heard of (especially for the big sites) happens once a day - so you only find out how you did the day after.
If you’re a big organization, have development resources to devote to analysis and need to suss out really detailed information about the usage of your website - log analysis could be the ticket. You can log a lot of interesting information and develop code that understands the nature of your urls to give you a really specific understanding of your website. If that’s not you, I don’t really see any advantage to using a standard log analysis package over something like Google Analytics. The numbers are going to be wildly different, with online analytics generally coming in significantly lower than any log analysis package, and you’ll have to believe one. I’m guessing that the Google numbers are closer to true than log analysis, just because there are far too many variables for a log analysis package to take care of them all.

June 19th, 2007 at 6:32 am
There are a few other good reasons to use javascript beacon based analytics. Let’s say you have a large heterogeneous webserver farm, in which some pages are served from a bunch of linux/apache servers, while other pages are served from a bunch of windows/iis servers, etc. There is often no good way to concatenate log files from a lot of servers and run a log analysis tool within a reasonable window of time, especially if you’re pushing extra data on each request into the log files to do advanced analysis (and thus have gigantic log files).
Also, the log analysis packages have weak or nonexistent segmentation and even goal tracking capabilities. Admittedly, I’ve mostly used the free kind (awstats, etc.) but the beacon/service approach is clearly where the sophisticated development work is being done.
Of course, if you’re a commerce site and have metrics other than page impressions that you care about, your e-commerce application more than likely has the features you need to do baseline reporting — though, still, you need something more sophisticated to track KPIs like cart abandons, sales funnels, visits-to-orders, etc. Even the iTunes Music Store uses Omniture to do sophisticated path analysis, etc.
Then again, with beacon-based approaches, you have to jump through hoops to analyze your error pages (404, 500, etc.), and beacon-based approaches make it a bit more difficult to accurately count non-html downloads (images, pdfs, etc.). Page weight — and data transfer speed — is probably also something easier to analyze with log files.
I watch my traffic live using ApacheTop, which uses log files to build a live window of traffic size (requests/sec) and destination. That’s something I’ve yet to see Omniture do with the same level of technical detail.
So, like everybody, we end up using several simultaneous methods, including log analysis and a bunch of beacons.
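For what it’s worth, the requests/sec part of that live view is easy to approximate from the log itself. This is a rough sketch, not how ApacheTop actually does it: it assumes each log line carries a bracketed timestamp with one-second resolution (as Apache’s common and combined formats do) and simply counts lines per timestamp. The function name and sample lines are made up.

```python
import re
from collections import Counter

# Assumes a bracketed timestamp such as [19/Jun/2007:06:32:00 -0700].
TIMESTAMP = re.compile(r'\[([^\]]+)\]')


def requests_per_second(lines):
    """Count log lines per timestamp (one bucket per logged second)."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines using a documentation-only IP address.
sample = [
    '192.0.2.10 - - [19/Jun/2007:06:32:00 -0700] "GET / HTTP/1.1" 200 512',
    '192.0.2.10 - - [19/Jun/2007:06:32:00 -0700] "GET /a HTTP/1.1" 200 512',
    '192.0.2.10 - - [19/Jun/2007:06:32:01 -0700] "GET /b HTTP/1.1" 404 128',
]
rates = requests_per_second(sample)
busiest_second, busiest_count = rates.most_common(1)[0]
```

Feeding it the tail of a live log would give a crude running view. Breaking the counts down by destination url would just mean keying the counter on the request path as well.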
June 19th, 2007 at 6:47 am
Good points, all. The thing is, if you are big and doing your own log analysis, you are almost certainly running custom code - I just don’t believe the existing log analysis packages give correct numbers. And if you are running a heterogeneous site, you wouldn’t need to merge different log files together - unless you are saying that the site is load balanced among a set of heterogeneous servers serving the same content, which doesn’t seem likely.
I think in general you are right - most companies simply do not have the resources to devote to analyzing their own logs. Companies like Omniture spend all day every day figuring out smarter ways to slice and dice the data, so I think they are the way almost everyone should go.
June 19th, 2007 at 6:58 am
Sorry, I meant load-balanced servers with different parts of the site served on different platforms. This happens all the time.
Still, I agree with your fundamental point, that companies like Omniture spend tons of resources making their analytics insanely terrific and they tend to be the right way to go.
July 10th, 2007 at 7:02 am
[...] this not to say that page views and vists are not without their problems. But I strongly believe that is possible to accurately count page views (although not everyone does [...]
March 19th, 2008 at 10:45 am
[...] when G birthed Google Analytics into the world? How every website ever immediately started using it? While a significantly smaller [...]