A guide to site monitoring

One of the most important parts of running an excellent, available site is knowing what it's doing. Time and again I see web businesses where the team first learns about problems with their site from their customers, or even -- horror of horrors -- when Google caches error messages on their site, as happened to the Internet World conference last year.

With a tiny investment and a little time, most businesses can gain a great deal of insight into what their site is doing, either by using one of the many third-party monitoring services available or by pulling together open source tools. This post explores some of the challenges around adopting a monitoring system and acts as a guide for those looking to do so. Key benefits of monitoring include:

  • Problems with the site get found, recorded and reported as they happen. This means there are no nasty surprises from customers when the site fails.
  • The business will have a better understanding of how often a problem is occurring, an idea of when it might occur again, and some of the information it needs to decide how much to invest in resolving a particular conversion-blocking problem.
  • More information is available on each problem, which should help the technical teams replicate it and prevent it from recurring.

Which type of monitoring?

Firstly, it's worth clarifying what "monitoring" means in this context. There are a number of different kinds of data businesses might seek to collect about their sites, but I'm focusing solely on data pertaining to the performance and availability of web sites and their supporting infrastructure. Data around user behaviour and conversion (heatmaps, A/B testing, MVT, analytics, etc.) are quite separate. I don't discuss capacity planning in this article, but it is an important discipline which can be supported by much of this monitoring information.

Adopting a monitoring system or systems can be challenging, and as with any exercise in collecting information about potential future or intermittent events it is hard to know which data to collect. A lot of data are available, and to avoid being swamped one should only collect what matters. There's something of a sliding scale as sites increase their revenue or importance to the business, although not all data is appropriate for every site. Those seeking to monitor their site might prioritise their requirements in the order below:

  • Simple external availability, external response times, and false-positive handling (auto-retry); a minimal sketch of these first checks follows this list
  • Response code monitoring and simple phrase matching (the latter is more important if you fear pages being changed or the application doesn't send correct response codes, as sometimes happens with legacy ASP applications)
  • Full page request times: separating dynamic page generation from static content, much like Firebug or Selenium do at a client level. This should also provide data on the onLoad event, and total JavaScript render time.
  • Page request breakdown (showing how the total download time is composed of DNS lookup, TCP connection time, page download time / throughput, etc.)
  • Transactional monitoring (ecommerce sites might have log in, log out, register, sign up for newsletter, add to basket, checkout... don’t forget to use fake credit card details & exclude these orders from your reporting!)
  • Competitor monitoring and performance comparison
  • International monitoring and trace-routing (particularly important for international businesses, as the other side of the world is many milliseconds away)
  • A reporting API to mine data, and outage / event annotation (so that comments can be logged against outages for use in subsequent reporting, and output as an incident log). Automatic comparison against a service level agreement (SLA) from within a monitoring tool can save a lot of time.
  • Related service monitoring: CDN, DSA, DNS, SMTP, MX, third-party tags
  • ...a whole bunch of other clever stuff, including custom alert mechanisms, event triaging, trending and prediction, 95th percentile analysis, lengthy and granular data retention, and keep-alives (i.e. alerting if a regular check-in from the application doesn't happen)
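As promised above, here is a minimal sketch of the first two requirements: an external availability check with response-time measurement, response-code validation, a simple phrase match, and auto-retry to filter out false positives. It's plain stdlib Python, and the URL, expected phrase and thresholds are placeholders rather than anything from a real deployment.

```python
import time
import urllib.error
import urllib.request

URL = "https://www.example.com/"   # hypothetical URL to monitor
EXPECTED_PHRASE = "Add to basket"  # a phrase the page should always contain
TIMEOUT_SECONDS = 10
RETRIES = 2                        # auto-retry to filter out false positives


def check_site(url):
    """Run one availability check; return (ok, detail)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # Non-2xx responses are raised as HTTPError by urlopen
        return False, f"unexpected response code {exc.code}"
    except Exception as exc:  # DNS failure, timeout, connection refused...
        return False, f"request failed: {exc}"
    elapsed = time.monotonic() - started
    if EXPECTED_PHRASE not in body:
        return False, "expected phrase missing (page changed or error page served)"
    return True, f"OK in {elapsed:.2f}s"


def monitor(url):
    # Only alert once every attempt has failed, so that a single
    # transient blip doesn't page anyone at 3am.
    for attempt in range(1 + RETRIES):
        ok, detail = check_site(url)
        if ok:
            print(detail)
            return
        print(f"attempt {attempt + 1} failed: {detail}")
    print(f"ALERT: {url} is down")  # hand off to SMS/email alerting here


if __name__ == "__main__":
    monitor(URL)
```

A real external monitor runs checks like this on a schedule from several locations, and hands failures to an alerting system rather than printing them.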

It should be apparent that very early stage startups have relatively little to worry about here. At the risk of making some sweeping generalisations, they likely have other, bigger problems to deal with, and many startups have less complex, better-factored platforms (because they're almost certainly greenfield). Startups competing for funding typically derive their competitive advantage from their teams, rather than from beating any established competitors' uptime.

The real cost of downtime

TechRepublic have a good article on the cost of downtime, but they consider only the direct revenue impact. For a large ecommerce business half an hour of outage can mean a significant hit to revenue: as a rough illustration, a site turning over £50m a year takes an average of roughly £5,700 an hour, and outages tend to strike during peak trading rather than the quiet hours. There are also considerations for businesses offering SaaS platforms or other services: their SLAs with clients may have punitive clauses for outages. There's a further reputational impact to downtime. Blaine Cook, Twitter's first architect, drew a lot of fire over the repeated downtime Twitter ran into whilst it was scaling up.

Solving the requirements for availability (external) monitoring

Site availability and performance monitoring is often described as "external" monitoring because it is best performed from outside your own infrastructure. Failure of vital internal infrastructure, such as a network line, alerting system or database, could render your monitoring system useless, so it is considered best practice to externally monitor (and alert on) a business's infrastructure. It makes a lot of sense to have two external monitoring systems, as they're relatively inexpensive and it is much less likely for two external systems to fail simultaneously. These external monitoring systems are generally used to make fairly simple checks on the site (and often on the internal monitoring system), whilst internal monitoring systems report on a much more granular set of data.

A large number of external hosted monitoring systems are available over the web. site24x7, wasitup and howsthe.com are all simple to set up and provide basic monitoring functionality. Aware, Pingdom, Axzona and SciVisum are more advanced, and Gomez and SiteConfidence are the costliest but most feature-rich.
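The request breakdown these services report (item four in the list above) can be approximated by hand. The sketch below times the DNS lookup, TCP connection, time to first byte and full download separately, again in stdlib Python with a placeholder host; commercial tools do this far more precisely and from many locations.

```python
import http.client
import socket
import time

HOST = "www.example.com"  # hypothetical host to measure
PATH = "/"

# DNS lookup time
start = time.monotonic()
address = socket.gethostbyname(HOST)
dns_time = time.monotonic() - start

# TCP connection time (to the resolved address, so DNS isn't counted twice)
start = time.monotonic()
sock = socket.create_connection((address, 80), timeout=10)
connect_time = time.monotonic() - start
sock.close()

# Time to first byte and total download, over a fresh connection
conn = http.client.HTTPConnection(HOST, 80, timeout=10)
start = time.monotonic()
conn.request("GET", PATH)
response = conn.getresponse()  # returns once the status line arrives
ttfb = time.monotonic() - start
body = response.read()         # pull down the rest of the page
total = time.monotonic() - start
conn.close()

print(f"DNS lookup:    {dns_time * 1000:7.1f} ms")
print(f"TCP connect:   {connect_time * 1000:7.1f} ms")
print(f"First byte:    {ttfb * 1000:7.1f} ms")
print(f"Full download: {total * 1000:7.1f} ms ({len(body)} bytes)")
```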

...and moving on to service (internal) monitoring

There are a number of products which provide robust internal server and service monitoring; SolarWinds and Nagios are among the most common. The typical breadth and frequency of checks make these systems impractical to run or host externally, and businesses rarely employ them until they reach a certain scale. Nagios is fiddly but free, whereas SolarWinds is simpler but costly. These systems can monitor just about anything: line performance, terminal adapters, routers, WMI or SNMP counters, disk space, CPU usage, event logs. The monitored data can be so rich that event aggregation or heuristic systems, such as Splunk or OSSEC, are needed. This is much more complicated than external monitoring, and is best addressed in a separate blog post.
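To give a flavour of how extensible these systems are, below is a minimal sketch of a custom disk-space check written in the style of a Nagios plugin. It follows the standard plugin convention: one line of status text (with optional performance data after a "|") and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The thresholds are illustrative, not recommendations.

```python
#!/usr/bin/env python3
import shutil
import sys

WARN_PERCENT = 80  # warn when the disk is this full (illustrative)
CRIT_PERCENT = 90  # critical when the disk is this full (illustrative)
PATH = "/"         # filesystem to check


def main():
    try:
        usage = shutil.disk_usage(PATH)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return 3  # UNKNOWN: the check itself failed
    percent_used = usage.used / usage.total * 100
    # Performance data lets Nagios-compatible tools graph the value over time
    perfdata = f"used={percent_used:.1f}%;{WARN_PERCENT};{CRIT_PERCENT}"
    if percent_used >= CRIT_PERCENT:
        print(f"DISK CRITICAL - {percent_used:.1f}% used | {perfdata}")
        return 2
    if percent_used >= WARN_PERCENT:
        print(f"DISK WARNING - {percent_used:.1f}% used | {perfdata}")
        return 1
    print(f"DISK OK - {percent_used:.1f}% used | {perfdata}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the contract is just "print a line, set an exit code", plugins like this can be written in any language and wired into the scheduler alongside the built-in checks.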

Nagios dashboard screenshot

A word on Server Density and Landscape

David Mytton and his Boxed Ice team built a great product in Server Density, and deservedly did very well in 2009's Seedcamp. If you're not familiar with Server Density, you should check it out now. They've set an excellent standard for Web 2.0 startup application design in the UK, and David's blogging is a treat. To sum up Server Density in their own words, it's "easy server monitoring".

Server Density isn’t a new take on an existing product, but rather the confluence of two existing solutions: external availability / performance monitoring and internal service monitoring and management.

It's a neat application, but it seems too expensive for an external monitoring tool, and not broad enough to serve as a full internal service monitoring tool. It is hard to see it ever becoming so in its hosted model, given that the monitored servers sit behind the firewall and should not publish the granularity of data that a full service monitoring tool would require. With custom plugins and their self-hosted model they'll be able to report on more interesting data, such as latency between certain hosts, or esoteric counters on proprietary pieces of hardware and software, but this may take a bit of hacking. Canonical -- Mark Shuttleworth's company, which builds Ubuntu -- has a similar product for Linux servers called Landscape.

Visibility of metrics

Finally, most of these systems provide for SMS and email alerting, and often provide fairly simple web dashboards. Nick and Simon at Aware recently released a public uptime report for Huddle, which is clearly suitable for businesses offering a public web service. Whilst none of the web-based monitoring services can really provide excellent at-a-glance reports showing the performance of the business or application, a new wave of businesses is starting to provide solutions for this. Geckoboard, LeftTronic and Dashboard.me all look very exciting. Geckoboard provide a link to one of their public dashboards. Composite monitoring bliss.

Screenshot of Geckoboard
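As for the alerting itself, wiring a failed check to an email is straightforward. Below is a minimal sketch using Python's standard library; the SMTP relay and addresses are placeholders, and in practice the relay should be external so alerts still go out when internal infrastructure is down.

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.com"      # hypothetical external relay
ALERT_FROM = "monitor@example.com"  # placeholder sender
ALERT_TO = "oncall@example.com"     # placeholder recipient


def send_alert(subject, body):
    """Send a plain-text alert email via the configured relay."""
    message = EmailMessage()
    message["Subject"] = subject
    message["From"] = ALERT_FROM
    message["To"] = ALERT_TO
    message.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(message)


# Example: called from a check like the one sketched earlier.
# send_alert("ALERT: www.example.com down", "Check failed after 3 retries.")
```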

Wrapping up

There's a bewildering array of choices to make when setting up monitoring, but the number of competing companies has brought the price of reliable monitoring down to a lower point than ever. It's easily possible to get meaningful, reliable monitoring for tens of dollars a month.