On Aug 18 03:38:37 (UTC+2) a misbehaving backup script filled the server's entire hard drive. This crashed a number of services, including the database and the cache, which are essential parts of the system. As a result almost the entire site returned 500 errors; the only exceptions were pages that nginx had cached.

Nginx is set to micro-cache all pages on this site, and when the backend throws exceptions or returns other errors it serves stale (expired) versions of the cached page. Unfortunately this included the page the health checker uses to verify that the site is OK, so the health check kept passing and the problem went unnoticed.
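As an extra safeguard, the health checker itself could refuse to trust a stale answer. The sketch below is only an illustration, not what this site runs: it assumes nginx has been configured to expose its `$upstream_cache_status` variable in a response header, here hypothetically named `X-Cache-Status`, and the function name `health_check_passed` is made up for the example.

```python
# Cache states in which nginx is serving expired content.
STALE_STATES = {"STALE", "UPDATING"}


def health_check_passed(status_code, headers, cache_header="X-Cache-Status"):
    """Return True only for a 200 response that was not served from stale cache.

    `headers` is a plain dict of response headers. `cache_header` is an
    assumed header name; nginx only sends it if configured to do so.
    """
    if status_code != 200:
        return False
    # Normalise header names so the lookup is case-insensitive.
    normalised = {name.lower(): value for name, value in headers.items()}
    cache_status = normalised.get(cache_header.lower(), "").strip().upper()
    return cache_status not in STALE_STATES
```

With a check like this, a 200 response marked `STALE` would count as a failure, which is exactly the case that slipped through here.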

The problem was fixed 3.5 hours later. It was an unfortunate series of events, but there are still a few lessons to take from it:

  1. Don’t cache the health check URL.
  2. Keep your backup scripts in order.
  3. Notice when the hard drive is getting full…
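For the third point, a small script run from cron could raise the alarm before the disk is completely full. This is a minimal sketch, not the monitoring actually used here: the 90% threshold, the checked path `/` and the function names are assumptions chosen for illustration.

```python
import shutil
import sys


def disk_usage_percent(path="/"):
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path="/", threshold_percent=90.0):
    """Return True when usage is at or above the (assumed) threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    if is_disk_nearly_full("/"):
        # A non-zero exit code lets cron or a monitoring tool flag the run.
        print(f"WARNING: disk is {percent:.1f}% full", file=sys.stderr)
        sys.exit(1)
    print(f"OK: disk is {percent:.1f}% full")
```

How the warning reaches a human (mail from cron, a monitoring service, a chat hook) is left open; the point is only that the check runs independently of the site itself.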
