50 Shades of "Down" (Drupal Style)

By mrbagnall | 01 August 2020

We've all gotten it. The call. The text. The email. "The site is down". Nothing more. Nothing less. It's that nebulous nugget of tech support clickbait that anyone in a position of responsibility is forced to drill down on, just to make sure we've properly covered our asses, even though it is most likely a content error, a user error, or a difference in perception between what "up" really means and what "down" really is.

Right now, I think we should all have a serious talk about "down". What is "down"? What is "up"? What can you delegate, and when do you need to roll up your sleeves?

There is no definition of the word "down" that covers what we're talking about. If there were, it might look something like this:

down, pronounced: /doun/ : adj.

  1. A state of being unavailable
  2. An unexpected inability of a system to consistently and routinely perform its expected function under ordinary operational conditions.

Without getting into a lengthy diatribe on semantics, the word "down" means any situation in which your web site is unavailable, inaccessible, or otherwise throwing errors, installation screens, or database exceptions under what are supposed to be normal operations. We, as Drupal developers, have other, more nuanced definitions of "down" and a pretty standard one of "up".

1. **DOWN** - This means the site is DOWN. You get 502, 500, or connection refused situations where the request either isn't getting to the site or is generating an error when it is made. These are more than likely caused by a coding, networking, system or other technical failure somewhere in a stack that includes, but is not necessarily limited to, Drupal. This could include white screens of death but DOES NOT INCLUDE the "Site produced an unexpected error" result. Those are normally handled by #2 below. Other examples of your site being "DOWN":

  1. You get the standard Drupal install screen when you try to go to the home page. I cannot even begin to describe to you how bad this is and how much cold sweat it causes for developers and system administrators alike.
  2. You get a browser default screen with something like "ERR_CONNECTION_REFUSED" or another message that indicates a possible DNS or Apache related issue that has nothing to do with Drupal.
  3. White screen of death. No output. No nothing. Just a white screen. Commonly introduced by leftover print-style debug output, but it can be caused by other errors not trapped by Drupal.

2. "Down" - This means the site is throwing unexpected or unpredictable output. The web site is responding and available, but it is not operating as it should. These are the kinds of issues most often fixed by clearing the cache on your Drupal site, after which it all just magically works again. Popular example:

  1. "The website encountered an unexpected error. Please try again later." - While these **can** be an issue under #1, usually a cache clear will fix this on a production system. Other causes include the inability to access the database (credentials, database being down). It's not **DOWN** because the server is responding. Drupal is just not happy.

3. down. - This means the site has a content related issue: a recently created view that wasn't thoroughly tested, or a workflow issue that wasn't properly handled by site managers. It typically has nothing to do with code or site operations but gets escalated to the lead architect or system administrator as part of the opening triage. An example of this would be an incorrect revision of code or improperly sorted data in a table or view.

4. Up. Do you really need me to define this?
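The first two shades above are mechanical enough to sketch as a triage helper. This is only a rough illustration in Python, not a monitoring tool: the function name `classify`, its return labels, and the `install.php` marker (based on the assumption that an uninstalled site sends you to the installer) are all made up for this example and should be adjusted for your own site. The third shade, lowercase down, is a content problem that no status code will reveal, so the sketch just hands it to a human.

```python
from typing import Optional

# Assumed marker strings; adjust them for your own site and Drupal version.
UNEXPECTED_ERROR_TEXT = "The website encountered an unexpected error"
INSTALL_PATH_MARKER = "install.php"  # assumption: an uninstalled site lands on the installer


def classify(status: Optional[int], body: str = "", final_url: str = "") -> str:
    """Map one HTTP observation onto the shades of "down" described above.

    status is None when no HTTP response arrived at all
    (connection refused, DNS failure, timeout).
    """
    if status is None:
        return "DOWN"  # the request never reached the site
    if INSTALL_PATH_MARKER in final_url:
        return "DOWN"  # the dreaded install screen
    if UNEXPECTED_ERROR_TEXT in body:
        return "Down"  # Drupal is responding, just not happy; try a cache clear
    if status >= 500:
        return "DOWN"  # 500, 502 and friends
    if not body.strip():
        return "DOWN"  # white screen of death: a response with nothing in it
    return "needs a human"  # up, or a content issue only a person can spot
```

Feed it whatever your HTTP client of choice observed: the status code (or `None` if the connection failed), the response body, and the URL you ended up on after redirects. The unexpected-error check runs before the status check on purpose, since that page tends to arrive with a 500 status and would otherwise be filed under DOWN.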

These are a few examples of the nuanced ways in which Drupal sites can be unavailable or produce undesired content. As a general rule, if your web site is working but doing something strange, clear the cache. In my experience this fixes the problem a good 70-80% of the time. I have fixed many "serious outages" doing this. Other things cache clearing fixes:

  1. Render arrays rendering as arrays and not content
  2. Style changes as part of a deployment
  3. Content revisions that are also behind Varnish caching
  4. Content revisions behind a CDN such as Akamai, CloudFront or AzureEdge.

The list goes on.
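For anyone who wants to script the "clear the cache first" reflex, here is a minimal sketch. It assumes Drush is installed and that the site runs Drupal 8 or later, where the command is `drush cache:rebuild` (alias `cr`); on Drupal 7 the equivalent is `drush cc all`. The helper names `cache_rebuild_command` and `rebuild_cache` are hypothetical, and whether a rebuild also flushes Varnish or a CDN depends on how your site is wired up.

```python
import subprocess
from typing import List, Optional


def cache_rebuild_command(drush: str = "drush", uri: Optional[str] = None) -> List[str]:
    """Build the Drush command line for a cache rebuild (Drupal 8+ assumed)."""
    cmd = [drush]
    if uri:
        # --uri tells Drush which site to target in a multisite setup.
        cmd.append(f"--uri={uri}")
    cmd.append("cache:rebuild")
    return cmd


def rebuild_cache(drush: str = "drush", uri: Optional[str] = None) -> int:
    """Run the rebuild and return Drush's exit code (0 means success)."""
    result = subprocess.run(cache_rebuild_command(drush, uri), check=False)
    return result.returncode
```

Calling `rebuild_cache("vendor/bin/drush", "https://example.com")` from the project root would run the rebuild against that site; both the path and the URL here are placeholders.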

If you can think of scenarios I should add to this list, email them to me at: mbagnall@flyingflip.com and I'll update this article and credit you if you like.

Managers, CEOs, CTOs and others will use words like "down" to manipulate you into acting. The key is to get them to describe what they are actually seeing instead of just labeling it, because saying something is down tells you basically nothing. Tell me about the behavior. Send me a screenshot. These will help me 10x more than a nebulous report of a site being "DOWN", "down" or down.