Healthchecks.io Status

Welcome to the Healthchecks.io status page. If there are interruptions to service, we will post a report here, and on our Mastodon account. 

Past incidents

  1. Aug, 2023

    1. Database issues (affected: Notification Sender, Ping API, Dashboard)
      Started: August 10, 2023, 10:42 UTC
      Duration: about 30 minutes (full recovery by 11:45 UTC)
      We're having database overload issues – investigating.
      Post-mortem
      On August 10, 2023, starting from 10:42 UTC, Healthchecks.io had a database server outage that lasted for about 30 minutes. The outage affected the monitoring dashboard (slow or unavailable), the API and ping endpoints (slow or unavailable), and the notification sending process (missed pings caused false notifications later, and notifications were delayed). During the outage the database server was unstable: fluctuating ping times, messages about network interface re-initialization in syslog, and kernel warning messages – a small excerpt below, timestamps in CET:

      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 112s! [swapper/1:0]
      [...]
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
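
      For reference, warnings like these can be picked out of syslog with a simple filter. Below is a minimal sketch in Python – the log path and the search patterns are assumptions taken from the excerpt above, not the exact tooling used here:

      import re

      # Patterns seen during this incident; adjust as needed.
      PATTERNS = [
          re.compile(r"AMD-Vi: Completion-Wait loop timed out"),
          re.compile(r"soft lockup - CPU#\d+ stuck"),
      ]

      def scan_syslog(path="/var/log/syslog"):
          """Yield syslog lines that match any of the known warning patterns."""
          with open(path, errors="replace") as f:
              for line in f:
                  if any(p.search(line) for p in PATTERNS):
                      yield line.rstrip("\n")

      if __name__ == "__main__":
          for hit in scan_syslog():
              print(hit)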

      The instability caused database transaction times to shoot up and unprocessed incoming pings to pile up in queues on the app servers. When a database connection appears stuck, clients are programmed to abort it and open a new connection. These connection retries led to Postgres reaching its configured connection limit.
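
      To illustrate that failure mode, here is a rough sketch in Python using psycopg2 (the DSN, timeout and retry count are made-up values for illustration, not the actual Healthchecks.io client code). Each retry opens a fresh connection, so a stalled server quickly drives the total connection count toward max_connections:

      import psycopg2

      DSN = "dbname=hc user=hc host=db6 connect_timeout=5"  # illustrative DSN

      def fetch_with_retry(sql, params=(), attempts=3):
          """Run a query, aborting and retrying on a fresh connection if it stalls."""
          last_error = None
          for _ in range(attempts):
              conn = None
              try:
                  conn = psycopg2.connect(DSN)
                  conn.autocommit = True
                  with conn.cursor() as cur:
                      # Give up on queries stuck behind a stalled server.
                      cur.execute("SET statement_timeout = '5s'")
                      cur.execute(sql, params)
                      return cur.fetchall()
              except psycopg2.OperationalError as exc:
                  # Timeouts, dropped connections and "sorry, too many clients
                  # already" all end up here. Retrying opens yet another
                  # connection and adds to the pressure on the server.
                  last_error = exc
              finally:
                  if conn is not None:
                      conn.close()
          raise last_error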

      After the database was shut down and the server rebooted (now running a newer kernel), the database started up and, after an initial catch-up period, resumed normal operation.

      It is not yet clear whether the instability was caused by a hardware fault, a kernel bug, or something else. Until now, the server had been running in production for two years with no issues. Hetzner support suggested starting with an upgrade to a newer kernel, which is now done. I will monitor the server and migrate the database off it if similar symptoms occur again.
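
      As part of that monitoring, one easy signal to watch is connection saturation, since running out of connection slots was the first externally visible failure. A minimal sketch in Python – the DSN, the monitoring role and the 80% threshold are illustrative assumptions, not a description of the actual monitoring setup:

      import psycopg2

      DSN = "dbname=postgres user=monitor host=db6 connect_timeout=5"  # illustrative
      ALERT_RATIO = 0.8  # warn when 80% of connection slots are in use

      def connection_saturation():
          """Return (used, limit) – current connections vs. max_connections."""
          conn = psycopg2.connect(DSN)
          try:
              with conn.cursor() as cur:
                  cur.execute("SELECT count(*) FROM pg_stat_activity")
                  used = cur.fetchone()[0]
                  cur.execute("SHOW max_connections")
                  limit = int(cur.fetchone()[0])
              return used, limit
          finally:
              conn.close()

      if __name__ == "__main__":
          used, limit = connection_saturation()
          status = "WARNING" if used >= ALERT_RATIO * limit else "OK"
          print(f"{status}: {used}/{limit} Postgres connections in use")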

      Timeline (timestamps in UTC):
      • 10:42:26 System operating normally, postgres completes an autovacuum run. 
      • 10:42:52 First kernel warning appears in syslog.
      • 10:42:53 A queue of unprocessed pings starts to build on app servers.
      • 10:43:44 Postgres runs out of connections, starts returning "sorry, too many clients already" errors.
      • 10:45:03 I initiate a database restart (not yet aware of kernel warnings).
      • 10:51:08 The database has shut down and starts back up.
      • 10:54:06 Client requests still time out, and clients quickly exhaust the connection limit again.
      • 10:54:60 I initiate a database shutdown.
      • 11:00:22 The database has shut down, and I try to soft-reboot the server.
      • 11:03:57 Reboot is stuck, I initiate a hardware reset of the server (the equivalent of pressing the physical "Reset" button).
      • 11:04:57 First syslog message about the system booting up.
      • 11:05:02 Database starts up, and starts processing client requests. No more kernel warning messages.
      • 11:45:00 The system has caught up with notification backlog and is operating normally.

      Verifying:

      The database is back up, and we're now working through the backlog of unsent notifications. Many of these will be false notifications caused by this outage – apologies.

      Verifying:

      The services are back to normal. 

      Now working through logs and monitoring data and piecing together what exactly happened.

      Resolved:

      Resolved – all systems are operating normally.

      The outage was caused by a kernel fault on the database server. The kernel is now upgraded to a newer stable version. A more in-depth post-mortem will follow later.
  2. Jul, 2023

    All systems operational.

  3. Jun, 2023

    All systems operational.