On August 10, 2023, starting at 10:42 UTC, Healthchecks.io had a database server outage that lasted about 30 minutes. The outage affected the monitoring dashboard (slow or unavailable), the API and ping endpoints (slow or unavailable), and the notification sending process (missed pings caused false alerts later, and notifications were delayed). During the outage the database server was unstable: fluctuating ping times, messages about network interface re-initialization in syslog, and kernel warning messages. A small excerpt is below, timestamps in CEST (UTC+2):
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 112s! [swapper/1:0]
    [...]
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
    Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
The instability caused database transaction times to shoot up and queued incoming pings to pile up. When database connections appear stuck, the clients are programmed to abort them and open new connections. These connection retries quickly drove Postgres to its configured connection limit.
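To illustrate the failure mode, here is a minimal Python sketch of that retry pattern. It is not Healthchecks' actual client code; the psycopg2 usage, DSN, and timeout values are illustrative assumptions.

    import psycopg2

    # Illustrative DSN; connect_timeout bounds how long a worker waits
    # for a new connection to be established.
    DSN = "dbname=hc user=hc host=db6 connect_timeout=5"

    def run_with_retry(sql, params=(), attempts=3):
        last_error = None
        for _ in range(attempts):
            conn = None
            try:
                conn = psycopg2.connect(DSN)
                conn.autocommit = True
                with conn.cursor() as cur:
                    # Abort the statement server-side if it appears stuck.
                    cur.execute("SET statement_timeout = '5s'")
                    cur.execute(sql, params)
                    return cur.fetchall()
            except psycopg2.OperationalError as exc:
                # Stuck or failed connection: drop it and retry on a new one.
                # Each retry occupies another server slot, so a stalled server
                # quickly drives Postgres into its max_connections limit.
                last_error = exc
            finally:
                if conn is not None:
                    conn.close()
        raise last_error

With many workers running a loop like this against a stalled server, each retry claims a fresh server slot before the previous one is released, which is why the timeline below shows Postgres hitting its connection limit within a minute of the first kernel warning.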
After the database was shut down and the server rebooted (now running a newer kernel), the database started up and, after an initial catch-up period, resumed normal operation.
It is not yet clear whether the instability was caused by a hardware fault, a kernel bug, or something else. Until now, the server had been running in production for two years with no issues. Hetzner support suggested starting with an upgrade to a newer kernel, which is now done. I will monitor the server and migrate the database off it if similar symptoms occur again.
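As an illustration of that follow-up monitoring, here is a minimal sketch (not part of the Healthchecks codebase) that scans recent kernel messages for the symptoms seen during the outage and, only if the log is clean, pings a Healthchecks check URL. The check UUID is a placeholder, and the script assumes journalctl is available on the database host.

    import subprocess
    import urllib.request

    # Placeholder ping URL; a real deployment would use its own check's UUID.
    CHECK_URL = "https://hc-ping.com/REPLACE-WITH-CHECK-UUID"
    # Symptoms observed during the outage (see the syslog excerpt above).
    BAD_PATTERNS = ("AMD-Vi: Completion-Wait loop timed out", "soft lockup")

    def kernel_log_is_clean() -> bool:
        # Kernel messages from the last five minutes, message text only.
        out = subprocess.run(
            ["journalctl", "-k", "--since=-5min", "-o", "cat"],
            capture_output=True, text=True, check=True,
        ).stdout
        return not any(pattern in out for pattern in BAD_PATTERNS)

    if __name__ == "__main__":
        if kernel_log_is_clean():
            # Ping only when the log is clean; a missed ping then raises
            # an alert for this check.
            urllib.request.urlopen(CHECK_URL, timeout=10)

If the warning messages reappear, the script stops pinging, the corresponding check goes into the "down" state, and a notification is sent.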
Timeline (timestamps in UTC):
- 10:42:26 System operating normally, Postgres completes an autovacuum run.
- 10:42:52 First kernel warning appears in syslog.
- 10:42:53 A queue of unprocessed pings starts to build on app servers.
- 10:43:44 Postgres runs out of connections, starts returning "sorry, too many clients already" errors.
- 10:45:03 I initiate a database restart (not yet aware of kernel warnings).
- 10:51:08 The database has shut down and starts back up.
- 10:54:06 Client requests still time out, and clients quickly exhaust the connection limit again.
- 10:54:60 I initiate a database shutdown.
- 11:00:22 The database has shut down, and I try to soft-reboot the server.
- 11:03:57 Reboot is stuck, I initiate a hardware reset of the server (the equivalent of pressing the physical "Reset" button).
- 11:04:57 First syslog message about the system booting up.
- 11:05:02 Database starts up, and starts processing client requests. No more kernel warning messages.
- 11:45:00 The system has caught up with notification backlog and is operating normally.