Healthchecks.io Estado - Historial de incidentes

Oct, 2023

False alerts during DST transition
Empezado: 29 Oct, 03:00 UTC
Duración: 1 día, 7 horas
Healthchecks.io sent a number of false "DOWN" alerts during the October 29 Daylight Saving Time transition.

The issue affected checks that use cron schedule, and were scheduled to run during the DST transition window.

The false alerts were caused by a code bug, we are working on a fix.
Resuelto: 30 Oct, 10:06 UTC
A bug fix has been implemented and deployed.

Sep, 2023

Todos los sistemas están operativos.

Ago, 2023

Database issuesNotification SenderPing APIDashboard
Empezado: 10 Ago, 11:09 UTC
Duración: 2 horas, 19 minutos
We're having database overload issues – investigating.
Post-mortem
On August 10, 2023, starting from 10:42 UTC, Healthchecks.io had a database server outage that lasted for about 30 minutes. The outage affected monitoring dashboard (it was slow or unavailable), API and ping endpoints (slow or unavailable), and the notification sending process (missed pings caused false notifications later, notifications were delayed). During the outage the database server was unstable: fluctuating ping times, messages about network interface re-initialization in syslog, kernel warning messages – a small excerpt below, timestamps in CET:
```
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 112s! [swapper/1:0]
[...]
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
```
The instability caused database transaction times to shoot up, and for queued incoming pings to pile up. When database connections appear stuck, clients are programmed to abort and open new connections. Connection retries lead to Postgres reaching the configured connection limit.

After shutting down the database, and rebooting the server (now running a newer kernel), the database started up, and after an initial period of catchup, resumed normal operation.

It is not yet clear if the instability was caused by a harware fault, by a kernel bug, or by something else. Till now, the server has been running in production for two years with no issues. Hetzner support suggested to start by upgrading to a newer kernel, which is now done. I will monitor the server and migrate the database off it if similar symptoms occur again.

Timeline (timestamps in UTC):
- 10:42:26 System operating normally, postgres completes an autovacuum run.
- 10:42:52 First kernel warning appears in syslog.
- 10:42:53 A queue of unprocessed pings starts to build on app servers.
- 10:43:44 Postgres runs out of connections, starts returning "sorry, too many clients already" errors.
- 10:45:03 I initiate a database restart (not yet aware of kernel warnings).
- 10:51:08 The database has shut down and starts back up.
- 10:54:06 Client requests still time out, and clients quickly exhaust the connection limit again.
- 10:54:60 I initiate a database shutdown.
- 11:00:22 The database has shut down, and I try to soft-reboot the server.
- 11:03:57 Reboot is stuck, I initiate a hardware reset of the server (the equivalent of pressing the physical "Reset" button).
- 11:04:57 First syslog message about the system booting up.
- 11:05:02 Database starts up, and starts processing client requests. No more kernel warning messages.
- 11:45:00 The system has caught up with notification backlog and is operating normally.
Verificando: 10 Ago, 11:19 UTC
Database is back up, we're now working through backlog of unsent notifications. Many of the notifications will be false notifications caused by this outage – apologies.
Verificando: 10 Ago, 11:49 UTC
The services are back to normal.

Now working through logs and monitoring data and piecing together what exactly happened.
Resuelto: 10 Ago, 13:28 UTC
Resolved – all systems are operating normally.

The outage was caused by a kernel fault on the database server. The kernel is now upgraded to a newer stable version. A more in-depth post-mortem will follow later.

Incidentes pasados