Healthchecks.io Status

Need help?

Email us
Welcome to the Healthchecks.io status page. If there are interruptions to service, we will post a report here, and on our Mastodon account. 

Past incidents

  1. Sep, 2023

    All systems are operational.

  2. Aug, 2023

    1. Database issues (affected: Notification Sender, Ping API, Dashboard)
      Started:
      Duration:
      We're having database overload issues – investigating.
      Post-mortem
      On August 10, 2023, starting at 10:42 UTC, Healthchecks.io had a database server outage that lasted for about 30 minutes. The outage affected the monitoring dashboard (slow or unavailable), the API and ping endpoints (slow or unavailable), and the notification sending process (missed pings caused false notifications later, and notifications were delayed). During the outage the database server was unstable: ping times fluctuated, syslog showed messages about network interface re-initialization, and the kernel logged warnings. A small excerpt is below, with timestamps in CEST (UTC+2):

      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 112s! [swapper/1:0]
      [...]
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
      Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out
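To line the excerpt up with the UTC timeline further down, the syslog timestamps can be converted with a few lines of Python. This is a minimal sketch, not part of the original report: the helper name `syslog_to_utc` is hypothetical, and the fixed UTC+2 offset is an assumption (Central European Summer Time in August, which is consistent with the UTC times in the timeline).

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server's syslog timestamps are in Central European
# Summer Time (UTC+2). Adjust the offset if the server clock differs.
LOCAL_TZ = timezone(timedelta(hours=2))


def syslog_to_utc(line: str, year: int = 2023) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line and return it in UTC."""
    stamp = " ".join(line.split()[:3])  # e.g. "Aug 10 12:51:11"
    # Classic syslog stamps carry no year, so it is supplied explicitly.
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)


line = "Aug 10 12:51:11 db6 kernel: AMD-Vi: Completion-Wait loop timed out"
utc_time = syslog_to_utc(line)
print(utc_time.strftime("%H:%M:%S UTC"))  # prints "10:51:11 UTC"
```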

      The instability caused database transaction times to shoot up and queued incoming pings to pile up. When database connections appear stuck, clients are programmed to abort and open new connections. These connection retries led to Postgres reaching its configured connection limit.
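The way retries can exhaust a connection limit can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Healthchecks.io client code: the function `retry_rounds_until_limit`, the client count, and the limit are all made-up illustration values. The model assumes that a stuck server does not release abandoned connections promptly, so each round of client retries adds to the number of connections the server is holding.

```python
def retry_rounds_until_limit(clients: int, max_connections: int) -> int:
    """Toy model of connection exhaustion.

    Every round, each client abandons its stuck connection and opens a new
    one, while the overloaded server still holds the abandoned connection.
    Returns how many retry rounds succeed before the next round would
    exceed max_connections.
    """
    held_by_server = clients  # one connection per client at the start
    rounds = 0
    while held_by_server + clients <= max_connections:
        held_by_server += clients  # abandoned connections linger server-side
        rounds += 1
    return rounds


# Hypothetical numbers: 20 clients against a limit of 100 connections.
print(retry_rounds_until_limit(clients=20, max_connections=100))  # prints 4
```

In this model the limit is reached after only a handful of retry rounds, which matches the general pattern described above: retry logic that is harmless in normal operation can amplify load when the server is already struggling.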

      After I shut down the database and rebooted the server (now running a newer kernel), the database started up and, after an initial catch-up period, resumed normal operation.

      It is not yet clear whether the instability was caused by a hardware fault, a kernel bug, or something else. Until now, the server had been running in production for two years with no issues. Hetzner support suggested starting with an upgrade to a newer kernel, which is now done. I will monitor the server and migrate the database off it if similar symptoms occur again.

      Timeline (timestamps in UTC):
      • 10:42:26 System operating normally, Postgres completes an autovacuum run.
      • 10:42:52 First kernel warning appears in syslog.
      • 10:42:53 A queue of unprocessed pings starts to build on app servers.
      • 10:43:44 Postgres runs out of connections, starts returning "sorry, too many clients already" errors.
      • 10:45:03 I initiate a database restart (not yet aware of kernel warnings).
      • 10:51:08 The database has shut down and starts back up.
      • 10:54:06 Client requests still time out, and clients quickly exhaust the connection limit again.
      • 10:55:00 I initiate a database shutdown.
      • 11:00:22 The database has shut down, and I try to soft-reboot the server.
      • 11:03:57 Reboot is stuck, I initiate a hardware reset of the server (the equivalent of pressing the physical "Reset" button).
      • 11:04:57 First syslog message about the system booting up.
      • 11:05:02 Database starts up, and starts processing client requests. No more kernel warning messages.
      • 11:45:00 The system has caught up with notification backlog and is operating normally.
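The intervals between timeline entries can be worked out with a short script. This is a minimal sketch for checking the arithmetic; the helper name `elapsed` is hypothetical, and it assumes both timestamps fall on the same day, as they do in the timeline above.

```python
from datetime import datetime


def elapsed(start: str, end: str) -> str:
    """Return the time between two same-day HH:MM:SS stamps as 'Xm YYs'."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds:02d}s"


# First kernel warning -> Postgres runs out of connections
print(elapsed("10:42:52", "10:43:44"))  # prints "0m 52s"
# First kernel warning -> database processing requests again after the reset
print(elapsed("10:42:52", "11:05:02"))  # prints "22m 10s"
# Database back up -> notification backlog cleared
print(elapsed("11:05:02", "11:45:00"))  # prints "39m 58s"
```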

      Verifying:

      The database is back up, and we're now working through the backlog of unsent notifications. Many of the notifications will be false notifications caused by this outage – apologies.

      Verifying:

      The services are back to normal. 

      Now working through logs and monitoring data and piecing together what exactly happened.

      Resolved:

      Resolved – all systems are operating normally.

      The outage was caused by a kernel fault on the database server. The kernel is now upgraded to a newer stable version. A more in-depth post-mortem will follow later.
  3. Jul, 2023

    All systems are operational.