Healthchecks.io Status - Incident history

Welcome to the Healthchecks.io status page. If there are interruptions to service, we will post a report here, and on our Mastodon account.

Past incidents

Dec, 2021
1. Degraded performance due to database overload (Notification Sender, Ping API, Dashboard)
  Started: 16 Dec, 05:30 UTC
  Duration: 14 hours, 35 minutes
  At 5:30 UTC, our primary database became overloaded, causing slow HTTP response times and notification delays.
  Investigating: 16 Dec, 06:35 UTC
  The database is currently back to normal. The overload seems to have been triggered by a scheduled nightly database backup, but we are still investigating.
  Resolved: 16 Dec, 20:05 UTC
  The issue is resolved.
2. Network connectivity problems (Ping API, Dashboard)
  Started: 15 Dec, 20:28 UTC
  Duration: 11 hours, 40 minutes
  Our hosting provider is currently performing backbone connection maintenance. This is causing intermittent latency spikes and connection timeouts.
  
  https://status.hetzner.com/incident/b90d7054-17af-4eae-a861-7ea9bfc489db
  Identified: 15 Dec, 21:31 UTC
  The network latency seems to have stabilized but is still higher than usual, and the network maintenance is still underway. The posted maintenance window is 2021-12-15 20:00 UTC+0 – 2021-12-16 15:00 UTC+0.
  Resolved: 16 Dec, 08:08 UTC
  The hosting provider has completed the scheduled maintenance (https://status.hetzner.com/incident/b90d7054-17af-4eae-a861-7ea9bfc489db).
  
  The network latency between Frankfurt and Falkenstein is back to normal levels (5ms ICMP ping, 10ms for a complete HTTP request).
Nov, 2021
1. Network issues (Notification Sender, Ping API, Dashboard)
  Started: 5 Nov, 12:27 UTC
  Duration: 42 minutes
  We are experiencing network connectivity issues – several brief periods of 100% packet loss between specific servers.
  
  There's a network incident at our hosting provider, likely related: https://status.hetzner.com/incident/d1e59188-9843-4801-a01e-a7ba5fd75940
  Resolved: 5 Nov, 13:09 UTC
  The issue is resolved.
Oct, 2021
1. Ping processing issue (Ping API)
  Started: 11 Oct, 16:08 UTC
  Duration: 34 minutes
  At 16:08 UTC, one of the app servers encountered a brief connectivity issue to the database. To recover, a worker process had to be restarted. At the time, it had ~4200 incoming pings queued in memory, and these pings were lost during the restart. The app server is now back to normal.
  Post-mortem
  Some more details on the issue.
  
  Healthchecks Ping API has a mitigation for times when the database is briefly unavailable (network outage, database restart, database failover). The mitigation is to queue incoming pings in memory, and keep retrying to connect to the database. Thanks to this mechanism, workers can usually recover from hiccups automatically – they reconnect to the database and work through the queue.
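
  The queue-and-retry mitigation described above can be sketched roughly as follows. This is an illustration only, not the actual Healthchecks Ping API code: the `PingBuffer` class, the `store` callable, and the retry parameters are all hypothetical names chosen for this example.

```python
import collections
import time


class PingBuffer:
    """Holds pings in memory while the database is unreachable (hypothetical sketch)."""

    def __init__(self, store, retry_delay=1.0, sleep=time.sleep):
        self.store = store              # assumed callable that writes one ping to the database
        self.retry_delay = retry_delay  # seconds to wait between reconnect attempts
        self.sleep = sleep              # injectable so the sketch can run without real delays
        self.queue = collections.deque()

    def submit(self, ping):
        # Accept the ping immediately; it is written to the database later.
        self.queue.append(ping)

    def flush(self, max_attempts=None):
        """Write queued pings in order. On a connection error, wait and retry."""
        attempts = 0
        while self.queue:
            try:
                self.store(self.queue[0])
            except ConnectionError:
                attempts += 1
                if max_attempts is not None and attempts >= max_attempts:
                    return False        # give up; pings stay queued in memory
                self.sleep(self.retry_delay)
                continue
            self.queue.popleft()        # remove a ping only after it was stored
        return True
```

  Note that anything still in the queue is lost if the process restarts, which is how the ~4200 queued pings were lost in this incident.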
  
  Healthchecks Ping API used to respond with "200 OK" to client HTTP requests as soon as the request was received, before it was recorded to the database. The "200 OK" response effectively meant "request received, understood and queued".
  
  After we deployed a recent feature, slug URLs, this changed. Ping API now returns a response to the client only after the ping is recorded to the database. This means that during a database outage, the number of open, hanging client connections can grow quickly. If the number of open connections reaches the "Max open files" limit, the worker process can no longer recover: it cannot open a new connection to reconnect to the database. This is what happened during this brief outage: there was a network glitch, the worker exhausted the "Max open files" limit (4096) and got stuck in an endless loop trying to reconnect to the database. The immediate "fix" was to restart the worker process. The proper fix was to increase the open files limit from 4096 to a much higher value; this is now done.
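
  For readers who want to check or raise the same limit for their own worker processes, here is a minimal sketch using Python's standard `resource` module (Unix only). It is not how the limit was changed on our servers (that is typically done in the service manager or system configuration), and the `raise_open_files_limit` helper is a hypothetical name for this example.

```python
import resource


def raise_open_files_limit(target):
    """Raise the soft "Max open files" limit towards target; never lower it.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

  Calling `raise_open_files_limit(65536)` early in process startup requests a soft limit of 65536, capped by the hard limit. The value 65536 is only an example, not the value we chose.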
  
  If you received a false alert from Healthchecks.io about one of your checks being down around 16:10 UTC – this might be why. Apologies!
  Resolved: 11 Oct, 16:42 UTC
  The issue is resolved.