Healthchecks.io Status - Incident history

Welcome to the Healthchecks.io status page. If there are interruptions to service, we will post a report here, and on our Mastodon account.

Past incidents

Nov, 2021
1. Network issues (Notification Sender, Ping API, Dashboard)
  Started: 5 Nov, 12:27 UTC
  Duration: 42 minutes
  We are experiencing network connectivity issues – several brief periods of 100% packet loss between specific servers.
  
  There's a network incident at our hosting provider, likely related: https://status.hetzner.com/incident/d1e59188-9843-4801-a01e-a7ba5fd75940
  Resolved: 5 Nov, 13:09 UTC
  The issue is resolved.
Oct, 2021
1. Ping processing issue (Ping API)
  Started: 11 Oct, 16:08 UTC
  Duration: 34 minutes
  At 16:08 UTC, one of the app servers briefly lost connectivity to the database. To recover, a worker process had to be restarted. At the time it had ~4200 incoming pings queued in memory, and these pings were lost during the restart. The app server is now back to normal.
  Post-mortem
  Some more details on the issue.
  
  Healthchecks Ping API has a mitigation for times when the database is briefly unavailable (network outage, database restart, database failover). The mitigation is to queue incoming pings in memory, and keep retrying to connect to the database. Thanks to this mechanism, workers can usually recover from hiccups automatically – they reconnect to the database and work through the queue.
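
The queue-and-retry mitigation described above can be illustrated with a minimal sketch. This is not the actual Ping API code: `PingBuffer`, `DatabaseUnavailable` and the `write_to_db` callable are hypothetical names chosen for the example, and the real service may be structured quite differently.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Hypothetical error raised when the database cannot be reached."""


class PingBuffer:
    """Hold pings in memory while the database is down; flush once it is back."""

    def __init__(self, write_to_db):
        # write_to_db is a hypothetical callable that stores a single ping
        # and raises DatabaseUnavailable on a connection failure.
        self.write_to_db = write_to_db
        self.queue = deque()

    def submit(self, ping):
        """Queue a ping, then try to drain the queue. Returns pings written."""
        self.queue.append(ping)
        return self.flush()

    def flush(self):
        """Write queued pings in arrival order; stop at the first failure."""
        written = 0
        while self.queue:
            try:
                self.write_to_db(self.queue[0])
            except DatabaseUnavailable:
                break  # keep the ping queued and retry on the next flush
            self.queue.popleft()
            written += 1
        return written
```

Because the queue lives only in process memory, anything still in it is lost if the process is restarted, which is what happened to the ~4200 queued pings in this incident.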
  
  Healthchecks Ping API used to respond with "200 OK" to client HTTP requests as soon as the request was received, before it was recorded to the database. The "200 OK" response effectively meant "request received, understood and queued".
  
  This changed after we deployed a recent feature, slug URLs. Ping API now returns a response to the client only after the ping is recorded to the database. This means that during a database outage the number of open, hanging client connections can grow quickly. If the number of open connections reaches the "Max open files" limit, the worker process cannot recover any more: it cannot open the new connection it needs to reconnect to the database. This is what happened during this brief outage: there was a network glitch, the worker exhausted the "Max open files" limit (4096) and got stuck in an endless loop trying to reconnect to the database. The immediate "fix" was to restart the worker process. The proper fix was to increase the open files limit from 4096 to a much higher value, and this is now done.
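
As an illustration of the "Max open files" limit discussed above, here is a small sketch of how a Unix process can inspect and raise its own limit using Python's standard `resource` module. The helper names are made up for this example, and this is not necessarily how the limit was raised in production (it may well have been done in the service manager or system configuration instead).

```python
import resource


def open_files_limit():
    """Return the (soft, hard) "Max open files" limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward `target`, never above the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed hard
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Every client connection and every database connection consumes one file descriptor, so a worker holding 4096 hanging client connections has none left for its database reconnect attempt. Calling `open_files_limit()` at startup and logging the result is a cheap way to confirm which limit a worker actually runs with.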
  
  If you received a false alert from Healthchecks.io about one of your checks being down around 16:10 UTC – this might be why. Apologies!
  Resolved: 11 Oct, 16:42 UTC
  The issue is resolved.
Sep, 2021
All systems operational.