Welcome to the Healthchecks.io status page. If there are interruptions to service, we will post a report here, and on our Mastodon account.
Past incidents
Oct, 2021
- Post-mortem: Ping processing issue (Ping API). Some more details on the issue:
The Healthchecks Ping API has a mitigation for times when the database is briefly unavailable (network outage, database restart, database failover): it queues incoming pings in memory and keeps retrying to connect to the database. Thanks to this mechanism, workers can usually recover from hiccups automatically – they reconnect to the database and work through the queue.
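To make the queue-and-retry idea concrete, here is a minimal sketch in Python. It is not the actual Ping API code: the `PingBuffer` class, the `write_ping` callable, and the `retry_delay` and `max_attempts` parameters are all hypothetical names chosen for illustration, and `write_ping` is assumed to raise `ConnectionError` when the database is unreachable.

```python
import collections
import time


class PingBuffer:
    """Hold pings in memory until they can be written to the database."""

    def __init__(self, write_ping, retry_delay=1.0, sleep=time.sleep):
        # write_ping is a hypothetical callable that stores one ping and
        # raises ConnectionError while the database is unavailable.
        self.write_ping = write_ping
        self.retry_delay = retry_delay
        self.sleep = sleep
        self.queue = collections.deque()

    def submit(self, ping):
        """Accept a ping; it stays queued until it has been written."""
        self.queue.append(ping)

    def drain(self, max_attempts=None):
        """Write queued pings in order, retrying after connection errors.

        Returns True once the queue is empty, or False if max_attempts
        failures happen first (None means keep retrying forever).
        """
        failures = 0
        while self.queue:
            try:
                self.write_ping(self.queue[0])
            except ConnectionError:
                failures += 1
                if max_attempts is not None and failures >= max_attempts:
                    return False
                self.sleep(self.retry_delay)
                continue
            # Remove the ping only after a successful write.
            self.queue.popleft()
        return True
```

Because a ping leaves the queue only after a successful write, a failed attempt loses nothing, and pings are written in the order they arrived once the database is reachable again.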
The Healthchecks Ping API used to respond with "200 OK" to client HTTP requests as soon as the request was received, before it was recorded to the database. The "200 OK" response effectively meant "request received, understood and queued".
After we deployed a recent feature, slug URLs, this changed. The Ping API now returns a response to the client only after the ping is recorded to the database. This means that during a database outage the number of open, hanging client connections can grow quickly. If the number of open connections reaches the "Max open files" limit, the worker process can no longer recover: it cannot open one more connection to reconnect to the database. This is what happened during this brief outage: there was a network glitch, the worker exhausted the "Max open files" limit (4096) and got stuck in an endless loop trying to reconnect to the database. The immediate "fix" was to restart the worker process. The proper fix was to increase the open files limit from 4096 to a much higher value, which is now done.
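For readers who want to inspect or raise the same limit in their own services, the sketch below uses Python's standard `resource` module (Unix only). It is an illustration, not a description of how the limit was raised here: the post-mortem does not say which mechanism was used, and the function name `raise_open_files_limit` is hypothetical.

```python
import resource


def raise_open_files_limit(target):
    """Raise this process's soft "Max open files" limit toward target.

    An unprivileged process may raise its soft limit only as far as the
    hard limit, so the request is capped there. The limit is never
    lowered. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard == resource.RLIM_INFINITY:
        wanted = target
    else:
        wanted = min(target, hard)
    if wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

A process-level call like this only helps up to the hard limit. Raising the hard limit itself is normally done outside the process, for example in the service manager's configuration.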
If you received a false alert from Healthchecks.io about one of your checks being down around 16:10 UTC – this might be why. Apologies!