Network issuesNotification SenderPing APIDashboardStarted: Varte:We are experiencing network connectivity issues – several brief periods of 100% packet loss between specific servers.
There's a network incident at our hosting provider, likely related: https://status.hetzner.com/incident/d1e59188-9843-4801-a01e-a7ba5fd75940
Ping processing issuePing APIStarted: Varte:At 16:08 UTC, one of the app servers encountered a brief connectivity issue to the database. To recover, a worker process had to be restarted. At the time it had ~4200 incoming pings queued in the memory, and these pings were lost during restart. The app server is now back to normal.Post mortemSome more details on the issue.
Healthchecks Ping API has a mitigation for times when the database is briefly unavailable (network outage, database restart, database failover). The mitigation is to queue incoming pings in memory, and keep retrying to connect to the database. Thanks to this mechanism, workers can usually recover from hiccups automatically – they reconnect to the database and work through the queue.
Healthchecks Ping API used to respond with "200 OK" to client HTTP requests as soon a the request was received, before it was recorded to the database. The "200 OK" response effectively meant "request received, understood and queued".
After deploying a recent new feature, slug URLs, this changed. Ping API now returns a response to the client only after the ping is recorded to the database. This means that during a database outage the number of open and hanging client connections can grow quickly. If the number of open connections reaches the "Max open files" limit, the worker process cannot recover any more – it cannot create another new connection to reconnect to the database. This is what happened during this brief outage: there was a network glitch, the worker exhausted the "Max open files" limit (4096) and got stuck in the endless loop trying to reconnect to the database. The immediate "fix" was to restart the worker process. And the proper fix was to increase the open files limit from 4096 to a much higher value – this is done now.
If you received a false alert from Healthchecks.io about one of your checks being down around 16:10 UTC – this might be why. Apologies!
Alle systemer i drift.