All systems operational.
- Major stability problems (Notification Sender, Ping API, Dashboard): Major stability problems – investigating.
Post-mortem: Here's a quick recap of this outage.
Yesterday, Hetzner datacenters in Falkenstein were hit by a large DDoS attack. As a mitigation, Hetzner throttled UDP traffic on ports 9000 and above.
Healthchecks.io uses Wireguard for private communication between servers (load balancers to web servers, web servers to database servers). Wireguard works over UDP, and, after the throttling started, the available bandwidth between servers dropped to below 1Mbit/s.
After figuring out what had happened, I updated the Wireguard configuration to use a port number below 9000. After I deployed the change, Healthchecks resumed normal operation.
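As a rough illustration of the port change described above, here is a minimal Python sketch that rewrites the `ListenPort` value in the `[Interface]` section of a Wireguard configuration file. The helper name `set_listen_port`, the sample configuration, the old port 51820, and the new port 8999 are all assumptions made for illustration; the post-mortem does not state which ports or tooling were actually used.

```python
import re

# Hypothetical sample config; keys, addresses and ports are placeholders.
EXAMPLE_CONFIG = """\
[Interface]
PrivateKey = <redacted>
ListenPort = 51820

[Peer]
PublicKey = <redacted>
AllowedIPs = 10.0.0.2/32
Endpoint = 192.0.2.10:51820
"""

LISTEN_PORT_RE = re.compile(r"listenport\s*=", re.IGNORECASE)


def set_listen_port(config_text: str, new_port: int) -> str:
    """Return config_text with the [Interface] ListenPort set to new_port."""
    if not 1 <= new_port <= 65535:
        raise ValueError("port out of range")
    out = []
    section = None
    replaced = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track which section we are in; only [Interface] is modified.
            section = stripped[1:-1].strip().lower()
        elif section == "interface" and LISTEN_PORT_RE.match(stripped):
            line = f"ListenPort = {new_port}"
            replaced = True
        out.append(line)
    if not replaced:
        raise ValueError("no ListenPort found in [Interface] section")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # 8999 is an arbitrary example of a port below the throttled range.
    print(set_listen_port(EXAMPLE_CONFIG, 8999))
```

Note that this sketch only changes the listening side. Any peer that dials this server through an `Endpoint = host:port` line would need the same new port, and the interface has to be reloaded before the change takes effect.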
The outage lasted almost 2 hours. During the outage, the ping API was accepting and processing some but not all pings. The web UI and the notification sender were completely non-operational. When normal operation resumed, Healthchecks sent out a wave of false alerts due to pings that were not received on time.
This was an unfortunate event, and I apologize for the trouble caused by the failing pings, the non-operational management API, and the eventual false alerts. Still, there are several positive aspects, in the "it could have been worse" sense, that I would like to acknowledge:
- TCP was still working. I could access the servers over SSH the whole time, so I had at least some control over the situation.
- The Wireguard port change worked as a workaround. Without it, the outage would have continued for several more hours.
- The primary database server got a long-overdue reboot, and is now running a newer kernel.
- When the problem hit, I was at home, awake, and able to respond immediately.
PS. If you notice any lingering issues, or have any suggestions or questions, please let me know at firstname.lastname@example.org. Thank you!
- Ping body processing backlog (Ping API): Our object storage provider is experiencing degraded performance, and there is currently a backlog of ping bodies not yet uploaded to object storage. No ping bodies are lost; they will be available for viewing and download eventually.
- Issues with outgoing emails (Notification Sender): Unfortunately, our SMTP provider is having another outage. Message from them: "Our server has a temporary delivery issue. Our developers are aware of them and they are working on a solution."
- Some messages to Signal users failing (Notification Sender): We're currently unable to send Signal messages to some Signal users (first-time messages to users we have never messaged before). The error messages from Signal indicate a rate-limiting issue. Previously, we could temporarily work around the rate limiting by manually solving a CAPTCHA, but currently the issue persists even after solving a CAPTCHA.
- Issues with outgoing emails (Notification Sender): Our SMTP relay provider is experiencing issues; we are looking into it.
- Issues with outgoing email (Notification Sender): We're seeing issues with outgoing email and are looking into it.