Healthchecks.io Status - Incident History

Welcome to the Healthchecks.io status page. If there are interruptions to service, we will post a report here, and on our Mastodon account.

Past Incidents

April 2022

March 2022

February 2022

April 2022
1. Major stability problems (Notification Sender, Ping API, Dashboard)
  Started: 28 April, 20:30 UTC
  Duration: 9 hours, 21 minutes
  Major stability problems – investigating
  Post-mortem
  Here's a quick recap of this outage.
  
  Yesterday, Hetzner datacenters in Falkenstein were hit by a large DDoS attack. As a mitigation, Hetzner throttled UDP traffic on ports 9000 and above.
  
  Healthchecks.io uses Wireguard for private communication between servers (load balancers to web servers, web servers to database servers). Wireguard works over UDP, and, after the throttling started, the available bandwidth between servers dropped to below 1Mbit/s.
  
  After figuring out what had happened, I updated the Wireguard configuration to use a port number below 9000. After deploying the change, Healthchecks resumed normal operation.
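
A minimal sketch of the port check behind this workaround. The 9000 threshold comes from the description above; the port numbers 51820 and 8172 and the config contents are hypothetical, since the actual values used by Healthchecks.io are not stated.

```python
import re

# From the post-mortem: UDP traffic on ports 9000 and above was throttled.
THROTTLED_FROM = 9000


def listen_port(config_text):
    """Return the ListenPort value from a wg-quick style config, or None."""
    match = re.search(r"^\s*ListenPort\s*=\s*(\d+)\s*$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_throttled(port, threshold=THROTTLED_FROM):
    """True if the port falls in the throttled range (threshold and above)."""
    return port >= threshold


# Hypothetical configs for illustration only.
BEFORE = """[Interface]
PrivateKey = <redacted>
ListenPort = 51820
"""
AFTER = BEFORE.replace("ListenPort = 51820", "ListenPort = 8172")

for label, text in (("before", BEFORE), ("after", AFTER)):
    port = listen_port(text)
    print(label, port, "throttled" if is_throttled(port) else "ok")
```

In a real deployment, the `Endpoint` entries in each peer's configuration would also need to point at the new port, so a change like this has to be rolled out on every server.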
  
  The outage lasted almost 2 hours. During the outage, the ping API was accepting and processing some but not all pings. The web UI and the notification sender were completely non-operational. When normal operation resumed, Healthchecks sent out a wave of false alerts due to pings that were not received on time.
  
  This was an unfortunate event, and I apologize for the trouble caused by failing pings, the non-operational management API, and the eventual false alerts. Still, there are several positive aspects, in the "it could have been worse" sense, that I would like to acknowledge:
  
  TCP was still working. I could access the servers over SSH the whole time, so I had at least some control over the situation.
  The Wireguard port change worked as a workaround. Without it, the outage would have continued several more hours.
  The primary database server got a long overdue reboot, and is now running a newer kernel.
  When the problem hit, I was at home, awake, and able to respond immediately.
  
  PS. If you notice any lingering issues, or have any suggestions or questions, please let me know at contact@healthchecks.io. Thank you!
  
  –Pēteris
  Investigating: 28 April, 21:56 UTC
  We're still experiencing major issues. Ping handler is working somewhat, web dashboard is down.
  The root issue is a slowdown of UDP traffic between servers in the datacenter.
  
  Status update from Hetzner: https://status.hetzner.com/incident/129728ce-ba25-49b6-96cc-aafcd39ab0b7
  
  Monitoring: 28 April, 22:23 UTC
  Updated the Wireguard configuration to use a port number below 9000. Service is back online, and we're hopefully back on track.
  Resolved: 29 April, 05:52 UTC
  The issue is resolved.
2. Ping body processing backlog (Ping API)
  Started: 28 April, 11:17 UTC
  Duration: 3 hours, 20 minutes
  Our object storage provider is experiencing degraded performance, and there is currently a backlog of ping bodies not yet uploaded to object storage. No ping bodies are lost; they will be available for viewing and download eventually.
  Resolved: 28 April, 14:37 UTC
  The issue is resolved.
3. Issues with outgoing emails (Notification Sender)
  Started: 14 April, 16:03 UTC
  Duration: 2 days, 16 hours
  Unfortunately our SMTP provider is having another outage. Message from them: "Our server has a temporary delivery issue. Our developers are aware of them and they are working on a solution."
  Monitoring: 14 April, 19:28 UTC
  The SMTP service seems to be back up and more or less caught up with the backlog. The sending delay is still higher than usual.
  Resolved: 17 April, 08:04 UTC
  The issue is resolved.
4. Some messages to Signal users failing (Notification Sender)
  Started: 11 April, 04:44 UTC
  Duration: 6 hours, 35 minutes
  We're currently unable to send Signal messages to some Signal users (first-time messages to users to whom we have never sent messages before). The error messages from Signal indicate a rate-limiting issue. Previously, we could temporarily work around the rate-limiting by manually solving a CAPTCHA, but currently the issue persists even after solving a CAPTCHA.
  Identified: 11 April, 07:04 UTC
  The Signal servers apply rate-limits per sender account, and, obviously in retrospect, also per sender IP. If server A hits a rate-limit, and we submit a CAPTCHA solution from server B, it will not work. We're making changes on our side to take that into account.
  Resolved: 11 April, 11:19 UTC
  The issue is resolved.
5. Issues with outgoing emails (Notification Sender)
  Started: 5 April, 08:30 UTC
  Duration: 35 minutes
  Our SMTP relay provider is experiencing issues; we're looking into it.
  Resolved: 5 April, 09:06 UTC
  The issue is resolved.
6. Issues with outgoing email (Notification Sender)
  Started: 4 April, 11:43 UTC
  Duration: 1 hour, 40 minutes
  We're seeing issues with outgoing email and are looking into it.
  Identified: 4 April, 12:41 UTC
  Received an update from the SMTP relay provider's support: "Our developers are currently performing a demanding system deploy which may be influencing these SSL errors."
  Resolved: 4 April, 13:24 UTC
  Received an update from the SMTP relay provider (Elastic Email) – the email delivery issue has been resolved.
  
  During the incident, email delivery was failing intermittently: some connections to the SMTP relay went through, and some failed. In total, 363 send attempts failed. Although Healthchecks retries failed email deliveries, the retry window is short, so some email messages were unfortunately lost during the outage.
  
  I will try to get more details about the outage from Elastic Email, and will investigate fallback options.
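
One possible shape for such a fallback, as a rough sketch only: retry each relay with exponential backoff, then move on to the next relay. This is not the Healthchecks implementation. The function name `send_with_fallback`, its parameters, and the idea of passing relays as plain callables are all assumptions made for illustration.

```python
import time


def send_with_fallback(message, relays, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try each relay in order, retrying with exponential backoff.

    `relays` is a list of callables that take the message and raise an
    exception on failure (hypothetical interface). Returns the index of
    the relay that accepted the message.
    """
    errors = []
    for index, relay in enumerate(relays):
        for attempt in range(attempts):
            try:
                relay(message)
                return index
            except Exception as exc:  # broad on purpose: this is only a sketch
                errors.append(exc)
                # Back off before retrying the same relay; skip the wait
                # after the last attempt and move on to the next relay.
                if attempt < attempts - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all relays failed after {len(errors)} attempts")
```

With `attempts=3` and `base_delay=1.0`, a relay is given up on after waits of 1 and 2 seconds. A longer retry window (more attempts or a larger base delay) trades delivery latency for fewer lost messages during an intermittent outage like this one.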
March 2022
All systems operational.
February 2022
1. Unreliable Signal notification delivery (Notification Sender)
  Started: 18 February, 07:04 UTC
  Duration: 3 days
  We've found a problem with Signal notification delivery: some messages don't get delivered, and trigger a "signal-cli call timed out" error message in the web UI. We are working on a fix.
  Monitoring: 20 February, 17:59 UTC
  Upgraded to the just-released signal-cli version (0.10.4), which fixes the reliability problem.
  Also added logging and alerting on our side to catch similar issues sooner in the future.
  Resolved: 21 February, 07:18 UTC
  The issue is resolved.