Healthchecks.io Status

Welcome to the Healthchecks.io status page. If there are interruptions to service, we will post a report here.

Current status

Updated just now

Dashboard

Operational

Ping API

Operational

Notification Sender

Operational

System metrics

  1. Processed pings

    256 pings/s

  2. Queued incoming pings

    0 in queue

  3. Notifications sent

    14.6 notifications/min

  4. Queued outgoing notifications

    0 in queue

Past incidents

  1. Today

    All systems operational.

  2. Oct 18

    All systems operational.

  3. Oct 17

    All systems operational.

  4. Oct 16

    All systems operational.

  5. Oct 15

    All systems operational.

  6. Oct 14

    All systems operational.

  7. Oct 13

    All systems operational.

  8. Oct 12

    All systems operational.

  9. Oct 11

1. Ping processing issue (Ping API)

At 16:08 UTC, one of the app servers encountered a brief connectivity issue with the database. To recover, a worker process had to be restarted. At the time it had ~4200 incoming pings queued in memory, and these pings were lost during the restart. The app server is now back to normal.
      Post-mortem
      Some more details on the issue.

Healthchecks Ping API has a mitigation for times when the database is briefly unavailable (network outage, database restart, database failover). The mitigation is to queue incoming pings in memory and keep retrying the database connection. Thanks to this mechanism, workers can usually recover from hiccups automatically – they reconnect to the database and work through the queue.
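The mitigation described above can be sketched roughly as follows. This is not the actual Healthchecks code; the `FlakyDB` stand-in, `handle_ping`, and `drain` names are hypothetical, and a real worker would sleep between retries rather than cap attempts:

```python
import collections

class FlakyDB:
    """Stand-in database client that fails the first few inserts (hypothetical)."""
    def __init__(self, failures):
        self.failures = failures
        self.rows = []

    def insert(self, ping):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("database unavailable")
        self.rows.append(ping)

# In-memory buffer for pings that could not be written yet.
ping_queue = collections.deque()

def handle_ping(db, ping):
    """Buffer the incoming ping, then try to drain the queue."""
    ping_queue.append(ping)
    drain(db)

def drain(db, max_attempts=10):
    """Write queued pings to the database, retrying on connection errors."""
    attempts = 0
    while ping_queue and attempts < max_attempts:
        try:
            db.insert(ping_queue[0])
        except ConnectionError:
            attempts += 1  # a real worker would back off and retry indefinitely
            continue
        ping_queue.popleft()  # drop the ping only once it is safely stored
```

Because pings live only in process memory while queued, a worker restart loses them, which is exactly what happened in this incident.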

Healthchecks Ping API used to respond with "200 OK" to client HTTP requests as soon as the request was received, before it was recorded in the database. The "200 OK" response effectively meant "request received, understood, and queued".

After deploying a recent new feature, slug URLs, this changed. The Ping API now returns a response to the client only after the ping is recorded in the database. This means that during a database outage the number of open, hanging client connections can grow quickly. If the number of open connections reaches the "Max open files" limit, the worker process can no longer recover – it cannot create a new connection to reconnect to the database. This is what happened during this brief outage: there was a network glitch, the worker exhausted the "Max open files" limit (4096) and got stuck in an endless loop trying to reconnect to the database. The immediate fix was to restart the worker process. The proper fix was to increase the open files limit from 4096 to a much higher value – this is now done.
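The "Max open files" limit referred to above is the per-process RLIMIT_NOFILE value. As a small illustration (not the actual fix applied to the servers, which would typically be a systemd `LimitNOFILE=` or `/etc/security/limits.conf` setting), a process can inspect and raise its own soft limit up to the hard limit using Python's standard `resource` module:

```python
import resource

# Read the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit to the hard limit. Raising the hard limit itself
# requires privileges; raising the soft limit up to the hard limit does not.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

With the default soft limit of 4096, a worker holding thousands of hanging client sockets plus its database sockets can run out of descriptors; raising the limit leaves headroom to open the reconnection socket.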

      If you received a false alert from Healthchecks.io about one of your checks being down around 16:10 UTC – this might be why. Apologies!

Resolved: The issue is resolved.
  10. Oct 10

    All systems operational.

  11. Oct 9

    All systems operational.

  12. Oct 8

    All systems operational.

  13. Oct 7

    All systems operational.

  14. Oct 6

    All systems operational.