On May 20, our primary database server experienced packet loss and latency issues. To ensure normal operation of the service, the database was failed-over to a hot standby. The fail-over went fine, aside from a minor issue with one fat-fingered firewall rule. Here are additional details about the network issue and the fail-over process.
The symptoms: network latency and packet loss on the database host shoots up for 1-2 minutes, then everything goes back to normal. This had occurred a few times in the past weeks already, causing ping processing delays and subsequent "X is down" alerts each time.
This issue has been hard to troubleshoot because it seemed to happen at random times, and lasted only for a couple minutes each time. I was running a few monitoring tools continuously: Netdata agent including the fping plugin, Opsdash agent, and mtr in a loop logging to a text file. I could also inspect application logs for clues.
To illustrate the elevated latency, here's how pinging 188.8.131.52 from the host looks normally:
$ ping 184.108.40.206
PING 220.127.116.11 (18.104.22.168) 56(84) bytes of data.
64 bytes from 22.214.171.124: icmp_seq=1 ttl=57 time=4.93 ms
64 bytes from 126.96.36.199: icmp_seq=2 ttl=57 time=4.96 ms
64 bytes from 188.8.131.52: icmp_seq=3 ttl=57 time=4.94 ms
64 bytes from 184.108.40.206: icmp_seq=4 ttl=57 time=4.95 ms
And here's the same command during the problematic 2-minute window:
$ ping 220.127.116.11
PING 18.104.22.168 (22.214.171.124) 56(84) bytes of data.
64 bytes from 126.96.36.199: icmp_seq=1 ttl=57 time=104 ms
64 bytes from 188.8.131.52: icmp_seq=3 ttl=57 time=104 ms
64 bytes from 184.108.40.206: icmp_seq=4 ttl=57 time=102 ms
64 bytes from 220.127.116.11: icmp_seq=5 ttl=57 time=104 ms
64 bytes from 18.104.22.168: icmp_seq=6 ttl=57 time=104 ms
64 bytes from 22.214.171.124: icmp_seq=7 ttl=57 time=104 ms
64 bytes from 126.96.36.199: icmp_seq=8 ttl=57 time=104 ms
64 bytes from 188.8.131.52: icmp_seq=9 ttl=57 time=100 ms
64 bytes from 184.108.40.206: icmp_seq=11 ttl=57 time=104 ms
64 bytes from 220.127.116.11: icmp_seq=12 ttl=57 time=104 ms
64 bytes from 18.104.22.168: icmp_seq=13 ttl=57 time=104 ms
64 bytes from 22.214.171.124: icmp_seq=14 ttl=57 time=104 ms
64 bytes from 126.96.36.199: icmp_seq=16 ttl=57 time=104 ms
--- 188.8.131.52 ping statistics ---
16 packets transmitted, 13 received, 18% packet loss, time 14996ms
rtt min/avg/max/mdev = 100.713/104.111/104.729/1.207 ms
When the latency from app servers and the database suddenly goes from 0.5ms to 100ms, there are going to be issues. The most pressing issue was the processing of incoming pings. On app servers, each received ping is put in a queue. A single worker process (a goroutine, to be exact) takes items from the queue and inserts them in the database. This is a sequential process and the latency to the database puts a limit on how many pings can be processed per second. To illustrate, under normal operation the worker can process all incoming pings without a backlog building up:
May 22 13:06:00 www1 hchk: 47 pings/s
May 22 13:06:01 www1 hchk: 105 pings/s
May 22 13:06:02 www1 hchk: 328 pings/s
May 22 13:06:03 www1 hchk: 265 pings/s
May 22 13:06:04 www1 hchk: 108 pings/s
May 22 13:06:05 www1 hchk: 72 pings/s
During the high-latency period, throughput drops significantly and backlog starts to build up (timestamps in CEST):
May 20 17:39:49 www1 hchk: 5 pings/s, queued 47, dwell time 2373ms
May 20 17:39:50 www1 hchk: 5 pings/s, queued 68, dwell time 3358ms
May 20 17:39:51 www1 hchk: 6 pings/s, queued 104, dwell time 3696ms
May 20 17:39:52 www1 hchk: 3 pings/s, queued 180, dwell time 4624ms
May 20 17:39:53 www1 hchk: 5 pings/s, queued 241, dwell time 5268ms
May 20 17:39:54 www1 hchk: 3 pings/s, queued 268, dwell time 6116ms
May 20 17:39:55 www1 hchk: 4 pings/s, queued 292, dwell time 6975ms
May 20 17:39:56 www1 hchk: 5 pings/s, queued 325, dwell time 7819ms
May 20 17:39:57 www1 hchk: 3 pings/s, queued 340, dwell time 8445ms
When the dwell time (the age of the oldest item in the queue) goes above 15 seconds, the app server declares itself unhealthy. When all app servers are unhealthy, clients start getting "HTTP 503 Service Unavailable" errors.
Even if the network hiccup is brief, and no client request gets rejected, there is still a negative consequence: there is a delay from the client making a HTTP request, to the ping getting registered in the database. For a check with a tight timing, this additional delay can push it over the limit and cause a false alert.
On May 20, after another latency spike, I decided to fail-over the database to a hot standby. I didn't know what was ultimately causing the network issues, but the hope was that moving to a different physical host will make things better.
The fail-over procedure is simple:
The application servers didn't need to be updated or restarted – they knew IP addresses of both database hosts, and could automatically cycle between them until they found one that was alive and was accepting writes. More details on this in PostgreSQL documentation: Specifying Multiple Hosts and target_session_attrs.
The only issue I ran into during the fail-over was with a firewall rule denying one of the app servers access: I had mistyped port "5432" as "4322", which was an easy fix.
The service was back to normal operation, but the database host was now lacking a hot standby. Luckily, I had a fresh system (sporting Ryzen 3700X) already ordered and waiting on the sidelines. I used provisioning scripts to set up a hot standby on it, but checking and double-checking everything still took close to an hour.
I tried contacting Hetzner Support about the packet the loss and latency jump. I provided mtr reports (which they always ask for), and ping logs showing latency. After some forth-and-back ultimately the response I got was "There is and was no general network issue known. Also we do not see any packet loss from here."
To be fair, this could have been an issue with the hardware or software on the physical host – I can't say for sure.
As mentioned earlier, each app server processes pings one-by-one, over a single database connection. When a single operation blocks, the whole queue gets stuck. What if we used a connection pool? Or processed pings in batches – say, 100 at a time? Or added a dedicated queue component to the mix?
These are all interesting ideas worth of consideration. I've already done some work on the batch processing idea and have promising results. That being said, packet loss would be problematic even with these workarounds. It would need to be solved anyway.
As always, I will continue to look for any opportunities to make the service more robust.