Saturday, February 17, 2018

A case of network throughput optimization

The company that I work for has servers in several countries, including Germany, China, the USA and Malaysia. We run MySQL with replication, and also sometimes need to copy images of virtual machines or LXC containers between servers. And, until recently, this was painfully slow, except between Germany and the USA. We often resorted to recreating virtual machines and containers from the same template and repeating the same manipulations, instead of just copying the result (e.g. using rsync or scp). We often received Munin alerts about MySQL replication not working well (i.e. a test UPDATE that is done every two minutes on the master is not visible on the slave), and could not do anything about it. Because, well, it is just a very slow network (it stabilizes at 5 Mbit/s or so between the USA and Malaysia, and is even worse between China and anything else), and it is not our network.

So, it looked sad, except that raw UDP tests performed using iperf indicated much higher bandwidth (95 Mbit/s between the USA and Malaysia, with only 0.034% packet loss) than what was available to scp or to MySQL replication between the same servers. So the usual "don't tune anything" advice was clearly questionable here, and the system could, in theory, work better.
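
For reference, this kind of raw UDP test can be reproduced with something like the following (a sketch assuming iperf 2; server.example.com is a placeholder for the receiving host):

# On the receiving server: accept UDP traffic and report bandwidth and packet loss
iperf -s -u
# On the sending server: push 100 Mbit/s of UDP traffic for 30 seconds
iperf -c server.example.com -u -b 100M -t 30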

For the record, the latency between the servers in the USA and Malaysia, as reported by ping, is 217 ms.

The available guides for Linux network stack tuning usually begin with sysctls regarding various buffer sizes, e.g. setting net.core.rmem_max and net.core.wmem_max to bigger values based on the bandwidth-delay product. In my case, the estimated bandwidth-delay product (which is the same as the amount of data in flight) would be about 2.7 megabytes. So setting both to 8388608 and retesting with a larger TCP window size (4 M) seemed logical. Except it didn't really help: the throughput only went up from 5 Mbit/s to 8 Mbit/s. I didn't try to modify net.ipv4.tcp_rmem or net.ipv4.tcp_wmem because the default values were already of the correct order of magnitude.
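
For completeness, this is roughly what that experiment looked like (a sketch; server.example.com is a placeholder, and the buffer values follow the estimate above):

# The estimated bandwidth-delay product is about 2.7 MB; round the buffers up to 8 MB
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
# Retest with a 4 MB TCP window (iperf 2 syntax)
iperf -s -w 4M                           # on the receiving server
iperf -c server.example.com -w 4M -t 30  # on the sending server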

Other guides, including the official one from Red Hat, talk about things like NIC ring buffers, interrupts, adapter queues and offloading. But these things are relevant for multi-gigabit networks, not for the mere 95 Mbit/s that we are aiming at.

The thing that actually helped was to change the TCP congestion control algorithm. This algorithm is what decides when to speed up data transmission and when to slow it down.

Linux comes with many modules that implement TCP congestion control algorithms. And, in newer kernels, there are new algorithms and some improvements in the old ones. So, it pays off to install a new kernel. For Ubuntu 16.04, this means installing the linux-generic-hwe-16.04-edge package.
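
That is, something like this (followed by a reboot into the new kernel):

apt-get install linux-generic-hwe-16.04-edge
reboot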

The available modules are in the /lib/modules/`uname -r`/kernel/net/ipv4/ directory. Here is how to load them all, for testing purposes:

cd /lib/modules/`uname -r`/kernel/net/ipv4/
for mod in tcp_*.ko ; do modprobe -v ${mod%.ko} ; done
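
After loading the modules, it is easy to check what the kernel now has available, and which algorithm is currently in use:

sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control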

For each of the loaded congestion control algorithms, it is possible to run iperf with the --linux-congestion parameter to benchmark it (a sample invocation is sketched after the list below). Here are the results in my case, as reported by the server, with a 4 M window (changed by the kernel to 8 M).

bbr: 56.7 Mbits/sec
bic: 24.5 Mbits/sec
cdg: 0.891 Mbits/sec
cubic: 8.38 Mbits/sec
dctcp: 17.6 Mbits/sec
highspeed: 1.50 Mbits/sec
htcp: 3.55 Mbits/sec
hybla: 20.6 Mbits/sec
illinois: 7.24 Mbits/sec
lp: 2.13 Mbits/sec
nv: 1.47 Mbits/sec
reno: 2.36 Mbits/sec
scalable: 2.50 Mbits/sec
vegas: 1.51 Mbits/sec
veno: 1.70 Mbits/sec
westwood: 3.83 Mbits/sec
yeah: 3.20 Mbits/sec
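
For reference, each of the lines above came from a run driven roughly like this (a sketch assuming iperf 2, where -Z is the short form of --linux-congestion; server.example.com is a placeholder):

# On the receiving server: listen with a 4 MB window and report the throughput
iperf -s -w 4M
# On the sending server: benchmark every loaded algorithm in turn
for algo in $(sysctl -n net.ipv4.tcp_available_congestion_control) ; do
    echo "=== ${algo} ==="
    iperf -c server.example.com -w 4M -t 30 --linux-congestion ${algo}
done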

It is important that the speeds mentioned above come from the server-side reports (the iperf server is the receiver of the data). The client always reports a higher throughput. This happens because the kernel buffers the client's data and reports the write as finished even though a lot of data still sits in the buffer waiting to be sent. The server sees the actual duration of the transfer and is thus in a position to provide an accurate report.

A good question is whether the large window and the increased net.core.rmem_max and net.core.wmem_max are really needed. I don't think that benchmarking all the algorithms again makes sense, because bbr is the clear winner. Interestingly, for cdg, which is the worst algorithm according to the benchmark above, leaving the window size and r/wmem_max at their default values resulted in a speed boost to 6.53 Mbits/sec. And here are the results for bbr:

Default window size, default r/wmem_max: 56.0 Mbits/sec
Default window size (85 or 128 KB), 8M r/wmem_max: 55.4 Mbits/sec
4M window, 8M r/wmem_max: 56.7 Mbits/sec (copied from the above)

In other words, in this case the only tuning needed was to switch the TCP congestion control algorithm to something modern. We did not achieve the maximum possible throughput, but even this is a 10x improvement.

Here is how to make the changes persistent:

echo tcp_bbr > /etc/modules-load.d/tcp.conf
echo net.ipv4.tcp_congestion_control=bbr > /etc/sysctl.d/91-tcp.conf
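
To apply the same change immediately, without a reboot, something like this should work:

modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl net.ipv4.tcp_congestion_control   # verify that bbr is now active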

There are some important notes regarding the bbr congestion control algorithm:

  1. It is only available starting with linux-4.9.
  2. In kernels before 4.13, it only operated correctly when combined with the "fq" qdisc (a persistent setting for this pairing is sketched after this list).
  3. There are also important fixes, regarding recovery from an idle connection, that happened in the 4.13 timeframe.

In other words, just use the latest kernel.
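
If you have to stay on a 4.9 to 4.12 kernel, the "fq" pairing mentioned in note 2 can be made persistent as well (a sketch; it appends to the same sysctl file created above):

# bbr before 4.13 relies on the fq packet scheduler for pacing
echo net.core.default_qdisc=fq >> /etc/sysctl.d/91-tcp.conf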

I will not repeat the explanation of why bbr works so well on high-latency, high-throughput, slightly lossy networks; Google's presentations do it better. Google uses it for YouTube and other services, and it needs to be present on the sender's side only. For us, it eliminated the MySQL replication alerts. So maybe you should use it, too?