Saturday, February 17, 2018

A case of network throughput optimization

The company that I work for has servers in several countries, including Germany, China, USA and Malaysia. We run MySQL with replication, and also sometimes need to copy images of virtual machines or LXC containers between servers. And, until recently, this was painfully slow, except between Germany and USA. We often resorted to recreating virtual machines and containers from the same template and doing the same manipulations, instead of just copying the result (e.g. using rsync or scp). We often received Munin alerts about MySQL replication not working well (i.e.: a test UPDATE that is done every two minutes on the master is not visible on the slave), and could not do anything about it. Because, well, it is just a very slow (stabilizes at 5 Mbit/s or so between USA and Malaysia, and even worse between China and anything else) network, and it is not our network.

So, it looked sad, except that raw UDP tests performed using iperf indicated much higher bandwidth (95 Mbit/s between USA and Malaysia, with only 0.034% packet loss) than what was available for scp or for MySQL replication between the same servers. It was clearly the case that the usual "don't tune anything" advice is questionable here, and that the system could, in theory, work better.
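
For reference, the raw UDP test was done with something along these lines (iperf 2.x syntax; the hostname is a placeholder):

# on the receiving side
iperf -s -u -i 1
# on the sending side, asking for e.g. 100 Mbit/s of UDP traffic for 30 seconds
iperf -c server.example.com -u -b 100M -t 30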

For the record, the latency, as reported by ping between the servers in USA and Malaysia, is 217 ms.

The available guides for Linux network stack tuning usually begin with sysctls that control various buffer sizes, e.g., setting net.core.rmem_max and net.core.wmem_max to bigger values based on the bandwidth-delay product. In my case, the estimated bandwidth-delay product (which is the same as the maximum amount of data in flight) is about 2.7 megabytes. So, setting both sysctls to 8388608 and retesting with a larger TCP window size (4 M) seemed logical. Except, it didn't really work: the throughput only went up from 5 Mbit/s to 8 Mbit/s. I didn't try to modify net.ipv4.tcp_rmem or net.ipv4.tcp_wmem, because the default values were already of the correct order of magnitude.
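
For the curious, the arithmetic and the settings tried here look approximately like this (the hostname is a placeholder, and the window matches the 4 M mentioned above):

# bandwidth-delay product: ~100 Mbit/s * 0.217 s / 8 bits per byte ~= 2.7 megabytes in flight
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
# retest with a bigger TCP window
iperf -c server.example.com -w 4M -t 30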

Other guides, including the official one from Red Hat, talk about things like NIC ring buffers, interrupts, adapter queues and offloading. But these things are relevant for multi-gigabit networks, not for the mere 95 Mbit/s that we are aiming at.

The thing that actually helped was to change the TCP congestion control algorithm. This algorithm is what decides when to speed up data transmission and when to slow it down.

Linux comes with many modules that implement TCP congestion control algorithms. And, in newer kernels, there are new algorithms and some improvements in the old ones. So, it pays off to install a new kernel. For Ubuntu 16.04, this means installing the linux-generic-hwe-16.04-edge package.

The available modules are in the /lib/modules/`uname -r`/kernel/net/ipv4/ directory. Here is how to load them all, for testing purposes:

cd /lib/modules/`uname -r`/kernel/net/ipv4/
for mod in tcp_*.ko ; do modprobe -v ${mod%.ko} ; done
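
After that, the kernel's view of what is available and what is currently used can be checked with sysctl (just a sanity check, not strictly required):

sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control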

For each of the loaded congestion control algorithms, it is possible to run iperf with the --linux-congestion parameter to benchmark it. Here are the results in my case, as reported by the server, with a 4 M window (changed by the kernel to 8 M); the commands used are sketched after the list.

bbr: 56.7 Mbits/sec
bic: 24.5 Mbits/sec
cdg: 0.891 Mbits/sec
cubic: 8.38 Mbits/sec
dctcp: 17.6 Mbits/sec
highspeed: 1.50 Mbits/sec
htcp: 3.55 Mbits/sec
hybla: 20.6 Mbits/sec
illinois: 7.24 Mbits/sec
lp: 2.13 Mbits/sec
nv: 1.47 Mbits/sec
reno: 2.36 Mbits/sec
scalable: 2.50 Mbits/sec
vegas: 1.51 Mbits/sec
veno: 1.70 Mbits/sec
westwood: 3.83 Mbits/sec
yeah: 3.20 Mbits/sec
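
The benchmark commands were of roughly this shape (the hostname and the test duration are placeholders; each algorithm name was substituted in turn):

# on the receiving side (the side whose numbers are quoted above)
iperf -s -w 4M
# on the sending side, e.g. for bbr
iperf -c server.example.com -w 4M -t 60 --linux-congestion bbr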

It is important that the speeds mentioned above come from the server-side reports (the iperf server is the receiver of the data). The client always reports a higher throughput. This happens because the kernel buffers the client's data and reports the transfer as finished even though a lot of data still sits in the buffer waiting to be sent. The server sees the actual duration of the transfer and is thus in a position to provide an accurate report.

A good question is whether the large window and the increased net.core.rmem_max and net.core.wmem_max are really needed. I don't think that benchmarking all the algorithms again makes sense, because bbr is the clear winner. Actually, for cdg, which is the worst algorithm according to the above benchmark, leaving the window size and r/wmem_max at their default values resulted in a speed boost to 6.53 Mbits/sec. And here are the results for bbr:

Default window size, default r/wmem_max: 56.0 Mbits/sec
Default window size (85 or 128 KB), 8M r/wmem_max: 55.4 Mbits/sec
4M window, 8M r/wmem_max: 56.7 Mbits/sec (copied from the above)

I.e.: in this case, the only tuning needed was to switch the TCP congestion control algorithm to something modern. We did not achieve the maximum possible throughput, but even this is a 10x improvement.

Here is how to make the changes persistent:

echo tcp_bbr > /etc/modules-load.d/tcp.conf
echo net.ipv4.tcp_congestion_control=bbr > /etc/sysctl.d/91-tcp.conf
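
To also switch the running system right away, without a reboot, something like this should work:

modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr
# note: this affects connections established from now on, not the existing ones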

There are some important notes regarding the bbr congestion control algorithm:

  1. It is only available starting with linux-4.9.
  2. In kernels before 4.13, it only operated correctly when combined with the "fq" qdisc (see the sketch after these notes).
  3. There are also important fixes, made in the 4.13 timeframe, regarding recovery from the idle state of the connection.
In other words, just use the latest kernel.
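
As for note 2, pairing bbr with fq on such a kernel would look roughly like this (eth0 is a placeholder for the real interface name):

tc qdisc replace dev eth0 root fq
# or, to make fq the default qdisc for interfaces initialized afterwards:
sysctl -w net.core.default_qdisc=fq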

I will not repeat the explanation of the mechanism that makes bbr good on high-latency, high-throughput, slightly lossy networks; Google's presentations do it better. Google uses it for YouTube and other services, and it needs to be present on the sender's side only. And it eliminated the MySQL replication alerts for us. So maybe you should use it, too?

Thursday, June 30, 2016

If you want to run a business in China

...then you will need a Chinese phone number, i.e. a phone number with the country code +86. Your customers will use this number to reach your company, and you will use it for outgoing calls to them, too.

There are many SIP providers that offer Chinese phone numbers, but not all of them are good. Here is why.

The phone system in China has an important quirk: it mangles Caller ID numbers on incoming international calls. This is not VoIP specific, and applies even to simple mobile-to-mobile calls. E.g., my mobile phone number in Russia starts with +7 953, and if I place a call to almost any other country, they will see that +7 953 XXX XXXX is calling. But, if I call a phone number in China, they will instead see something else, with no country code and no common suffix with my actual phone number.

The problem is that some SIP providers land calls to China (including calls from a Chinese number obtained from their pool) on gateways that are outside China. If you use such a provider and call a Chinese customer, they will not recognize you, because the call will be treated as international (even though it is meant to be between two Chinese phone numbers), and your Caller ID will be mangled.

As far as I know, there is no way to tell if a SIP provider is affected by this problem, without trying their service or calling their support.

Tuesday, May 24, 2016

Is TSX busted on Skylake, too? No, it's just buggy software

The story about Intel recalling the Transactional Synchronization Extensions (TSX) from the Haswell and Broadwell lines of their CPUs by means of a microcode update has already made the rounds on the web. But it looks like this is not the end of the story.

The company I work for has a development server at Hetzner, and it uses this type of CPU:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model  : 94
model name : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping : 3
microcode : 0x39
cpu MHz  : 3825.265
cache size : 8192 KB
physical id : 0
siblings : 8
core id  : 0
cpu cores : 4
apicid  : 0
initial apicid : 0
fpu  : yes
fpu_exception : yes
cpuid level : 22
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch intel_pt 
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep 
bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 
dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs  :
bogomips : 6816.61
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

I.e., it is a Skylake. The server is running Ubuntu 16.04, and the CPU has the HLE and RTM families of instructions.
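
A quick way to double-check that the flags are really there (just a convenience one-liner):

grep -o -w -E 'hle|rtm' /proc/cpuinfo | sort -u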

One of my recent tasks was to prepare, on this server, an LXC container based on Ubuntu 16.04 with a lightweight desktop accessible over VNC, for "remote classroom" purposes. We already have such containers on other servers, but they were based on Ubuntu 14.04. Such containers work well on this server, too, but it's time to upgrade. In these old containers, we use a regular Xorg server with a "dummy" video driver, and export the screen using x11vnc.

So, I decided to clone the old container and update Ubuntu there. Result: x11vnc, or sometimes Xorg, now crashes (SIGSEGV) when one attempts to change the desktop resolution. The backtrace points into the __lll_unlock_elision() function, which is a part of the glibc implementation of mutexes for CPUs with Hardware Lock Elision instructions.

This crash doesn't happen when I run the same container on a server with an older CPU (which doesn't have TSX in the first place), or if I try to reproduce the bug at home (where I have a Haswell, with TSX disabled by the new microcode).

So, all apparently points to a bug related to these extensions. Or does it?

The __lll_unlock_elision() function has this helpful comment in it:

  /* When the lock was free we're in a transaction.
     When you crash here you unlocked a free lock.  */

As it turns out, there is some discussion of another crash in __lll_unlock_elision(), related to the NVidia driver (which is not used here). In that discussion, it was pointed out that an unlock of an already-unlocked mutex would be silently ignored by a mutex implementation not optimized for TSX, but that a CPU with TSX would expose such a latent bug. Locking-balance bugs are easy to check for using valgrind. And indeed:

DISPLAY=:1 valgrind --tool=helgrind x11vnc
==4209== ---Thread-Announcement------------------------------------------
==4209== Thread #1 is the program's root thread
==4209== ----------------------------------------------------------------
==4209== Thread #1 unlocked a not-locked lock at 0x9CDA00
==4209==    at 0x4C326B4: ??? (in /usr/lib/valgrind/
==4209==    by 0x4556B2: ??? (in /usr/bin/x11vnc)
==4209==    by 0x45A35E: ??? (in /usr/bin/x11vnc)
==4209==    by 0x466646: ??? (in /usr/bin/x11vnc)
==4209==    by 0x410E30: ??? (in /usr/bin/x11vnc)
==4209==    by 0x717D82F: (below main) (libc-start.c:291)
==4209==  Lock at 0x9CDA00 was first observed
==4209==    at 0x4C360BA: pthread_mutex_init (in /usr/lib/valgrind/
==4209==    by 0x40FECC: ??? (in /usr/bin/x11vnc)
==4209==    by 0x717D82F: (below main) (libc-start.c:291)
==4209==  Address 0x9cda00 is in the BSS segment of /usr/bin/x11vnc

It is a software bug, not a CPU bug. But still - until such bugs are eliminated from the distribution, I'd rather not use it on a server whose CPU has TSX.

Sunday, May 8, 2016

Root filesystem snapshots and kernel upgrades

On my laptop (which is running Arch), I decided to have periodic snapshots of the filesystem, in order to easily revert bad upgrades (especially those involving a large and unknown set of interdependent packages). My toolset for this task is LVM2 and Snapper. Yes, I know that LVM2 is kind of discouraged, and that Snapper also supports btrfs, but most of the points below apply to btrfs, too.

Snapper, when used with LVM2, requires not just LVM2, but thinly-provisioned LVM2 volumes. Fortunately, Arch can have root filesystem on such volumes, so this is not a problem.

So, I have /boot on /dev/sda1, LVM on LUKS on /dev/sda2, root on a thinly-provisioned logical volume, and /home on another thinly-provisioned volume. And also swap, on a non-thinly-provisioned volume. A separate /boot partition is needed because boot loaders generally don't understand thinly-provisioned LVM volumes, especially on encrypted disks. A separate volume for /home is needed because I don't want new files in /home to be lost if I revert the system to an old snapshot. The same need for a separate volume applies to other directories that contain data that should be preserved, but there are no such directories on my laptop. They can appear if I install, e.g., PostgreSQL.
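
For reference, the thin pool and the thin volumes can be created along these lines (the volume group name and the sizes are made up for illustration, not my actual ones):

# a thin pool inside the volume group "vg0", which sits on top of the LUKS mapping
lvcreate --type thin-pool -L 200G -n pool0 vg0
# thinly-provisioned volumes for / and /home
lvcreate --type thin -V 60G --thinpool pool0 -n root vg0
lvcreate --type thin -V 120G --thinpool pool0 -n home vg0
# swap stays on a regular, non-thin volume
lvcreate -L 8G -n swap vg0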

And now there is a problem. Rollback to a snapshot works, but only if there were no kernel updates between the time when the snapshot was taken and the time when the revert is attempted. The root cause is that the kernel image is in /boot, while the loadable modules for it are in /usr/lib/modules. The modules are reverted, but the boot loader still loads the new kernel, which now has no corresponding modules.

There are two solutions: either revert the kernel and its initramfs, too, when reverting the root file system, or make sure that modules are not reverted. I have not investigated how to make the first option possible, even though it would be a perfect solution. However, I have tried to make sure that modules are not reverted, and I am not satisfied with the result.

The idea was to move the modules to /boot/modules, and make this location available somehow as /usr/lib/modules. Here "somehow" can mean either a symlink or a bind mount. A symlink doesn't work, because a kernel upgrade in Arch will restore it back to a directory. A bind mount doesn't work, either. The issue is that, by putting the modules on a non-root filesystem, one creates a circular dependency between local filesystem mounting and udev (this would apply to a symlink, too).

Indeed, systemd-udevd, on startup, maps the /usr/lib/modules/`uname -r`/modules.alias.bin file into memory. So, now it has a (real) dependency on /usr/lib/modules being mounted. However, mounting local filesystems from /etc/fstab sometimes depends on systemd-udevd, because of device nodes. So, bind-mounting /usr/lib/modules merely from /etc/fstab, using built-in systemd tools, cannot work.
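
For concreteness, the fstab entry in question would be something like the following line (with the modules physically living under /boot/modules, as per the idea above):

/boot/modules  /usr/lib/modules  none  bind  0  0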

But it can work from a wrapper that starts before the real init:

mount -n /boot              # /dev/sda1 is in devtmpfs and doesn't need udev
mount -n /usr/lib/modules   # there is still a line in fstab about that
exec /sbin/init "$@" 

But that's ugly. In the end, I removed the wrapper, installed an old known-working "linux" package, made a copy of the kernel, its initramfs and its modules, upgraded the kernel again, and put the saved files back, so that they are no longer controlled by the package manager. So now I have a known-good kernel further down in the boot menu, and the knowledge that its modules will always be present in my root filesystem as long as I don't revert further back than today's state.

And now one final remark. Remember that I said: "The same need for a separate volume applies to other directories that contain data that should be preserved"? There is a temptation to apply this to the whole /var directory, but that would be wrong. If a system is reverted to an old snapshot, the package database (which is in /var/lib/pacman) should be reverted, too. But /var/lib/pacman is under /var.

The conclusion is that Linux plumbers should think a bit about this "revert the whole system" use case, and maybe move some directories.

Sunday, December 20, 2015

Ready to drop Gentoo

I have been a Gentoo user since 2010. For me, it was, at that time, a source of fresh, well-maintained packages, without the multimedia-related, US-lawyer-induced brain damage that plagued Debian. Also, by compiling the packages on my local PC, it neatly sidestepped legal problems related to redistribution of GPL-ed packages with GPL-incompatible dependencies, and trademark issues related to Mozilla products. Also, it offered enough choice, in the form of USE flags, to sidestep too-raw technologies.

Today, I am re-evaluating this decision. I still care about perfect multimedia support, even if it relies on technologies that are illegal in some country (even if that country is my own). I still care about Firefox identifying itself as Firefox in the User-Agent header, so as to avoid broken sites (such as, but I don't want to use binaries from Mozilla, because they rely on outdated technology (i.e. are appropriate for something like RHEL 5). And, obviously, I care about modern and bug-free packages, or at least about non-upstream bugs (and, ideally, upstream bugs, too) being fixed promptly.

Also, I rely on a feature that is not found upstream in any desktop environment anymore: full-screen color correction, even in games. Yes, I have a colorimeter.

This was necessary with my old Sony VAIO Z23A4R laptop, because it had a wide-gamut screen (94% coverage of Adobe RGB) and produced very oversaturated colors by default. It is also necessary on my new laptop, a Lenovo Ideapad Yoga 2 Pro, because otherwise it is very hard to convince it to display the color yellow. Contrary to popular claims, it can display yellow, even in Linux, given the exact RGB values, but even slight changes (that would only produce a slightly different shade of yellow on normal screens) cause it to display either a yellowish-red or a yellowish-green color.

So, it must be easy for me to install extra packages (such as CompICC) from source, and, ideally, have them integrated into package management. And the fewer such extra packages are needed for full-screen color correction, the better.

Now back to Gentoo. It still allows me to ignore lawyers, too-radical Free Software proponents, and their crippling effect on the software that I want to use. It also, mostly, still allows me to take suspicious too-new infrastructure out of the equation. For full-screen color correction, I need exactly one ebuild that is not in the main Portage tree (CompICC). But other packages have started to suffer from bitrot.

Problem 1: The MATE desktop environment is stuck at version 1.8, probably just due to lack of manpower to review the updates. This is bug 551588.
Problem 2: An attempt to upgrade GNOME to version 3.18 brought in a lot of C++11-related breakage that wasn't handled promptly enough, e.g., by reverting the upgrade. This is bug 566328.
Problem 3: QEMU will not let Windows 8 guests use resolutions higher than 1024x768. Upstream QEMU does not have this bug - it is a product of overzealous unbundling that replaced a perfectly working bundled VGA BIOS with an inferior copy of the Bochs VGA BIOS. This is bug 529862.

I don't yet know which Linux distribution I will use. Maybe Arch (but it requires so much stuff from AUR to build CompICC! maybe I should use Compiz-CMS instead), maybe something else. We'll see.

Sunday, October 18, 2015

Still using for recruiting? Think again!

If your company has open vacancies and uses some system for pre-screening candidates (e.g. by giving them questions), I have a "small" task for you. Go to your system, answer the questions as if you were a candidate, validate the answers as you would expect from a candidate (e.g. actually perform the actions that the answer describes), and then save the results. Look at the whole process. Make a conclusion for yourself whether your system is usable for the stated purpose. Communicate it to your management, if needed.

If you are using for hiring technical candidates, the answer is most probably "not suitable at all".

The most annoying bug that has is that it does not allow the candidate to enter certain characters in certain positions. The exact error message is:
Q3 2 Contains invalid characters. You cannot use the characters: ' " \ / or ` in an enclosing instance of <>, <<, >> or ><.
This triggers at least on the following types of input:
  • XML or HTML
  • Command redirections, e.g.: echo "foo bar" >> baz.txt
  • Sequences of menu items to click, e.g.: "File > New > Folder", if a bad character happens to be before that
So, you cannot ask questions about HTML, shell scripting, or even general questions about using GUI-based applications.

This error message probably means that they are concerned about XSS attacks. However, filtering out invalid characters is a very sloppy way of protecting against such attacks. And it imposes completely unreasonable restrictions on the user input.

In fact, any kind of input (including XML, bash scripts or text about clicking the menu) should be suitable, and can be made to display safely and properly in any browser, just by escaping the special characters when generating the HTML page. Many template engines exist that do this escaping for you automatically. Today, there is simply no reason not to use them.
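
To illustrate how little is needed, here is a crude sketch of such escaping as a shell filter (a real application would rely on its template engine's escaping instead of this):

escape_html() {
  # order matters: escape & first, then the rest of the special characters
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g' -e "s/'/\&#39;/g"
}
printf '%s\n' 'echo "foo bar" >> baz.txt' | escape_html
# prints: echo &quot;foo bar&quot; &gt;&gt; baz.txt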

If a candidate sees such an error, he/she becomes demotivated. It is a stupid barrier standing in the way of getting the correct answer to you. It also indicates that you don't care about your customers (by choosing business partners that allow such sloppy practices). Worse, some of your candidates (who see for the first time) can think that it is your product, or your internal system, and that you (not have web developers with insufficient skills. I.e., that your company is not a good place to work, because you don't weed out underqualified workers.

You don't want to lose candidates. So you don't want to use Really.

Monday, September 15, 2014

Why static analyzers should see all the code

Just for fun, I decided to run the C code of the new "standard markdown" implementation through the static analyzer provided by the Clang project. On the surface, this looks very easy:

CCC_CC=clang scan-build make stmd

It even finds bugs. A lot of dead assignments, and some logic & memory errors: dereferencing a null pointer, memory leaks and a double-free. However, are they real?

E.g., it complains that the following piece of code in src/bstrlib.c introduces a possible leak of the memory pointed to by buff, which was previously allocated in the same function:

bdestroy (buff);
return ret;

It does not understand that bdestroy is a memory deallocation function. Indeed, as far as the analyzer knows, it could be anything; it could be defined in a different file. And bdestroy indeed does not destroy the buffer, and thus leaks the memory, if some integrity error occurs (and the return code is never checked).

So indeed, the code of bdestroy smells somewhat. But is it a problem? How can we trick clang into understanding that this can't happen?

Part of the problem stems from the fact that clang looks at one file at a time and thus does not understand dependencies between functions defined in different files. There is, however, a way to fix it.

All we need to do is to create a C source file that includes all other C source files. Let's call it "all.c".

#include "blocks.c"
#include "bstrlib.c"
#include "detab.c"
#include "html.c"
#include "inlines.c"
#include "main.c"
#include "print.c"
#include "scanners.c"
#include "utf8.c"
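
By the way, the list of #include lines does not have to be maintained by hand; it can also be generated (assuming, as is the case here, that every .c file under src/ should participate):

cd src
for f in *.c ; do [ "$f" = all.c ] || echo "#include \"$f\"" ; done > all.c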

Unfortunately, it does not compile out of the box, because of the conflicting "advance" macros in inlines.c and utf8.c (fixable by undefining these macros at the end of each file), and because of the missing header guard around stmd.h (fixable trivially by adding it). With that, one can submit this all-inclusive file to the static analyzer:

scan-build clang -g -O3 -Wall -std=c99 -c -o src/all.o src/all.c

Result: no bugs found, except dead assignments.