Tuesday, May 24, 2016

Is TSX busted on Skylake, too? No, it's just buggy software

The story about Intel recalling Transactional Synchronization Extensions
from Haswell and Broadwell lines of their CPUs by means of a microcode update has hit the web in the past. But it looks like this is not the end of the story.

The company I work for has a development server in Hetzner, and it uses this type of CPU:


processor : 0
vendor_id : GenuineIntel
cpu family : 6
model  : 94
model name : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping : 3
microcode : 0x39
cpu MHz  : 3825.265
cache size : 8192 KB
physical id : 0
siblings : 8
core id  : 0
cpu cores : 4
apicid  : 0
initial apicid : 0
fpu  : yes
fpu_exception : yes
cpuid level : 22
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch intel_pt 
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep 
bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 
dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs  :
bogomips : 6816.61
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:


I.e. it is a Skylake. The server is running Ubuntu 16.04, and the CPU has HLE and RTM families of instructions.

One of my recent tasks was to prepare, on this server, an LXC container based on Ubuntu 16.04 with a lightweight desktop accessible over VNC, for "remote classroom" purposes. We already have such containers on other servers, but they were based on Ubuntu 14.04. Such containers work well on this server, too, but it's time to upgrade. In these old containers, we use a regular Xorg server with a "dummy" video driver, and export the screen using x11vnc.

So, I decided to clone the old container and update Ubuntu there. Result: x11vnc, or sometimes Xorg, now crashes (SIGSEGV) when one attempts to change the desktop resolution. The backtrace points into the __lll_unlock_elision() function which is a part of glibc implementation of mutexes for CPUs with Hardware Lock Elision instructions.

This crash doesn't happen when I run the same container on a server with an older CPU (which doesn't have TSX in the first place), or if I try to reproduce the bug at home (where I have a Haswell, with TSX disabled by the new microcode).

So, all apparently points to a bug related to these extensions. Or does it?

The __lll_unlock_elision() function has this helpful comment in it:

  /* When the lock was free we're in a transaction.
     When you crash here you unlocked a free lock.  */

And indeed, there is some discussion of another crash in __lll_unlock_elision(), related to NVidia driver (which is not used here). In that discussion, it was highlighted that an unlock of already-unlocked mutex would be silently ignored if a mutex implementation not optimized for TSX is used, but a CPU with TSX would expose such latent bug. Locking balance bugs are easily verified using valgrind. And indeed:

DISPLAY=:1 valgrind --tool=helgrind x11vnc
...
==4209== ---Thread-Announcement------------------------------------------
==4209== 
==4209== Thread #1 is the program's root thread
==4209== 
==4209== ----------------------------------------------------------------
==4209== 
==4209== Thread #1 unlocked a not-locked lock at 0x9CDA00
==4209==    at 0x4C326B4: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==4209==    by 0x4556B2: ??? (in /usr/bin/x11vnc)
==4209==    by 0x45A35E: ??? (in /usr/bin/x11vnc)
==4209==    by 0x466646: ??? (in /usr/bin/x11vnc)
==4209==    by 0x410E30: ??? (in /usr/bin/x11vnc)
==4209==    by 0x717D82F: (below main) (libc-start.c:291)
==4209==  Lock at 0x9CDA00 was first observed
==4209==    at 0x4C360BA: pthread_mutex_init (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==4209==    by 0x40FECC: ??? (in /usr/bin/x11vnc)
==4209==    by 0x717D82F: (below main) (libc-start.c:291)
==4209==  Address 0x9cda00 is in the BSS segment of /usr/bin/x11vnc
==4209== 
==4209== 

It is a software bug, not a CPU bug. But still - until such bugs are eliminated from the distribution, I'd rather not use it on a server with a CPU with TSX.

Sunday, May 8, 2016

Root filesystem snapshots and kernel upgrades

On my laptop (which is running Arch), I decided to have periodic snapshots of the filesystem, in order to revert bad upgrades (especially those involving a large and unknown set of interdependent packages) easily. My toolset for this task is LVM2 and Snapper. Yes, I know that LVM2 is kind-of discouraged, and Snapper also supports btrfs, but most of the points below apply to btrfs, too.

Snapper, when used with LVM2, requires not just LVM2, but thinly-provisioned LVM2 volumes. Fortunately, Arch can have root filesystem on such volumes, so this is not a problem.

So, I have /boot on /dev/sda1, LVM on LUKS on /dev/sda2, root on a thinly-provisioned logical volume, and /home on another thinly-provisioned volume. And also swap on a non-thinly-provisioned volume. A separate /boot partition is needed because boot loaders generally don't understand thinly-provisioned LVM volumes, especially on encrypted disks. A separate volume for /home is needed because I don't want my new files in /home to be lost if I revert the system to its old snapshot. The same need to make a separate volume applies to other directories that contain data that should be preserved, but there are no such directories on my laptop. They can appear if I install e.g PostgreSQL.

And now there is a problem. Rollback to a snapshot works, but only if there were no kernel updates between the time when the snapshot was taken and when an attempt to revert was made. The root cause is that the kernel image is in /boot, and loadable modules for it are in /usr/lib/modules. The modules are reverted, but the boot loader still loads a new kernel, which now has no corresponding modules.

There are two solutions: either revert the kernel and its initramfs, too, when reverting the root file system, or make sure that modules are not reverted. I have not investigated how to make the first option possible, even though it would be a perfect solution. However, I have tried to make sure that modules are not reverted, and I am not satisfied with the result.

The idea was to move modules to /boot/modules, and make this location available somehow as /usr/lib/modules. Here "somehow" can mean either a symlink, or a bind mount. A symlink doesn't work, because the kernel upgrade in Arch will restore it back to a directory. A bind mount doesn't work, either. The issue is that, by putting modules on non-root filesystem, one creates a circular dependency between local filesystem mounting and udev (this would apply to a symlink, too).

Indeed, systemd-udevd, on startup, maps the /usr/lib/modules/`uname -r`/modules.alias.bin file into memory. So, now it has a (real) dependency on /usr/lib/modules being mounted. However, mounting local filesystems from /etc/fstab sometimes depends on systemd-udevd, because of device nodes. So, bind-mounting /usr/lib/modules merely from /etc/fstab, using built-in systemd tools, cannot work.

But it can work from a wrapper that starts before the real init:

#!/bin/sh
mount -n /boot              # /dev/sda1 is in devtmpfs and doesn't need udev
mount -n /usr/lib/modules   # there is still a line in fstab about that
exec /sbin/init "$@" 

But that's ugly. In the end, I removed the wrapper, installed an old known-working "linux" package, made a copy of the kernel, its initramfs and modules, upgraded the kernel again, and put the saved files back, so that they are now not controlled by the package manager. So now I have a known good kernel down in the boot menu, and knowledge that its modules will always be present in my root filesystem if I don't revert further than up to today's state.

And now one final remark. Remember that I said: "The same need to make a separate volume applies to other directories that contain data that should be preserved"? There is a temptation to apply this to the whole /var directory, but that would be wrong. If a system is being reverted to its old snapshot, a package database (which is in /var/lib/pacman) should be reverted, too. But /var/lib/pacman is under /var.

The conclusion is that Linux plumbers should think a bit about this "revert the whole system" use case, and maybe move some directories.