Sunday, April 29, 2012

Verifying the Corresponding Source

(This post is a troll; don't take it too seriously!)

The GNU GPL and some other software licenses require you to make the source available if you publish the covered software in binary form. At first, this seems to be a very simple requirement. In theory, however, as I will show below, it can create problems for you even if you comply. What matters for this post is whether you can prove compliance, and this turns out to be unexpectedly hard. Please consider this (so far theoretical) danger when deciding to publish your own work under the GPL or similar licenses.

Let's take Arch Linux as an imaginary victim. They distribute a lot of software in binary form, and some of it is covered by the GNU GPL. Let's take GNU Bison as a simple example. At the time of this writing, the version of Bison distributed by Arch Linux is 2.5-3. Here is the x86-64 package, here are the purported machine-readable build instructions and patches, and the source can be fetched from gnu.org. Now imagine that someone claims (falsely, in this case) that the source, build instructions and patches do not correspond to the published binary package. How can this claim be refuted?

The obvious idea would be to rebuild the binary package from source and compare its contents to the published binary package. For simplicity, let's limit this comparison to the /usr/bin/bison binary. Unfortunately, this simple idea fails. The published binary is 377456 bytes long, while my attempt to rebuild it on an up-to-date Arch system resulted in a different length:

$ makepkg
<lots of output snipped>
$ tar tvf bison-2.5-3-x86_64.pkg.tar.xz | grep usr/bin/bison
-rwxr-xr-x root/root    377488 2012-04-29 17:41 usr/bin/bison

This mismatch means that the content of the resulting binary depends on more than what is distributed as the Corresponding Source, and that some other input has changed since the official build.
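The size check above can be repeated for any package without unpacking it to disk. The following is only a sketch: `member_size` is a hypothetical helper name, and it assumes a `tar` that detects the archive's compression on its own when reading (GNU tar and bsdtar both do).

```shell
# Hypothetical helper: print the size in bytes of one file stored in a
# package archive, without extracting the archive to disk.
#   $1 = package archive (e.g. a .pkg.tar.xz file)
#   $2 = member path inside the archive (e.g. usr/bin/bison)
# Note: if tar cannot find the member, it complains on stderr and the
# function prints 0, so treat a size of 0 with suspicion.
member_size() {
    # -O writes the extracted member to stdout instead of the file system;
    # tr strips the padding that some wc implementations add.
    tar -xOf "$1" "$2" | wc -c | tr -d ' '
}
```

Called as `member_size bison-2.5-3-x86_64.pkg.tar.xz usr/bin/bison` on the official package and then on the rebuilt one (saved under different file names of your choosing), it should print the two lengths quoted above, 377456 and 377488.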

Of course, one of the changed factors is the compiler version. Different compiler versions implement different optimizations, and thus generate different code. The bison binary in the official package is dated November 9th, 2011. So, to remove this factor, we need the same compiler that was available on that day. Fortunately, Arch has a Rollback Machine that has every version of every package back to 2008. So, the next attempt to produce an identical binary is to download and install gcc from the same date. This downgrade alone fails to produce a working gcc, because the run-time dependencies of gcc have to be downgraded too: cloog, gcc-libs, isl and ppl. Surprise: even the downgraded gcc doesn't produce a bison binary identical to the official one!

The other factor is that object code from bison's *.o files is not the only thing that goes into the resulting bison binary. The linker also inserts code from /usr/lib/libc_nonshared.a, which belongs to glibc. So, to cancel this factor, one would need to downgrade glibc, which is impossible to do safely: on today's Arch Linux a lot of software, including Bash, depends on glibc >= 2.15.

The solution is to install another Arch Linux system into an initially empty chroot, using the Rollback Machine instead of a mirror, i.e. with this line in /etc/pacman.d/mirrorlist:

Server = http://arm.konnichi.com/2011/11/09/$repo/os/$arch

The system installed this way produced a bison binary identical to the official one (the MD5 sums matched).
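For anyone who wants to repeat the final check, here is a minimal sketch of the comparison. It is not the exact procedure used for this post: instead of comparing MD5 sums it compares the two files byte for byte with `cmp`, which is a stricter test. `same_member` is a hypothetical helper name, and the same assumption about `tar` detecting compression applies.

```shell
# Hypothetical helper: succeed (exit status 0) only if the same member is
# byte-for-byte identical in two package archives.
#   $1, $2 = package archives to compare
#   $3     = member path inside both archives (e.g. usr/bin/bison)
# Exit status: 0 = identical, 1 = different, 2 = extraction failed.
same_member() {
    sm_dir=$(mktemp -d) || return 2
    # Extract each copy to a scratch file; bail out if tar reports an error,
    # so that two failed extractions are never mistaken for a match.
    tar -xOf "$1" "$3" > "$sm_dir/a" || { rm -rf "$sm_dir"; return 2; }
    tar -xOf "$2" "$3" > "$sm_dir/b" || { rm -rf "$sm_dir"; return 2; }
    # cmp -s is silent and reports the result through its exit status.
    cmp -s "$sm_dir/a" "$sm_dir/b"
    sm_rc=$?
    rm -rf "$sm_dir"
    return $sm_rc
}
```

With the official package and the one rebuilt in the chroot saved under two different names, `same_member official.pkg.tar.xz rebuilt.pkg.tar.xz usr/bin/bison && echo identical` would print `identical` only if the rebuild reproduced the binary exactly (both file names here are placeholders).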

So, in the end, I was able to prove that Arch Linux indeed distributes the source that corresponds to their bison binary. However, I would not have succeeded if they had not had their Rollback Machine, which is essentially a way to reconstruct the whole build environment, far beyond what the GPL seemingly requires.