Friday, November 12, 2010

Minimizing server downtime

Suppose that you are responsible for a typical LAMP server running Debian. The goal is to have the needed security updates and minimize downtime. Unfortunately, due to installation of security updates, services need to be restarted, and sometimes (after kernel updates) even a reboot is required. Here are some tips that will help you shrink the time of service unavailability.

  1. Upgrade your key services (apache, php and mysql) separately from the rest of software. Reason: if you just run "apt-get dist-upgrade" to upgrade everything, then your MySQL server will be stopped at the beginning of the whole upgrade and started only at the end, after unpacking all other packages. So, run "apt-get install mysql-server" before "apt-get dist-upgrade" to minimize the delay.
  2. If you know that kexec works on your hardware, use it: apt-get install kexec-tools. This way, you will avoid the long time (sometimes more than a minute) while your server shows useless BIOS screens. Instead of rebooting to the BIOS, the old kernel will load the new kernel directly and pass control to it, thus avoiding the delay.

Saturday, October 23, 2010

DNS servers in resolv.conf

Yesterday I found a problem with my Linode VPS. When running "aptitude update", it complained about not being able to resolve the name for security.debian.org (a server from where Debian distributes security updates).

Here is the /etc/resolv.conf file from that VPS, generated by dhclient:

domain members.linode.com
search members.linode.com
nameserver 109.74.194.20
nameserver 109.74.192.20
nameserver 109.74.193.20

It looks like the 109.74.194.20 server is overloaded and returns a SERVFAIL ("server failure - The name server was unable to process this query due to a problem with the name server") answer from time to time. Unfortunately, Debian has a bug in its version of libc that prevents the resolver from trying the other (working) servers if the first one returns an answer indicating a temporary failure.

The bug does not exist in glibc-2.12.1 on my Gentoo box at home. However, if the server does not respond at all, the Gentoo resolver spends 15 seconds before trying the next one, which is also not nice.

To work around the bug, you can run your own DNS server on 127.0.0.1 that forwards all queries to the official servers, and put 127.0.0.1 as the only nameserver in /etc/resolv.conf.

Thursday, August 5, 2010

Floating-point fields in MySQL

It is a well-known fact that floating-point values should not be compared for equality, due to rounding problems. MySQL puts an interesting twist to it, as demonstrated on the linux.org.ru forum. To see the strange effect, run the following SQL statements:

mysql> create table test (field float);
mysql> insert into test(field) values(0.3);
mysql> select * from test;
+-------+
| field |
+-------+
|   0.3 |
+-------+
1 row in set (0.00 sec)

mysql> select * from test where field = 0.3;
Empty set (0.00 sec)

I.e., it looks like MySQL successfully stores the value, but cannot find it later.

To see why it happens, recall that 0.3 cannot be represented exactly in binary form. It is a recurring binary fraction 0b0.0100110011... that has to be truncated somewhere when stored by a computer. When stored as a float data type in the table, 24 significant binary digits are retained. However, when 0.3 is mentioned directly in the query, it is treated as a double-precision number, and 53 significant bits are kept. The truncation thus occurs in different places, and the resulting numbers are just different:

mysql> select cast(field as decimal(15,15)) from test;
+-------------------------------+
| cast(field as decimal(15,15)) |
+-------------------------------+
|             0.300000011920929 |
+-------------------------------+
1 row in set (0.00 sec)

mysql> select cast(0.3 as decimal(15,15));
+-----------------------------+
| cast(0.3 as decimal(15,15)) |
+-----------------------------+
|           0.300000000000000 |
+-----------------------------+
1 row in set (0.00 sec)

MySQL documentation contains more information on problems with floating-point values.

Saturday, May 29, 2010

When linear search is faster than binary

It is often stated that a binary search algorithm performs better than linear search. Indeed, let's assume that the goal is to find the index of the largest integer in the sorted zero-based array A that is less than or equal to the given integer X. For simplicity, let's also assume that the size of array is N = 2K, that for all possible inputs A[0] <= X and that one can also put a sentinel value after the end, so that X < A[N] for all possible inputs X. E.g., if it is known that X always fits in 16 bits, one can put 65536 as the sentinel.

The assumptions above are true for the entropy decoder found in the Monkey's Audio codec. In that case, N = 64, X is unsigned and always fits in 16 bits, A[0] = 0, A[64] = 65536.

This is linear search:


A[N] = sentinel;
index = 0;
while (A[index + 1] <= X)
    ++index;


This is binary search:


int bit = N >> 1;  /* N is a power of two */
index = 0;
while (bit != 0) {
    if (A[index + bit] <= X)
        index += bit;
    bit >>= 1;
}


If the size of array to be searched is N, then the average case complexity is commonly stated to be O(log(N)) for binary search and O(N) for linear search. So, for the Monkey's Audio case presented above, one would expect, on average, 32 loop iterations for the linear search algorithm and 6 iterations for the binary search.

However, there is one key fact that makes the estimate above invalid. Such simple averaging is valid only if each result appears with the same probability. For Monkey's Audio entropy decoder, this is not the case.

In fact, out of the 64 possible results, "0" appears with 31% probability, and the total probability of the first 6 results is 87%. So, it is quite natural that linear search is faster than binary in this case, because it often stops very early.

Monday, May 10, 2010

Converting audio samples

Different audio applications process audio in different sample formats. Some use 16-bit samples internally, others use 32-bit or floating point samples. In either case, there is a need to convert the sample format for the purposes of export and playback. Sounds like a simple task? No, a rich field for disagreements and bugs, even without taking dithering into account.

E.g., when converting from floating-point samples to 16 bits, there is no universal agreement on the scale. Some (e.g., JACK) think that 1.0 should convert to 32767 and -1.0 to -32767. Within that model, there are two opinions how to get -32768 (the most negative integer representable in 16 bits): "-1.00003" or "you can't". The other viewpoint, implemented, e.g., in ALSA-lib, is that -1.0 should map to -32768, 0.99997 should give 32767 and 1.0 should clip. There is also a third possibility, namely, to use different scale factors for negative and positive sample values, so that 0.0 maps to 0, 1.0 maps to 32767 and -1.0 maps to -32768.

And now some code. If one adopts the 32767.0 scale factor (as in JACK) and declares the -32768 sample value as being unreachable, one can use the following C function for conversion:


int16_t to_int16(float f)
{
if (f > 1.0) f = 1.0;
if (f < -1.0) f = -1.0;
return (int16_t)(f * 0x7fff);
}


As expected, to_int16(1.0) == 32767 and to_int16(-1.0) == -32767. However, the following quite similar function that is supposed to convert floating-point samples to 32-bit signed ones doesn't actually work properly:


int32_t to_int32(float f)
{
if (f > 1.0) f = 1.0;
if (f < -1.0) f = -1.0;
return (int32_t)(f * 0x7fffffff);
}


Indeed, to_int32(1.0) returns the minimum, not the maximum possible 32-bit integer. That happens because, unlike 0x7fff, one cannot represent 0x7fffffff exactly as a floating-point number. So the nearest representable one is substituted, and that's 0x80000000. So the overflow happens, and the function returns a wrong result.

One fix would be to cast 0x7fffffff to a double-precision floating point number. Another (but not 100% equivalent) option is to change the scale factor so that it is close enough, but doesn't cause this overflow, i.e. to 0x7fffff00.

New job

This blog didn't receive any updates for quite a while. That's both because I was too busy at Yandex and because there were no interesting bugs to blog about. I no longer work at Yandex. There was no single big reason for this decision, just a lot of small issues that accumulated and that don't apply to anyone else.

Now I work for a different company (won't name it here, it is nether SEO nor Google) as a senior developer of a web service. The most pleasant result of the job change is that now I feel more like a developer: write mostly C code, not bash scripts, and can even make architectural decisions.