Sunday, September 6, 2009

Matching multiple strings

Sometimes, it is necessary to check efficiently which of the many input strings contain one of the few pre-defined substrings.

The simplest approach



This task, of course, can be solved inefficiently by brute force, i.e. by passing each substring to strstr(), as demonstrated in the following benchmark that tries to search for known names of linux games in the same string (that actually doesn't match) 100000 times:


/* compile as follows: gcc -o benchmark1 benchmark1.c */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int check(const char* p)
{
int i;
for (i = 0; i < sizeof(words)/sizeof(words[0]); i++) {
if (strstr(p, words[i]) != NULL)
return 0;
}
return -1;
}

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
int result;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = check(text);

gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


On my desktop, this benchmark takes 2925 ms.

Aho-Corasick algorithm



Special fast algorithms exist for the same task. All of them preprocess the given set of substrings and use the result of such preprocessing when deciding whether the input string matches. One of such ahorithms is the Aho-Corasick algorithm. A free (BSD-licensed) implementation is available for download. Unfortunately, this implementation operates on wide-character strings and thus is not directly comparable with the strstr() approach. Also, if one of the substrings matches the very beginning of the input string, one cannot distinguish this situation with CSuffixTrie::SearchAhoCorasik() from non-match by looking only atCSuffixTrie::_DataFound::iFoundPosition. To compensate for this, I have extracted the SuffixTrie.{h,cpp} files from the archive and applied the following edits:


--- SuffixTrie.h
+++ SuffixTrie.h
@@ -51,7 +51,7 @@
{
public:
//Our string type
- typedef std::wstring SearchString;
+ typedef std::string SearchString;

//Data returned from our search
typedef struct _DataFound
@@ -107,7 +107,7 @@
virtual ~CSuffixTrie();
private:
//Our char search type
- typedef wchar_t SearchChar;
+ typedef char SearchChar;

//Forward declare the node
struct _Node;
--- SuffixTrie.cpp 2008-09-24 02:20:46.000000000 +0600
+++ SuffixTrie.cpp 2009-09-06 15:22:11.000000000 +0600
@@ -1,4 +1,3 @@
-#include "stdafx.h"
#include "SuffixTrie.h"

CSuffixTrie::CSuffixTrie()
@@ -136,7 +135,7 @@
void CSuffixTrie::BuildTreeIndex()
{
//Build index on the root
- BuildIndex(L"",
+ BuildIndex("",
&m_aRoot);
}

@@ -261,7 +260,7 @@
{
//Our data found
DataFound aData;
- aData.iFoundPosition=0;
+ aData.iFoundPosition=-1;

//The current string we match
SearchString sMatchedString;
@@ -295,7 +294,7 @@
pNode=&m_aRoot;

//Reset search string
- sMatchedString=L"";
+ sMatchedString="";

//Did we do a switch?
if (bSwitch)
@@ -386,7 +385,7 @@
pNode=&m_aRoot;

//Reset search string
- sMatchedString=L"";
+ sMatchedString="";

//Did we do a switch?
if (bSwitch)
@@ -439,7 +438,7 @@
iCount-=sMatchedString.length()-1;

//Reset the data
- sMatchedString=L"";
+ sMatchedString="";
}
}

@@ -495,7 +494,7 @@

//Start to build the trie
BuildStrings(aVector,
- L"",
+ "",
&m_aRoot);

//Done


Here is a benchmark:


/* compile as follows: g++ -O2 -o benchmark2 benchmark2.cpp SuffixTrie.cpp */
#include "SuffixTrie.h"
#include <stdio.h>
#include <string>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;

CSuffixTrie aTree;
int i;
for (i = 0; i < sizeof(words)/sizeof(words[0]); i++)
aTree.AddString(words[i]);

aTree.BuildTreeIndex();

std::string stext = text;
int pos;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++) {
pos = aTree.SearchAhoCorasik(stext).iFoundPosition;
}
gettimeofday(&end, NULL);

if (pos == -1)
printf("The string didn't match\n");
else
printf("The string matched\n");

usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}



On my desktop, this benchmark takes 326 ms. I.e., this is indeed faster than the simple strstr()-based approach.

Regular expressions



One may wonder why the Aho-Corasick algorithm is not available in the standard C library. The answer is that a more general mechanism, based on a similar idea, is available: regular expressions. One first constructs a regular expression by escaping and joining the given substrings using "|" as the separator, then compiles this regular expression with regcomp() and uses the result in regexec(). Here is a benchmark:


/* compile as follows: gcc -o benchmark3 benchmark3.c */
#include <stdio.h>
#include <regex.h>
#include <string.h>
#include <sys/time.h>

const char* re = "(actioncube|alien arena|astromenace|armagetron|"
"assault cube|atomorun2008|battle for wesnoth|blob wars|breakout|"
"briquolo|bzflag|celetania|conquest|enigma|freeciv|freecol|"
"freeorion|freetrain|glest|hedgewars|heretic|hexen|jardinanis|"
"lordsawar|lxdream|maniadrive|neverball|nexuiz|oolite|openarena|"
"opencity|openmw|open quartz|openttd|paintball 2|planeshift|"
"postal 2|sacred|scorched 3d|smokin' guns|supertux|supertuxkart|"
"teeworlds|tremulous|urban terror|vegastrike|warsow|wormux|x-moto)";

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
regex_t reg;
int result;

result = regcomp(&reg, re, REG_EXTENDED | REG_NOSUB);
if (result)
return 1;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = regexec(&reg, text, 0, NULL, 0);
gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


On this linux x86 system with glibc-2.10.1, it takes 312 ms, i.e., as good as the Aho-Corasick implementation referenced above.

On other platforms, the result may be different, because the system C library uses a different regular expression engine. E.g., the same benchmark, when compiled with TRE (replace <regex.h> with <tre/regex.h> and add -ltre to the gcc command line), takes 4158 ms, i.e., more than the brute-force approach. Since NetBSD will use TRE in the next release, this unsatisfactory result will likely manifest itself there. So, if you want to give the same performance guarantees on all platforms, don't use regcomp()/regexec().

The same note about unsatisfactory speed (8884 ms here) applies to PCRE. To repeat this test, replace <regex.h> with <pcreposix.h> and add -lpcreposix to gcc command line.

Generated code



It is possible to match multiple substrings much faster, if they are known at compile time. The trick is to convert the regular expression to C code with a tool such as re2c. Here is a benchmark of this approach:


/* compile as follows:
re2c benchmark4.re >benchmark4.c
gcc -O1 -o benchmark4 benchmark4.c */
#include <stdio.h>
#include <sys/time.h>

int check(const char* p)
{
const unsigned char* marker;
/*!re2c
re2c:define:YYCTYPE = "unsigned char";
re2c:define:YYCURSOR = p;
re2c:yyfill:enable = 0;
re2c:define:YYMARKER = marker;
re2c:yych:conversion = 1;
[\001-\377]*( "actioncube" | "alien arena" |
"astromenace" | "armagetron" |
"assault cube" | "atomorun2008" |
"battle for wesnoth" | "blob wars" |
"breakout" | "briquolo" | "bzflag" |
"celetania" | "conquest" | "enigma" |
"freeciv" | "freecol" | "freeorion" |
"freetrain" | "glest" | "hedgewars" |
"heretic" | "hexen" | "jardinanis" |
"lordsawar" | "lxdream" | "maniadrive" |
"neverball" | "nexuiz" | "oolite" |
"openarena" | "opencity" | "openmw" |
"open quartz" | "openttd" | "paintball 2" |
"planeshift" | "postal 2" | "sacred" |
"scorched 3d" | "smokin' guns" |
"supertux" | "supertuxkart" |
"teeworlds" | "tremulous" |
"urban terror" | "vegastrike" |
"warsow" | "wormux" |
"x-moto" ) { return 0; }
"" { return 1; }
*/
}

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
int result;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = check(text);
gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


When compiled with -O1 or -O3, this benchmark runs 12 ms here (i.e., almost 250 times faster than the brute-force approach). However, the -O2 optimization option makes the code slower (it runs 16 ms).

Saturday, July 25, 2009

Dangers of setjmp()/longjmp()

Suppose that you want to use an external library (e.g., libpng or libjpeg) that, in terms of error handling, only gives you a choice between an approach based on setjmp()/longjmp(), or aborting the program. Of course, in robust programs, the first method should be used. But there are pitfalls inherently associated with using setjmp(). Let me illustrate one of them using ImageMagick as an example.

Once ImageMagick determines the dimensions of a PNG image that it reads, it allocates memory to store the pixels. The task is to free it if the image file is in fact broken. But the input image can be so broken that libpng calls longjmp() before telling the dimensions of the image. In this case, ImageMagick allocates nothing and thus has nothing to free. In short, the error path has to free memory if and only if it has been allocated before.

Sounds simple? Let's look at the code that seems to implement this logic:


static Image *ReadOnePNGImage(MngInfo *mng_info,
const ImageInfo *image_info, ExceptionInfo *exception)
{
Image
*image;

png_struct
*ping;

unsigned char
*png_pixels;

/* set up ping, do other things */

png_pixels=(unsigned char *) NULL;
if (setjmp(ping->jmpbuf))
{
/*
PNG image is corrupt.
*/
/* ... */
if (png_pixels != (unsigned char *) NULL)
png_pixels=(unsigned char *) RelinquishMagickMemory(png_pixels);

/* ... */
return(GetFirstImageInList(image));
}

/* read png dimensions and other information - might call longjmp() */

if (num_passes > 1)
png_pixels=(unsigned char *) AcquireQuantumMemory(image->rows,
ping_info->rowbytes*sizeof(*png_pixels));
else
png_pixels=(unsigned char *) AcquireQuantumMemory(ping_info->rowbytes,
sizeof(*png_pixels));

/* read the scanlines - might call longjmp() */
/* do other useful things */
return (image);
}

At first, it seems that the code does exactly what is stated above: the bold statement frees the png_pixels pointer only if the memory has been allocated. But actually it doesn't work: the memory leaks if a broken PNG file is attempted to be processed. The library seems to forget after returning to setjmp() that a non-NULL value has been assigned to png_pixels. If one inserts printf() calls for debugging, they will print a non-NULL value after AcquireQuantumMemory(), but a NULL before the NULL check. So the memory is allocated and then not freed.

The explanation is that, after longjmp(), according to the ISO C standard (7.13.2.1, "The longjmp function"),

All accessible objects have values as of the time longjmp was called, except that the values of objects of automatic storage duration that are local to the function containing the invocation of the corresponding setjmp macro that do not have volatile-qualified type and have been changed between the setjmp invocation and longjmp call are indeterminate.


Indeed, the png_pixels variable is automatic, not volatile, and is changed between the calls to setjmp() and longjmp(), and thus the standard allows it to forget its value. GCC sometimes (but not always!) warns about such cases.

The simplest fix is usually to declare the variable as volatile. As for ImageMagick, they opted to fix the problem in a different way (but, if I read it correctly - with some dead code, and not for all variables), and version 6.5.4-6 is supposed to have no such memory leak.

Sunday, June 28, 2009

New TV Set

Recently I bougth a new TV set - a Sony KDL-32W5500 LCD. It replaced the old Soviet-era CRT-based "Akari" TV.

The new TV set receives both analog (still the norm in Russia) and digital stations. It contains an AVC tuner and thus doesn't need a set-top box or a separate decoder to watch the digital broadcasts (Russia, unlike the rest of the world, uses MPEG-4 Part 10, aka H.264, for them).

The default color settings are good, unlike those in the old TV, which gave oversaturated colors by default. The new TV set has a 1920x1080 panel, but I found that, for my eyes, 1080i and 720p (from the computer monitor) look nearly the same. And there is not a lot of HD content in Yekaterinburg anyway.

The default option for aspect ratio, however, is to stretch 4:3 content to 16:9 non-uniformly. The options to preserve the original 4:3 aspect and to stretch the letterboxed content uniformly do exist. I wish it could autodetect the letterbox - this would mean one less button to use.

While digital broadcasts are displayed perfectly, the analog tuner is not so good as in the old TV. Stations that were displayed with "snow" by the old TV now are not watchable at all due to either flashing color stripes or the "no signal" message.

As far as the inputs go, the Sony TV has 4 HDMI sockets, one VGA, one USB and one Ethernet connection. Using a DVI -> HDMI cable and a separate analog audio cable, this TV can be connected to my computer's video card and sound card. The intel driver had no problems recognizing it as a 1920x1080@60p panel.

However, as the primary TV show consumer is my mother, this is still not a very viable solution. I want to work while she watches films. So, I tried using the built-in media playback capabilities of this TV set.

First, it can play files from USB flash drives. The instruction says it accepts only MPEG-1, but de-facto, it accepted non-interlaced MPEG-2, too. It could not play a DVD rip from the flash drive. So, using this option means transcoding everything - not very good.

Second, it can act as a UPnP Renderer. The instruction says that the TV accepts MPEG-1, MPEG-2 and AVCHD formats. So, I thought I could set up a media server on my computer. I have tried MediaTomb and MiniDLNA. MediaTomb did not work at all (the TV displayed the "server is unsupported" message), MiniDLNA worked somewhat. I was able to send MPEG-1 and MPEG-2 files (including the DVD rip that failed to play from the flash drive) to the TV. I could not remux an available H.264 + AC3 Matroska file to MPEG-TS in such a way that the TV understands it. So, for typical torrented video files, this still means transcoding. So, I must conclude that the media playback capabilities of this TV are there only for the marketing department to be able to say that they exist.

So, I am a bit dissatisfied with this offer from Sony.

Sunday, March 8, 2009

Don't use old dtoa.c

A lot of projects use the dtoa.c file that bears the following copyright notice:

 * The author of this software is David M. Gay.
*
* Copyright (c) 1991, 2000, 2001 by Lucent Technologies.


If your software uses it, please take steps to update this file or stop using it. The reason is very simple: your program will compile just fine, but will work incorrectly when compiled with gcc-4.4.0 (not released yet). Here is a short testcase (save as test.c):

#include <stdio.h>

double strtod(const char *s00, char **se);

int main(int argc, char* argv[])
{
double result = strtod(argv[1], 0);
printf("%f\n", result);
return 0;
}


Compile as follows: gcc-4.4 -O2 -Dstrtod=my_strtod -DIEEE_8087=1 -DHonor_FLT_ROUNDS=1 -DTrust_FLT_ROUNDS -DOmit_Private_Memory=1 -DNO_INFNAN_CHECK=1 -DNO_HEX_FP=1 dtoa.c test.c

The -Dstrtod=my_strtod flag is needed in order to make sure that the version of strtod() from dtoa.c, not from your system C library, is used. -O2 is the default optimization level used by many software projects. The rest of the flags are explained in the dtoa.c file.

The testcase converts its argument to a double-precision floating-point number, and then prints the result. So, when given a number, this testcase should print it back. let's try:

$ ./a.out 1.1
11.000000


The result is clearly incorrect. The reason (as one can see by appending -Wall to the compiler command line) is that strict aliasing rules are violated in dtoa.c. Indeed, if one adds the -fno-strict-aliasing flag to the compiler command line, the resulting program will behave correctly:

$ ./a.out 1.1
1.100000


Here is the formulation of the aliasing rules from ISO/IEC 9899:TC2, section 6.5:


An object shall have its stored value accessed only by an lvalue expression that has one of the following types:


  • a type compatible with the effective type of the object,

  • a qualified version of a type compatible with the effective type of the object,

  • a type that is the signed or unsigned type corresponding to the effective type of the object,

  • a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

  • a character type.



dtoa.c does this:

typedef union { double d; ULong L[2]; } U;

#define word0(x) ((U*)&x)->L[1]
#define word1(x) ((U*)&x)->L[0]
#define dval(x) ((U*)&x)->d


So, it stores doubles, but reads this memory using lvalue expressions of the "ULong" type (and the other way round), contrary to the standard.

GCC allows such access only if the memory is accessed through a union type. The TC3 revision of the C99 standard also allows type punning through the union in the footnote in the section 6.5.2.3:

If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning".


In dtoa.c, the union is "invented" each time the code wants to reinterpret a double value as two ULongs. So, it doesn't make sense to speak about the member last used to store a value in the union, and the note above doesn't apply. I.e., instead of this code,

double a;
word0(a) = L;
word1(a) = 0;
return dval(a);


this should be used:

U a;
a.L[0] = L;
a.L[1] = 0;
return a.d;


Note that there are no pointer casts in the corrected code.

The topic of strict aliasing rules, with working and non-working examples, is explained in the GCC manual. An updated version of dtoa.c and the needed header are available in the sources of GCC itself.

Saturday, January 31, 2009

Package dependencies

Let's start with this fact: linphone uses the ffmpeg library to encode video. How can various package managers be informed about this fact by maintainers while building the linphone package?

In Arch Linux, the dependency information is expressed in the PKGBUILD file in a very straightforward way:

depends=('alsa-lib' 'ffmpeg>=20080715' 'gtk2' 'libexosip2>=3.1.0' 'speex>=1.1.12')


And this propagates in the same form into the binary package of linphone. But, is it really what is needed to prevent installation of a broken package? No.

The linphone binary from the package uses the libavcodec.so.51 library, not just "any ffmpeg >=20080715". So the linphone package will break when the ffmpeg package starts providing libavcodec.so.52 (and this already happened in the testing repository). Let's see how other package managers deal with the situation.

RPM-based distributions have to add this line to the linphone.spec script:

BuildRequires:  ffmpeg-dev


Then, RPM will build the linphone package and examine the resulting binaries. It will notice that the linphone binary uses the libavcodec.so.51 library and will add the dependency on that library (not on some abstract "ffmpeg" package). This way, if the new ffmpeg package no longer includes libavcodec.so.51, RPM will notice that the linphone package now has an unsatisfied dependency, print an error message and refuse to upgrade ffmpeg.

This is better than the Arch Linux way, but requires some external tool to know that libavcodec.so.51 is actually in the ffmpeg package. Otherwise, how would the "install linphone with all dependencies" request be handled?

Now let's talk about Debian and its derivatives. They also have a notion of a build-time dependency, and it is expressed in the debian/control file:

Build-Depends: ..., libavcodec-dev, ...


When the package is built, debhelper also notices that the resulting linphone binary depends on libavcodec.so.51, and searches the installed packages for this library. As a result, the dependency on the found package is added automatically to the binary package of linphone. This looks similar to the Arch Linux case, but Debian has a policy that requires the shared library version to be mentioned in the package name. I.e., linphone gets a dependency on "libavcodec51" package, while "libavcodec52" is a completely different package that (assuming no bugs) can be installed in parallel with libavcodec51, thus allowing the rebuilds of other packages for the ffmpeg upgrade to take place gradually, without any hurry and without the broken intermediate state.

Tuesday, January 6, 2009

Inefficient use of Valgrind

This is already stated in the documentation, but let me repeat. Valgrind can detect memory leaks only if the application uses malloc() and free(), or, in C++, operator new() and operator delete().

This is not always the case. E.g., ImageMagick, when configured with the --enable-embeddable switch, uses mmap() directly in order to allocate memory for its needs. Other problematic cases involve custom memory pools on top of malloc() and free(), as done in Glib.

So, if you want to find a memory leak with Valgrind, the first thing to do is to turn off all those custom memory allocation schemes. OTOH, if the buggy thing is the allocator itself, this will hide the bug.