Friday, November 25, 2011

Namespace issues in Apache

Some time ago I was given a task to write a Python WSGI script that handles all (really all!) URLs on a given Apache-based virtual host. Doesn't that sound easy? Here is the "obvious" part of the Apache configuration file for that virtual host:

WSGIScriptAlias / /path/to/script.wsgi

Except that it is slightly wrong. The problem is that the script really should handle all URLs, even "evil" ones, without any omissions. In the above form, it doesn't.

First, if the URL contains a percent-encoded slash (e.g., as in http://example.org/foo%2fbar), Apache gives a 404 error by default, without even calling the script. Solution:

AllowEncodedSlashes On

Wait, there is more! Regular aliases have higher priority than our WSGIScriptAlias, and distributions usually set some default aliases. Due to that, http://example.org/icons/folder.gif will map to a static file, not to the WSGI script. There is a wishlist bug filed against Apache about the fact that one cannot easily remove aliases from the namespace.

The solution (or, more precisely, a bad hack) that I found is to use mod_rewrite. By adding some prefix to all URLs (and, of course, dealing with it in the script), one makes sure that the existing aliases are not hit:

AllowEncodedSlashes On

RewriteEngine On
RewriteRule ^/(.*)$ /dummyprefix/$1 [PT]

WSGIScriptAlias /dummyprefix /path/to/script.wsgi
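What "dealing with it in the script" might look like is sketched below. This is only an illustration, not the original script: it assumes that mod_wsgi puts the alias (/dummyprefix) into SCRIPT_NAME and the rest of the original path into PATH_INFO, and it simply echoes the path back.

```python
def application(environ, start_response):
    # Assumption: with "WSGIScriptAlias /dummyprefix ...", SCRIPT_NAME holds
    # the artificial prefix and PATH_INFO holds the path the client asked for.
    path = environ.get("PATH_INFO", "")

    # Hide the artificial prefix from any code that later rebuilds URLs
    # from SCRIPT_NAME + PATH_INFO.
    environ["SCRIPT_NAME"] = ""

    body = ("Requested path: %r\n" % path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

The point of clearing SCRIPT_NAME is that frameworks tend to prepend it when generating links, and the prefix must never leak into URLs shown to the client.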

Still, this configuration doesn't work (Apache gives a 404 error instead of calling the script) on URLs that include some bad characters such as newlines, e.g. http://example.org/foo%0abar . Removing the dollar sign from the pattern fixes the 404 error and the script gets called, but it then receives a truncated PATH_INFO in the environment. SCRIPT_URI is still correct, though, so this may be enough. In fact, if the script doesn't care about PATH_INFO, even this works:

AllowEncodedSlashes On

RewriteEngine On
RewriteRule ^/ /dummyprefix [PT]

WSGIScriptAlias /dummyprefix /path/to/script.wsgi

The real cause of the issue with the original RewriteRule is that the "." metacharacter doesn't really match all characters. Indeed, it doesn't match a newline. So, in order to match the full URL, including the evil encoded newline in the middle, one has to write an explicit range covering all possible characters:

AllowEncodedSlashes On

RewriteEngine On
RewriteRule ^/([\x00-\xff]*)$ /dummyprefix/$1 [PT]

WSGIScriptAlias /dummyprefix /path/to/script.wsgi
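The behaviour of the three patterns can be reproduced outside Apache. The sketch below uses Python's re module rather than the PCRE library that Apache uses, on the assumption that both treat "." and "$" the same way by default, and that mod_rewrite matches against the already decoded path.

```python
import re

# Decoded form of the path in http://example.org/foo%0abar
path = "/foo\nbar"

# Original rule: "." stops at the newline, so "$" can never be reached.
anchored_dot = re.match(r"^/(.*)$", path)

# Without the dollar sign the pattern matches, but captures only a part.
unanchored_dot = re.match(r"^/(.*)", path)

# An explicit range covers the newline as well.
full_range = re.match(r"^/([\x00-\xff]*)$", path)

print(anchored_dot)                    # None
print(repr(unanchored_dot.group(1)))   # 'foo'
print(repr(full_range.group(1)))       # 'foo\nbar'
```

The second result corresponds to the truncated PATH_INFO seen earlier, and the third one is what the explicit range in the configuration above achieves.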

Whoops. Now a typical sysadmin probably won't understand the need (or will forget the reason) behind such a complex configuration for a seemingly simple issue of passing all URLs to a single script. And I am still not 100% sure that all valid URLs are really handled by the script. Maybe I should have started with a different web server, one that doesn't have a polluted namespace in virtual hosts by default.

Update: In the comments, the following was suggested:

AllowEncodedSlashes On
WSGIHandlerScript wsgi-handler /path/to/handler.wsgi
SetHandler wsgi-handler

Here handler.wsgi would be a typical WSGI script, except that it provides the "handle_request" callable object instead of "application".
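A minimal sketch of such a handler.wsgi follows. The only part taken from the suggestion is the name of the callable; reading REQUEST_URI from the environment is my assumption about what the server provides, and the response body is purely illustrative.

```python
def handle_request(environ, start_response):
    # Same signature as the usual "application" callable; only the name
    # differs. REQUEST_URI is assumed to be set by the server and falls
    # back to an empty string if it is not.
    uri = environ.get("REQUEST_URI", "")

    body = ("Handled: %r\n" % uri).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```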

This solution looks elegant and simple, but, if PHP is also installed on the same server, it is wrong (thus proving the complexity of the problem and the fragility of the Apache URL namespace). It misses URLs that end in .php (even though the document root is empty). So, back to the ugly solution based on mod_rewrite.