Using ICU for collation of multibyte charsets in PostgreSQL

The Problem

FreeBSD currently does not properly collate (aka sort) when using a locale with a UTF-8 encoded character set. Hence, with a "UNICODE" initialized database cluster, ORDER BY and upper(), lower() & initcap() will not work as expected in all cases.

The Solution

I've put together a patch for PostgreSQL that uses IBM's ICU package, http://icu.sf.net/ for collation instead of the OS built in. It currently only tested on FreeBSD, and I have yet not made any performance checks, but it works fine for me. :) YMMV...

The patch is incorporated into the FreeBSD port databases/postgresql80-server. Just make config and check the "ICU" box.

2005-04-16: Fix problem with equality in some odd cases. Sometimes two equal strings would be considered not equal. I've changed the ucol_strcoll() to expect null terminated strings, and this fixes the problem. Please update to the latest version ASAP if you used a previous version of this patch!

2005-05-06: Some really strange problems, due to mystic reluctance to use static pointers in some cases, has been fixed. You would really want to upgrade to this version. Also, add error handling in all ICU calls, which might help in debugging.

2006-09-20: Update the patch to be more agnostic with regards to the version of icu. It now links with icu-3.6 as well as the earlier versions.

2006-09-25: Fix problem in previous patch. It now links and configures correctly with icu-3.6 as well as the earlier version 3.4.

2008-01-08: Chase autoconf versions in FreeBSD ports

2008-02-19: Chase icu update to 3.8. It should work with 3.4, 3.6 and 3.8 now. So far, only for postgresql-8.1.

2008-03-06: Petr Jelinek has ported the patch to postgresql-8.3. Thanks a lot, Petr! It is committed to the FreeBSD port now. I also modified it so it seems it is 25% faster for utf-8, by using ICU's strcollIter, saving us the work to convert to UTF-16 that is otherwise needed.

2008-03-17: Update the patch for postgresql-8.3:
In some locales wcscoll() can claim that nonidentical strings are equal. Believing that this might be so also for ICU, and believing that would be bad news for a number of reasons, we follow Perl's lead and sort "equal" strings according to strcmp (on the byte representation).
2008-08-28: Chase update of autoconf to 2.62
Minimal update to the patches in order to handle revision bump of autoconf, from 2.61 to 2.62.

I recommend you to use the FreeBSD port, but if you rather build postgresql yourself, you can download the latest version.


Palle Girgensohn, girgen@FreeBSD.org
Last modified: Thu Aug 28 15:03:38 UTC 2008