Using ICU for collation of multibyte charsets in PostgreSQL

The Problem

FreeBSD currently does not properly collate (aka sort) when using a locale with a UTF-8 encoded character set. Hence, with a "UNICODE" initialized database cluster, ORDER BY and upper(), lower() & initcap() will not work as expected in all cases.

The Solution

I've put together a patch for PostgreSQL that uses IBM's ICU package, http://site.icu-project.org for collation instead of the OS built in. It currently only tested on FreeBSD, and I have yet not made any performance checks, but it works fine for me. :) YMMV...

The patch is incorporated into the FreeBSD ports databases/postgresql*-server. Just make config and check the "ICU" box. For other OS:es, just use apply the patch to you source dir and add --with-icu to your configure command string.

The patch is maintained at https://github.com/girgen/postgres as branches prefixed with icu/.

2005-04-16
Fix problem with equality in some odd cases. Sometimes two equal strings would be considered not equal. I've changed the ucol_strcoll() to expect null terminated strings, and this fixes the problem. Please update to the latest version ASAP if you used a previous version of this patch!
2005-05-06
Some really strange problems, due to mystic reluctance to use static pointers in some cases, has been fixed. You would really want to upgrade to this version. Also, add error handling in all ICU calls, which might help in debugging.
2006-09-20
Update the patch to be more agnostic with regards to the version of icu. It now links with icu-3.6 as well as the earlier versions.
2006-09-25
Fix problem in previous patch. It now links and configures correctly with icu-3.6 as well as the earlier version 3.4.
2008-01-08
Chase autoconf versions in FreeBSD ports
2008-02-19
Chase icu update to 3.8. It should work with 3.4, 3.6 and 3.8 now. So far, only for postgresql-8.1.
2008-03-06
Petr Jelinek has ported the patch to postgresql-8.3. Thanks a lot, Petr! It is committed to the FreeBSD port now. I also modified it so it seems it is 25% faster for utf-8, by using ICU's strcollIter, saving us the work to convert to UTF-16 that is otherwise needed.
2008-03-17: Update the patch for postgresql-8.3:
In some locales wcscoll() can claim that nonidentical strings are equal. Believing that this might be so also for ICU, and believing that would be bad news for a number of reasons, we follow Perl's lead and sort "equal" strings according to strcmp (on the byte representation).
2008-08-28: Chase update of autoconf to 2.62
Minimal update to the patches in order to handle revision bump of autoconf, from 2.61 to 2.62.
2009-09-15
Rewrap support for PostgreSQL 8.4
2010-05-15
Rewrap and chase ICU updates
2010-09-15
PostgreSQL 9.0 support
2012-12-19
Rewrap and chase ICU updates
2014-02-10
Chase ICU updates
2014-05-18
Updated patch for PostgreSQL-9.4
2015-04-18
Updated patch for ICU-5.5
2016-08-10
Major upgrade. Support for setting COLLATE per column and per operation. Also performance fixes, and add support for PostgreSQL 9.5 and 9.6.

I recommend you to use the FreeBSD port, but if you rather build postgresql yourself, you can download the latest version.


Palle Girgensohn, girgen@FreeBSD.org