Tag Archives: nexjexpress

SQLite: Can using Unicode collations be improved? Attempt 1

Suppose you are working with a string data other than English. How well will SQLite handle that data?  Turns out that SQLite support for languages other than English is not turned on by default. Your non-English data will most probably produce some unexpected results.

Last week I began on extending SQLite to provide a more “user friendly”  Unicode support. The motivation behind this is to get SQLite into the NexJ Express framework.  For this to happen, it has to have support full Unicode suport.

Out of the box, SQLite provides the following

  • Unicode encoding of characters (UTF-8 & UTF-16)
  • Support for custom collations
  • Support for locales through the ICU library
  • Open source code (** my favourite)

The default string collation that SQLite uses is BINARY. This collation just implements ASCII character manipulations. It “understands” only the 26 letters of the English alphabet. Feed the string manipulation functions unicode encoded characters for other languages and it breaks. The SQLite way to fix this is to
1. Get a build of SQLite with ICU enabled. It is easiest to just recompile. The ICU enabled library I found was built on an old version of SQLite.
2. Create an icu collation using “select icu_load_collation([locale], [collation name])
3. Attach the collation to columns using the keyword collate. E.g. create table foo (bar COLLATE [collation name])
4. If using full-text search then, use “create virtual table foo (bar tokenizer=icu [locale])

This works fine to an extent. The problem I immediately noticed is that SQLite does not remember custom collations. Every time a database is open, the “icu_load_collation” statements needs to be run. In the case of the NexJ Express framework, the locale is not available to the code that opens connections. So the idea I had was to enable persisting of custom collations. A colleague of mine implemented this using a table that is updated with the locale and collation name. When a connection is opened, the library loads the collation from the information in the “locale table”.

I thought of taking this further by modifying the database open code to instead set the system’s locale as SQLite’s default collation. I have an implementation of this but it breaks when the locale is changed.

This week I will spend time on finding a fix for my implementation otherwise, will just use the one my colleague came up with.

Tagged , , , ,