SQLite: Can using Unicode collations be improved? Attempt 1

Suppose you are working with a string data other than English. How well will SQLite handle that data?  Turns out that SQLite support for languages other than English is not turned on by default. Your non-English data will most probably produce some unexpected results.

Last week I began on extending SQLite to provide a more “user friendly”  Unicode support. The motivation behind this is to get SQLite into the NexJ Express framework.  For this to happen, it has to have support full Unicode suport.

Out of the box, SQLite provides the following

  • Unicode encoding of characters (UTF-8 & UTF-16)
  • Support for custom collations
  • Support for locales through the ICU library
  • Open source code (** my favourite)

The default string collation that SQLite uses is BINARY. This collation just implements ASCII character manipulations. It “understands” only the 26 letters of the English alphabet. Feed the string manipulation functions unicode encoded characters for other languages and it breaks. The SQLite way to fix this is to
1. Get a build of SQLite with ICU enabled. It is easiest to just recompile. The ICU enabled library I found was built on an old version of SQLite.
2. Create an icu collation using “select icu_load_collation([locale], [collation name])
3. Attach the collation to columns using the keyword collate. E.g. create table foo (bar COLLATE [collation name])
4. If using full-text search then, use “create virtual table foo (bar tokenizer=icu [locale])

This works fine to an extent. The problem I immediately noticed is that SQLite does not remember custom collations. Every time a database is open, the “icu_load_collation” statements needs to be run. In the case of the NexJ Express framework, the locale is not available to the code that opens connections. So the idea I had was to enable persisting of custom collations. A colleague of mine implemented this using a table that is updated with the locale and collation name. When a connection is opened, the library loads the collation from the information in the “locale table”.

I thought of taking this further by modifying the database open code to instead set the system’s locale as SQLite’s default collation. I have an implementation of this but it breaks when the locale is changed.

This week I will spend time on finding a fix for my implementation otherwise, will just use the one my colleague came up with.

Tagged , , , ,

2 thoughts on “SQLite: Can using Unicode collations be improved? Attempt 1

  1. Pomax says:

    That’s odd. I’ve been using SQLite 3 for a long time in combination with unicode strings now, and I’ve never had to tell it to use a specific charset. That said, I’ve always talked to the database files using tools that understood unicode, so that could be part of the problem.

    • gbatumbya says:

      “i’ve always talked to the database files using tools that understood unicode”
      That is not a problem actually, that seems to be what the developers of SQLite encourage. The database should just be a store for data handing off unicode related manipulation to the other applications.
      In our case, we do need SQLite to do unicode string manipulations like ordering Japanese characters for example.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: