Monthly Archives: June 2012

SQLite: Can Using Unicode Collations be Improved? Attempt 2

In my quest to improve using Unicode collations in SQLite, I figured I needed to accomplish two things first. First, persist the locale for the DB between closing and opening it again. Second, change the default collation from BINARY (ASCII) to the a collator for the locale.

Here is the logic that I used for my implementation:
1. try to load the ICU collation for the given locale or use the system’s default locale
2. check if any tables other than the “locale table” exist. If they exist, exit
3. update the “locale table” with the name of the locale. (This allows reloading the collation when opening the DB)
4. using the ICU collation,  create an sqlite3 collation and make it the default collation for the DB

This is some of the source code for implementing the logic above:

int setDBDefaultLocale(sqlite3 * db, const char * const zLocale)
{
  int rc;
  sqlite3_stmt* stmt;
  UErrorCode status = U_ZERO_ERROR;
  UCollator *pUCollator= ucol_open(zLocale ? zLocale : uloc_getDefault(), &status);

  if(!U_SUCCESS(status))
  {
    return SQLITE_ERROR;
  }
  // check if any tables other than the "locale table" have been created.
  // if they exist, just quit, can not change default locale if a another table has already been created
  if ((rc = sqlite3_prepare_v2(db, "select * from sqlite_master where name != 'locale_table' and type = 'table'", 76, &stmt, 0)) != SQLITE_OK ||
      (rc = sqlite3_step(stmt)) != SQLITE_ROW)
  {
     sqlite3_finalize(stmt);

      return rc;
  }

  if (sqlite3_column_int(stmt, 0))
  {
    sqlite3_finalize(stmt);
    localeErrMsg(vdbe);

    return SQLITE_MISUSE;
  }
  // creates a locale table if one doesn't exist and updates it with the value of zLocale
  updateLocaleTable(zLocale ? zLocale : uloc_getDefault());
  // create the collation
  if ((rc = sqlite3_create_collation_v2(db, "LOCALE", SQLITE_UTF16,
  (void *)pUCollator, icuCollationColl, icuCollationDel)) != SQLITE_OK)
  {
    ucol_close(pUCollator);

    return rc;
  }

  // set the default collating sequence to the locale collation
  db->pDfltColl = sqlite3FindCollSeq(db, SQLITE_UTF16, "LOCALE", 0);

  return db->pDfltColl ? SQLITE_OK : SQLITE_ERROR;
}

With the implementation above, I was able to accomplish the first task i.e. persist the locale between connections. But when I tried to change the locale, it would throw an unexpected exception. After much digging, I figured out that the call to sqlite3FindCollseq was returning a buggy collating sequence. There was a mismatch between the DB encoding and the collating sequence encoding. That was solved by querying the encoding of the db object and passing that into the function call:

...
  db->pDfltColl = sqlite3FindCollSeq(db, ENC(db), "LOCALE", 0);
...

But as I thought about it, because this implementation always creates a table, this breaks SQLite’s functionality that depends on having an empty database. (e.g. setting the database encoding, pragma encoding = ‘UTF-16’)

As such, I am now investigating other ways to persist the locale. I am thinking of actually appending the locale to the header of the database file. One of my colleagues suggested persisting the locale outside of the database, using either another DB instance or a file. I am now pursuing the file header option. So far I have found out that I may need to modify the Pager (pager.c) implementation, to add more space to store the locale and the Btree (btree.c) implementation, specifically the sqlite3BtreeOpen method to both read in the locale. From what I found out so far, seems like nothing has to change to persist the locale to the disk.

So now the fun begins to implement the above. Here goes…

Tagged , , , , , , ,

SQLite: Can using Unicode collations be improved? Attempt 1

Suppose you are working with a string data other than English. How well will SQLite handle that data?  Turns out that SQLite support for languages other than English is not turned on by default. Your non-English data will most probably produce some unexpected results.

Last week I began on extending SQLite to provide a more “user friendly”  Unicode support. The motivation behind this is to get SQLite into the NexJ Express framework.  For this to happen, it has to have support full Unicode suport.

Out of the box, SQLite provides the following

  • Unicode encoding of characters (UTF-8 & UTF-16)
  • Support for custom collations
  • Support for locales through the ICU library
  • Open source code (** my favourite)

The default string collation that SQLite uses is BINARY. This collation just implements ASCII character manipulations. It “understands” only the 26 letters of the English alphabet. Feed the string manipulation functions unicode encoded characters for other languages and it breaks. The SQLite way to fix this is to
1. Get a build of SQLite with ICU enabled. It is easiest to just recompile. The ICU enabled library I found was built on an old version of SQLite.
2. Create an icu collation using “select icu_load_collation([locale], [collation name])
3. Attach the collation to columns using the keyword collate. E.g. create table foo (bar COLLATE [collation name])
4. If using full-text search then, use “create virtual table foo (bar tokenizer=icu [locale])

This works fine to an extent. The problem I immediately noticed is that SQLite does not remember custom collations. Every time a database is open, the “icu_load_collation” statements needs to be run. In the case of the NexJ Express framework, the locale is not available to the code that opens connections. So the idea I had was to enable persisting of custom collations. A colleague of mine implemented this using a table that is updated with the locale and collation name. When a connection is opened, the library loads the collation from the information in the “locale table”.

I thought of taking this further by modifying the database open code to instead set the system’s locale as SQLite’s default collation. I have an implementation of this but it breaks when the locale is changed.

This week I will spend time on finding a fix for my implementation otherwise, will just use the one my colleague came up with.

Tagged , , , ,