## SQLite: Can Using Unicode Collations be Improved? Attempt 2

In my quest to improve the use of Unicode collations in SQLite, I figured I needed to accomplish two things first. First, persist the locale for the DB between closing and opening it again. Second, change the default collation from BINARY (ASCII) to a collator for the locale.

Here is the logic that I used for my implementation:
1. try to load the ICU collation for the given locale or use the system’s default locale
2. check if any tables other than the “locale table” exist. If they exist, exit
3. update the “locale table” with the name of the locale. (This allows reloading the collation when opening the DB)
4. using the ICU collation, create an SQLite collation and make it the default collation for the DB

This is some of the source code for implementing the logic above:

int setDBDefaultLocale(sqlite3 *db, const char *const zLocale)
{
  int rc;
  sqlite3_stmt *stmt;
  UErrorCode status = U_ZERO_ERROR;
  UCollator *pUCollator = ucol_open(zLocale ? zLocale : uloc_getDefault(), &status);

  if (!U_SUCCESS(status))
  {
    return SQLITE_ERROR;
  }

  // check if any tables other than the "locale table" have been created;
  // the default locale cannot be changed if another table already exists
  if ((rc = sqlite3_prepare_v2(db,
        "select count(*) from sqlite_master where name != 'locale_table' and type = 'table'",
        -1, &stmt, 0)) != SQLITE_OK ||
      (rc = sqlite3_step(stmt)) != SQLITE_ROW)
  {
    sqlite3_finalize(stmt);
    ucol_close(pUCollator);

    return rc;
  }

  if (sqlite3_column_int(stmt, 0))
  {
    sqlite3_finalize(stmt);
    ucol_close(pUCollator);
    // (error message reporting omitted from this excerpt)

    return SQLITE_MISUSE;
  }

  sqlite3_finalize(stmt);

  // creates a locale table if one doesn't exist and updates it with the value of zLocale
  updateLocaleTable(zLocale ? zLocale : uloc_getDefault());

  // register the ICU collator as an SQLite collation named "LOCALE"
  if ((rc = sqlite3_create_collation_v2(db, "LOCALE", SQLITE_UTF16,
        (void *)pUCollator, icuCollationColl, icuCollationDel)) != SQLITE_OK)
  {
    ucol_close(pUCollator);

    return rc;
  }

  // set the default collating sequence to the locale collation
  db->pDfltColl = sqlite3FindCollSeq(db, SQLITE_UTF16, "LOCALE", 0);

  return db->pDfltColl ? SQLITE_OK : SQLITE_ERROR;
}


With the implementation above, I was able to accomplish the first task, i.e. persisting the locale between connections. But when I tried to change the locale, it would throw an unexpected exception. After much digging, I figured out that the call to sqlite3FindCollSeq was returning a buggy collating sequence: there was a mismatch between the DB encoding and the collating sequence encoding. That was solved by querying the encoding of the db object and passing that into the function call:

...
db->pDfltColl = sqlite3FindCollSeq(db, ENC(db), "LOCALE", 0);
...


But as I thought about it, because this implementation always creates a table, it breaks SQLite functionality that depends on having an empty database (e.g. setting the database encoding with pragma encoding = 'UTF-16').

As such, I am now investigating other ways to persist the locale. I am thinking of actually appending the locale to the header of the database file. One of my colleagues suggested persisting the locale outside of the database, using either another DB instance or a file, but I am pursuing the file header option. So far I have found that I may need to modify the Pager (pager.c) implementation, to add more space to store the locale, and the Btree (btree.c) implementation, specifically the sqlite3BtreeOpen method, to read the locale back in. From what I have found so far, it seems nothing has to change to persist the locale to disk.
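To make the idea concrete, here is a rough sketch (my own, not SQLite code) of reading and writing a locale name in the 20 reserved bytes (offsets 72-91) of the 100-byte database header. Stock SQLite requires those bytes to be zero, which is exactly why the Pager/Btree header validation would have to change:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical layout: stash the locale name in the reserved region
** of the 100-byte SQLite database header. */
#define LOCALE_OFFSET 72   /* start of the 20 reserved header bytes */
#define LOCALE_SIZE   20

/* Write a NUL-padded locale name into the reserved header bytes. */
static int writeLocale(FILE *pFile, const char *zLocale)
{
  char aBuf[LOCALE_SIZE] = {0};

  if (strlen(zLocale) >= LOCALE_SIZE) return -1;   /* locale name too long */
  memcpy(aBuf, zLocale, strlen(zLocale));
  if (fseek(pFile, LOCALE_OFFSET, SEEK_SET) != 0) return -1;
  return fwrite(aBuf, 1, LOCALE_SIZE, pFile) == LOCALE_SIZE ? 0 : -1;
}

/* Read the locale name back out of the header (zOut >= LOCALE_SIZE bytes). */
static int readLocale(FILE *pFile, char *zOut)
{
  if (fseek(pFile, LOCALE_OFFSET, SEEK_SET) != 0) return -1;
  if (fread(zOut, 1, LOCALE_SIZE, pFile) != LOCALE_SIZE) return -1;
  zOut[LOCALE_SIZE - 1] = '\0';  /* ensure termination */
  return 0;
}
```

In the real thing, the write would go through the Pager so the change is journaled, and sqlite3BtreeOpen would call the read side when the file is opened.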

So now the fun begins to implement the above. Here goes…

## SQLite: Can using Unicode collations be improved? Attempt 1

Suppose you are working with string data in a language other than English. How well will SQLite handle that data? Turns out that SQLite support for languages other than English is not turned on by default. Your non-English data will most probably produce some unexpected results.

Last week I began extending SQLite to provide more “user friendly” Unicode support. The motivation behind this is to get SQLite into the NexJ Express framework. For this to happen, it has to have full Unicode support.

Out of the box, SQLite provides the following:

• Unicode encoding of characters (UTF-8 & UTF-16)
• Support for custom collations
• Support for locales through the ICU library
• Open source code (** my favourite)

The default string collation that SQLite uses is BINARY. This collation just implements ASCII character manipulation: it “understands” only the 26 letters of the English alphabet. Feed the string manipulation functions Unicode-encoded characters from other languages and it breaks. The SQLite way to fix this is to
1. Get a build of SQLite with ICU enabled. It is easiest to just recompile; the ICU-enabled library I found was built on an old version of SQLite.
2. Create an ICU collation using “select icu_load_collation('[locale]', '[collation name]')”
3. Attach the collation to columns using the keyword collate, e.g. create table foo (bar COLLATE [collation name])
4. If using full-text search, use “create virtual table foo using fts4(bar, tokenize=icu [locale])”

This works fine to an extent. The problem I immediately noticed is that SQLite does not remember custom collations. Every time a database is opened, the “icu_load_collation” statements need to be run. In the case of the NexJ Express framework, the locale is not available to the code that opens connections. So the idea I had was to enable persisting of custom collations. A colleague of mine implemented this using a table that is updated with the locale and collation name. When a connection is opened, the library loads the collation from the information in the “locale table”.

I thought of taking this further by modifying the database open code to instead set the system’s locale as SQLite’s default collation. I have an implementation of this but it breaks when the locale is changed.

This week I will spend time on finding a fix for my implementation otherwise, will just use the one my colleague came up with.


## SQLite Adapter for NexJ Express – Overcoming the limitations of SQLite part 2

This is a continuation of my previous post: SQLite Adapter for NexJ Express – Overcoming the limitations of SQLite part 1.

I made mention of  the error codes SQLITE_LOCKED and SQLITE_BUSY the last time I blogged. These errors show up when new connections try to either read or write to an SQLite database instance that already has a connection with an active write operation.  Question 5 from the SQLite FAQ provides a quick description of the overall problem. The gist of it is that when interacting with an SQLite database using multiple concurrent connections, it only takes one writer connection to lock the database file and lock out all the rest. NexJ Express makes use of threading on the persistence layer and as such ran into this issue.

For several weeks we tried various solutions, ranging from setting statement query timeouts and creating an sqlite3_busy_handler to creating a singleton connection for all threads. The final solution was found after reading about transactions in chapter 5 of “The Definitive Guide to SQLite, Second Edition”. The chapter explains the transaction life cycle and the transition between the different lock states. Write operations cycle through the following states:
UNLOCKED -> PENDING (1) -> SHARED -> RESERVED -> PENDING (2) -> EXCLUSIVE -> COMMIT

Of note are the PENDING states; PENDING locks are gateway locks.
Before a writer dumps to the disk, it needs to get an EXCLUSIVE lock. When the writer moves from the RESERVED state to PENDING(2), it affects the database in two ways:
1. The writer will hold onto the PENDING(2) lock no matter what
2. While PENDING(2) is held, no other connection can go from UNLOCKED to SHARED. This creates a deadlock when a connection that has just committed a transaction tries to get a SHARED lock again: the PENDING(2) lock prevents it from entering PENDING(1), hence a deadlock.

Because of the second effect noted above, all the previous trial solutions failed: the application would retry indefinitely but never get an EXCLUSIVE lock. The book suggested using the BEGIN IMMEDIATE command instead of the BEGIN command to start a write transaction. This prevents the deadlock situation noted in 2 above. Here is an extract from The Definitive Guide to SQLite to give more detail:

“Since you know you want to write to the database, then you need to start by issuing begin IMMEDIATE. If you get a SQLITE_BUSY, then at least you know what state you’re in. You know you can safely keep trying without holding up another connection. And once you finally do succeed, you know what state you are in then as well—RESERVED. Now you can use brute force if you have to because you are the one in the right. If you start with a begin exclusive, on the other hand, then you are assured that you won’t have any busy conditions to deal with at all. Just remember that in this case you are doing your work in EXCLUSIVE, which is not as good for concurrency as doing the work in RESERVED”
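The gateway behaviour can be sketched with a toy model of the lock states (my own sketch, not SQLite's implementation):

```c
#include <stdbool.h>

/* Toy model of SQLite's lock gateway, NOT the actual pager code. */
typedef struct {
  int nShared;     /* readers currently holding SHARED */
  bool bPending;   /* a writer holds the gateway PENDING(2) lock */
} DbLock;

/* A connection can only move UNLOCKED -> SHARED while no PENDING lock is held. */
static bool acquireShared(DbLock *pDb)
{
  if (pDb->bPending) return false;   /* gated out at PENDING(1) */
  pDb->nShared++;
  return true;
}

/* A writer in RESERVED takes PENDING(2) and keeps it no matter what,
** but can only promote to EXCLUSIVE once every SHARED lock is gone. */
static bool promoteToExclusive(DbLock *pDb)
{
  pDb->bPending = true;
  return pDb->nShared == 0;
}
```

With one reader still holding SHARED, the writer parks on PENDING(2), every new attempt to take SHARED is refused, and retrying forever gets nobody anywhere; that is the infinite-retry situation described above.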

As such, the code for setAutoCommit in the Conn class was changed to use “begin immediate”. The application then needed to keep retrying if it received an SQLITE_LOCKED or SQLITE_BUSY error. A busy handler was created that returns 1 (i.e. try again). This wasn’t enough, since the application still received SQLITE_LOCKED and SQLITE_BUSY exceptions even when trying to set auto commit to false. It seems to have to do with SQLite internally detecting a deadlock and just throwing an exception. Changes needed to be made to the org.sqlite.PrepStmt and org.sqlite.Conn classes. The methods execute, executeBatch, executeQuery, executeUpdate, setAutoCommit, rollback and commit were wrapped in try-catch blocks to keep retrying if either an SQLITE_LOCKED or SQLITE_BUSY exception was thrown. The source code for this is in nexj.core.persistence.sql.SQLiteDataSourceWrapper.
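The retry wrapper boils down to a loop like the following sketch. The result codes match SQLite's numeric values, but retryWhileBusy and demoOp are illustrative stand-ins, not the actual SQLiteDataSourceWrapper code (which also backs off between attempts):

```c
/* Stand-in result codes, matching SQLite's numeric values. */
#define SQLITE_OK     0
#define SQLITE_BUSY   5
#define SQLITE_LOCKED 6

/* Keep retrying an operation while the database reports busy/locked. */
static int retryWhileBusy(int (*xOp)(void *), void *pArg)
{
  int rc;
  do {
    rc = xOp(pArg);
  } while (rc == SQLITE_BUSY || rc == SQLITE_LOCKED);
  return rc;
}

/* Hypothetical operation that succeeds on its third attempt. */
static int nAttempt = 0;
static int demoOp(void *pUnused)
{
  (void)pUnused;
  return ++nAttempt < 3 ? SQLITE_BUSY : SQLITE_OK;
}
```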

With that solution implemented, the server now knows how to handle a locked database file. Using begin immediate has some side effects though and as such some unit tests in the JDBC driver fail.

The source code for the SQLite adapter is shared on bitbucket.org/cwdesautels/nexj-express-sqlite-adapter in the ‘sqlite’ branch. Check out the SQLite* classes in the package nexj.core.persistence.sql.
The source code to build a customized Xerial SQLite JDBC driver and SQLite shell can be found at bitbucket.org/gbatumbya/nexj-express-sqlite. (This repo uses subrepos to track changes to gbatumbya/sqlite and gbatumbya/sqlite-jdbc)

## SQLite Adapter for NexJ Express – Overcoming the limitations of SQLite part 1

For the last two semesters I have been working on an SQLite persistence adapter for NexJ Model Server Express.  Today it is finally ready for the first code review.

Getting to this stage has taken much longer than I first anticipated. Turns out that SQLite’s “liteness” was turning into a blocker and at one point I even thought we would abandon the project altogether. I should have known this though. The SQLite website says to “Think of SQLite not as a replacement for Oracle but as a replacement for fopen()” and provides a list of situations where other RDBMS work better. In developing this adapter, we were actually pitting SQLite up against the likes of Oracle, PostgreSQL, MySQL, SQL server and other enterprise SQL servers.

First to be tackled was SQLite’s implementation of SQL92, or rather the lack thereof; granted though, the documentation for supported features/syntax is very good. We needed to come up with work-arounds to do things that are taken for granted when developing for enterprise-level RDBMSs. For example, it didn’t support altering a column’s definition or a table constraint, or dropping a column or constraint, and there are no stored procedures. Most of the work-arounds required renaming the old table, creating the new table, copying data from the old to the new and then finally deleting the old table.
There was also the case where it was required to convert columns containing hex string values into BLOBs. SQLite provides the x'ABC123' syntax for this, but it is limited: it does not take columns as input. For this we developed an extension function binary(). It pretty much does what x'' does, except that it is exposed as a function call.
It was also required to provide a ranking of results for FTS queries. Again an extension function rank() was developed. The rank for a returned row is calculated as: number of local hits / sum of global hits. The source of the extension functions can be found in sqliteNexjExtension.c.
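As a sketch of that formula (the real rank() extension pulls these hit counts from the FTS match data; the function name and signature here are illustrative):

```c
/* rank = (number of local hits) / (sum of global hits).
** aLocal[i]  = hits for term i in the current row,
** aGlobal[i] = hits for term i across the whole table. */
static double rankRow(const int *aLocal, const int *aGlobal, int nTerm)
{
  int i, nLocalHits = 0, nGlobalHits = 0;
  for (i = 0; i < nTerm; i++) {
    nLocalHits += aLocal[i];
    nGlobalHits += aGlobal[i];
  }
  /* guard against an empty index to avoid dividing by zero */
  return nGlobalHits ? (double)nLocalHits / nGlobalHits : 0.0;
}
```

A row that contains a large share of all the occurrences of the query terms therefore ranks close to 1, and a row with few of them ranks close to 0.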

Next up was selecting a JDBC driver. SQLite does not provide an official JDBC driver, but their wiki provides a list of available drivers. The decision was taken to go with the Xerial SQLite JDBC Driver (a fork of SQLiteJDBC). We found this driver to be the most up-to-date, and its ease of use of native SQLite libraries was a big plus. The driver did not provide implementations of XADataSource and XAConnection. The team at NexJ Systems ended up providing us with implementations (nexj.core.persistence.sql.pseudoxa) that wrap DataSource and Connection and expose them as XAResources. (But just last week, I came across SQLiteConnectionPoolDataSource, which provides XA implementations for the SQLite JDBC driver. We may later on switch to using this.)
We found the driver to be buggy with setting an encoding on a SQLiteDataSource and a query time-out on a statement. Turns out that calling SQLiteDataSource.setEncoding("UTF8") causes an exception. The driver tries to execute the statement pragma encoding = UTF-8; note the hyphen (-), which SQLite expects to be quoted. The javadoc on Statement.setQueryTimeout(int seconds) reads “Sets the number of seconds the driver will wait for a Statement object to execute to the given number of seconds”. Internally though, the driver calls sqlite3_busy_timeout, which “sets a busy handler that sleeps for a specified amount of time when a table is locked”. This ends up having the effect that if the database is locked when a statement is first executed, the statement will sleep for the given number of seconds and then try a second time, instead of trying repeatedly until the given number of seconds have passed. The solution for this was to use a TimerTask (nexj.core.util.sql.StatementCancelationTask) that was already implemented.

Next week I will have a post about how the dreaded SQLITE_LOCKED and SQLITE_BUSY were overcome so that SQLite could be used in a JavaEE application server. The source code for the adapter is shared on bitbucket.org/cwdesautels/nexj-express-sqlite-adapter in the ‘sqlite’ branch. Check out the SQLite* classes in the package nexj.core.persistence.sql.


## NexJ Express Integration Layer Now Supporting JSON

I am glad to write that another project to which I contributed, the JSON Integration adapter, passed its final code review and was merged into NexJ Express at the beginning of February 2012.

It brings back lots of memories of my first project at CDOT, during which I was first introduced to JSON as I developed the JSON RPC adapter. Some of my earlier work on classes from the RPC adapter, in particular JSONWriter and JSONParser, was reused in building the integration adapter.

The integration layer in NexJ Express allows a Model to interact with external applications asynchronously. With the addition of this adapter, the integration layer can now produce and consume JSON messages. The most common use case for the adapter is when NexJ Express has to interact with a web service that serves JSON. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate.

Thanks to Brian Lim who worked on the project at its very beginning during the summer of 2011 and helped get the ball rolling.

Seneca’s Centre for Development of Open Technology (CDOT) provides a physical and virtual environment for the development and research of open source software through collaboration with Seneca, the open source community, business, and other institutions. The centre is an integration point for knowledge, education, and relationships within the open source world.

NexJ Systems is a leading provider of enterprise private cloud software, delivering enterprise customer relationship management (CRM) solutions to the financial services, insurance, and healthcare industries.


## PostgreSQL Support Added to NexJ Express

Wow, been a while since I last blogged. But such an occasion calls for putting things aside and taking the time to blog.
After about 1 year of development, the PostgreSQL adapter code passed its final code review (there were 6 in total) and the code was merged into the core repository at mercurial.nexj.com/express. Thanks to Minoo and Anastasia who both contributed to the project in the past semesters.

The experience of developing this adapter involved lots of new learning and lots of refreshers. I learnt that there is more to writing code that interfaces with another system than just making it work. There was an emphasis placed on optimizing the code and making use of the things that PostgreSQL is “good at”. And while I got to appreciate PostgreSQL, I also got to know that there are some small annoyances with it (I think that, on Windows, creating a tablespace should understand Windows folder path syntax, i.e. it shouldn’t be necessary to change C:\path\to\ts to c:/path/to/ts). I also got to appreciate the notion of an open source community. It was neat that when I asked questions on the PostgreSQL forums or IRC, I got some very helpful pointers from the community members. Thanks to RhodiumToad, who answered a lot of my questions on IRC, and also Hiroshi, who helped me out with UUID for PostgreSQL x64 on Windows.

To NexJ Express users out there, you now have a choice between two of the leading open source databases: MySQL and PostgreSQL, to use as the data source for your models.

## PostgreSQL Windows x64: Enabling UUID-OSSP.sql

The background to this is that for over 6 months now, I have been working on developing a PostgreSQL adapter for NexJ Express, a part of Open Health Tools. Though most of the development is done in Java, I have also learnt a lot about PostgreSQL.

The Problem:
Although PostgreSQL has a UUID type, UUID generation depends on the uuid-ossp.sql module, which in turn depends on the OSSP-UUID library.
The PostgreSQL installer for Windows includes the uuid-ossp.sql module for x86 builds but not for x64 builds (the latest version that I checked was PostgreSQL 9.0.4).
When I asked on the #postgresql IRC channel, one of the things I was told was that there was no OSSP-UUID library built for Windows x64.

The Solution: (tested with PG 9.0.4 x64 build)
Turns out that there was a build of OSSP-UUID for x64 Windows in the wild. Thanks to Hiroshi Saito, who replied to my message on the pg_general mailing list.

1. Download the zip containing the OSSP-UUID Windows x64 build
2. Unzip the contents of the zip to path/to/PostgreSQL/9.0/
3. Open psql and install uuid-ossp.sql using the command:
\i 'path/to/PostgreSQL/9.0/share/contrib/uuid-ossp.sql'

Also, Hiroshi provides the source if you are interested in compiling OSSP-UUID on x64 Windows yourself.

## Tying Shoelaces And What is Valid JSON

Up until this week, I thought I knew enough about tying shoelaces and about JSON. Turns out that on both counts, I fell short.
First, to the tying of shoelaces. You would think I should be a pro by now, since I have done it since childhood, and you may be laughing at me, but before you do, check out this very informative video of a talk from TED 2005, Terry Moore: How to tie your shoes.

So to the techy stuff.
I got introduced to JSON at the beginning of last summer as I worked on creating a JSON RPC adapter for NexJ Express.
At the moment I am also involved in building a JSON Integration adapter, and as part of the design phase we had to figure out what valid JSON is.
To my astonishment, I found out that the JavaScript “primitives”, that is String, Number and Boolean, are not valid JSON. My misunderstanding came from misreading the railroad diagrams on json.org: I just thought that everything that was a valid value was also valid JSON.
So how did I find out I was wrong? My colleague Brian showed me the RFC for JSON, and right there it says
 A JSON text is a serialized object or array. JSON-text = object / array
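So, for example, under that definition (examples mine):

```
{"bar": 1}     valid JSON text (an object)
[1, 2, 3]      valid JSON text (an array)
"hello"        not a JSON text (a bare string)
42             not a JSON text (a bare number)
```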

I also asked another colleague, Dave, who works a lot with JavaScript, and he started off by saying basically the same thing: JSON is “Object Notation”.
So what does this mean for me? Well, I have authored some web services which serve what I thought was “JSON”, so I may have to go back and fix them.

## PostgreSQL: Create a Large Object (lo) from binary data (bytea) on server side

Again, I found something that NexJ Express uses that PostgreSQL does not have support for: creating large objects from bytea data.
As with Comparing bytea data to a Large Object on server side, I created a CAST (bytea AS oid) and a procedure to solve this problem.
The algorithm is pretty simple: get the binary data; if it is null, return null. Otherwise create a large object, open it, and pass the binary value directly to the lowrite function, instead of importing it from a file.

The code for the procedure is below. Note that the lo_manage package should be installed for this to work.

create or replace function blob_write(lbytea bytea)
returns oid
volatile
language plpgsql as
$f$
declare
  loid oid;
  lfd integer;
  lsize integer;
begin
  if (lbytea is null) then
    return null;
  end if;

  loid := lo_create(0);
  lfd := lo_open(loid, 131072);  -- 131072 = 0x20000 = INV_WRITE
  lsize := lowrite(lfd, lbytea);
  perform lo_close(lfd);
  return loid;
end;
$f$;
CREATE CAST (bytea AS oid) WITH FUNCTION blob_write(bytea) AS ASSIGNMENT;


So now the following code works:

CREATE TABLE bytea_to_lo (
lo largeObj
);

INSERT INTO bytea_to_lo VALUES ( DECODE('00AB','hex'));


## PostgreSQL: Compare bytea data to a Large Object on server side

While working on the PostgreSQL Adapter for NexJ Express, I found out that I needed to compare large objects to bytea data on the server side.
For example:
select * from Table_With_LOs where lo_value = bytea_value;

A little background:
PostgreSQL stores large object data in a separate table (pg_largeobject), and in the data column for the table it uses an oid (an integer value) as a “pointer” to this data.
So, doing "where lo_value = bytea_value", actually translates to
"where integer_value = bytea_value"

Solution:
Based on the blog post Reading XML Files into Database by RhodiumToad (he is very active in the #postgresql IRC channel), I created a cast function to go from oid to bytea data.

create or replace function blob_read(loid oid)
returns bytea
volatile
language plpgsql as
$f$
declare
  content bytea;
  lfd integer;
  lsize integer;
begin
  if (loid is null) then
    return null;
  end if;

  lfd := lo_open(loid, 262144);  -- 262144 = 0x40000 = INV_READ
  lsize := lo_lseek(lfd, 0, 2);  -- seek to the end to get the object's size
  perform lo_lseek(lfd, 0, 0);   -- seek back to the beginning
  content := loread(lfd, lsize); -- read the whole object into a bytea
  perform lo_close(lfd);
  return content;
end;
$f$;
CREATE CAST (oid AS bytea) WITH FUNCTION blob_read(oid) AS IMPLICIT;