Associated Files: UTFDecoder.zip
This collection of classes, when incorporated into an application, allows characters to be extracted from files encoded using any of the Unicode Standard (version 4.1.0) encoding schemes. Note that, at the time of writing, the Unicode Consortium has just announced the release of version 5.0.0 of the Unicode Standard. The classes referred to in this article have not yet been tested for conformance to this new version.
Before describing the operation of the UTF Decoder classes, I should explain what they could be used for and why I created them. To cut a long story short, I was working on an application which needed to extract characters, one at a time, from an XML file. Straightforward enough, I hear you say, but I then realised that the XML files could contain characters from any of the world's languages! How could I handle these additional characters in my application in a generic sort of way?
After much deliberation, I decided a good way to handle these characters would be to convert each of them to an 'unsigned long' or, using Unicode terminology, a UTF-32 value. The XML files could use any of the Unicode Standard encodings, but my application would use UTF-32 throughout. I would obviously have to create a UTF-32 string class if I needed to manipulate the character strings directly, but this would not pose too much of a problem.
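To make the idea of "one character, one UTF-32 value" concrete, here is a minimal sketch of decoding a single UTF-8 sequence into a 32-bit value. It is not taken from the UTF Decoder classes: the function name `DecodeUtf8` and its parameters are hypothetical, and it uses `std::uint32_t` where the article uses 'unsigned long'. It works on an in-memory string rather than a file to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only; not part of the UTF Decoder classes.
// Decodes the UTF-8 sequence starting at text[pos] into a single UTF-32 value.
// On success, stores the value in 'out', advances 'pos' past the sequence and
// returns true. On malformed or truncated input, returns false and leaves
// 'pos' and 'out' unchanged.
bool DecodeUtf8(const std::string& text, std::size_t& pos, std::uint32_t& out)
{
    if (pos >= text.size()) return false;

    const unsigned char lead = static_cast<unsigned char>(text[pos]);
    std::size_t extra = 0;      // number of continuation bytes expected
    std::uint32_t value = 0;    // bits collected so far
    std::uint32_t minimum = 0;  // smallest value this sequence length may encode

    if (lead < 0x80)                       { extra = 0; value = lead;        minimum = 0; }
    else if (lead >= 0xC2 && lead <= 0xDF) { extra = 1; value = lead & 0x1F; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0)        { extra = 2; value = lead & 0x0F; minimum = 0x800; }
    else if (lead >= 0xF0 && lead <= 0xF4) { extra = 3; value = lead & 0x07; minimum = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte

    if (text.size() - pos <= extra) return false;  // sequence is truncated

    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(text[pos + i]);
        if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
        value = (value << 6) | (c & 0x3F);
    }

    if (value < minimum) return false;                     // overlong encoding
    if (value >= 0xD800 && value <= 0xDFFF) return false;  // surrogate code point
    if (value > 0x10FFFF) return false;                    // beyond the Unicode range

    out = value;
    pos += extra + 1;
    return true;
}
```

Calling this in a loop until it returns false walks a UTF-8 string one character at a time, which is the same shape of interface the decoders offer for files.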
Anyway, the classes included in the UTF Decoder collection, available via the downloads page, represent my attempt at creating the arrangement described above. If you use the classes in your application, you may wonder how to identify which encoding a particular XML file employs. Typically, your application could examine the signature bytes (the byte order mark) at the beginning of the file, if present. Where these signature bytes are not present, you should assume that the file is encoded as UTF-8, which is the default the XML specification prescribes for a file that has neither a byte order mark nor an encoding declaration.
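The signature check described above could be sketched roughly as follows. This is an illustration rather than code from the UTF Decoder collection: the names `Encoding`, `Signature` and `DetectSignature` are hypothetical, and the function expects the caller to have already read the first few bytes of the file into a string. The byte patterns themselves are the standard Unicode signatures.

```cpp
#include <cstddef>
#include <string>

// Hypothetical types for illustration only; not part of the UTF Decoder classes.
enum class Encoding { Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

struct Signature {
    Encoding encoding;   // encoding implied by the signature (UTF-8 if none found)
    std::size_t length;  // number of signature bytes, 0 if none found
};

// Examines the first bytes of a file and reports which signature, if any,
// they start with. Falls back to UTF-8 with a zero-length signature.
Signature DetectSignature(const std::string& head)
{
    auto startsWith = [&head](const char* bytes, std::size_t n) {
        return head.size() >= n && head.compare(0, n, bytes, n) == 0;
    };

    // The UTF-32 signatures are tested first, because the UTF-32LE signature
    // FF FE 00 00 begins with the UTF-16LE signature FF FE.
    if (startsWith("\x00\x00\xFE\xFF", 4)) return {Encoding::Utf32BE, 4};
    if (startsWith("\xFF\xFE\x00\x00", 4)) return {Encoding::Utf32LE, 4};
    if (startsWith("\xEF\xBB\xBF", 3))     return {Encoding::Utf8, 3};
    if (startsWith("\xFE\xFF", 2))         return {Encoding::Utf16BE, 2};
    if (startsWith("\xFF\xFE", 2))         return {Encoding::Utf16LE, 2};

    return {Encoding::Utf8, 0};  // no signature: assume UTF-8
}
```

One known ambiguity of this approach: a UTF-16LE file whose first character after the signature is U+0000 would be reported as UTF-32LE. That is unlikely to matter for XML, where such a character is not permitted. The `length` field is the number of bytes a reader would need to skip before decoding the first real character.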
To use any of the UTF Decoders in your application, simply create a new instance of the relevant decoder, supplying the full path and name of your encoded file along with the signature and file offsets if required. Then call 'GetCharacter' to start retrieving decoded characters. If you use the default constructor to create an instance of a decoder, you must call 'Initialise', supplying the same information, before calling 'GetCharacter'. To start reading from the beginning of a file that has no signature, set both the signature offset and the file offset to zero. Where a signature exists at the beginning of an encoded file, set the signature offset to the number of bytes in that signature.
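The calling pattern above might look something like the following stand-in. Please treat every detail here as an assumption: the article does not show the real class names or signatures, so `SketchUtf16LEDecoder`, its parameter types, the `bool` return values and the rule that reading starts at signature offset plus file offset are all hypothetical. Only the method names 'Initialise' and 'GetCharacter' come from the text. The sketch handles UTF-16 little-endian only.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical stand-in for illustration only; the real UTF Decoder classes
// in UTFDecoder.zip may differ in names, parameters and behaviour.
class SketchUtf16LEDecoder
{
public:
    SketchUtf16LEDecoder() = default;

    SketchUtf16LEDecoder(const std::string& path,
                         std::streamoff signatureOffset,
                         std::streamoff fileOffset)
    {
        Initialise(path, signatureOffset, fileOffset);
    }

    // Opens the file and positions it for reading. ASSUMPTION: reading starts
    // at signatureOffset + fileOffset bytes from the beginning of the file.
    bool Initialise(const std::string& path,
                    std::streamoff signatureOffset,
                    std::streamoff fileOffset)
    {
        m_file.close();
        m_file.clear();
        m_file.open(path, std::ios::binary);
        if (!m_file) return false;
        m_file.seekg(signatureOffset + fileOffset, std::ios::beg);
        return static_cast<bool>(m_file);
    }

    // Retrieves the next character as a UTF-32 value. Returns false at end of
    // file or on malformed input (for example, an unpaired surrogate).
    bool GetCharacter(std::uint32_t& character)
    {
        std::uint32_t unit = 0;
        if (!ReadUnit(unit)) return false;

        if (unit >= 0xD800 && unit <= 0xDBFF) {  // high surrogate: needs a partner
            std::uint32_t low = 0;
            if (!ReadUnit(low) || low < 0xDC00 || low > 0xDFFF) return false;
            character = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            return true;
        }
        if (unit >= 0xDC00 && unit <= 0xDFFF) return false;  // lone low surrogate

        character = unit;
        return true;
    }

private:
    // Reads one 16-bit code unit stored least significant byte first.
    bool ReadUnit(std::uint32_t& unit)
    {
        char bytes[2];
        if (!m_file.read(bytes, 2)) return false;
        unit = static_cast<unsigned char>(bytes[0]) |
               (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[1])) << 8);
        return true;
    }

    std::ifstream m_file;
};
```

With a class shaped like this, a UTF-16LE file that begins with the two-byte signature FF FE would be opened with a signature offset of 2 and a file offset of 0, after which 'GetCharacter' is called repeatedly until it reports that no more characters are available.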