Unicode

According to the specifications for freedb, all text data in the database should be encoded in one of three ways. Given this requirement, it is relatively easy to determine which encoding method has been used for any given record. Since the encoding method has not been stored explicitly anywhere in the database, it is necessary to use a simple test which is described below. Having identified the encoding, it is then desirable to convert all text to the same encoding for ease of searching.

The valid encoding methods are the following:

7-bit ASCII, which is the character encoding scheme that has been in use since the 1960s. It is capable of representing simple punctuation, numbers, and the non-accented upper and lower case letters of the English alphabet. It is recognisable because, even though modern computers store all characters in 8-bit bytes, ASCII only uses the low order 7 bits, allowing a maximum of 128 distinct values. Therefore, if a CDDB record contains no characters with a value greater than decimal 127, it is assumed to be encoded as ASCII and can be left unaltered.

ISO-8859-1, commonly known as Latin-1. This character encoding includes many more symbols and accented letters, so it can be used to represent most Western European languages. It is restricted to a total of 256 possible values, not all of which are used, and includes the 128 values of the ASCII character set. For more information see the excellent article at http://en.wikipedia.org/wiki/ISO_8859-1.

UTF-8 is one possible encoding of Unicode, the universal character encoding scheme which has been developed to permit the electronic representation of virtually all known human languages. Storing every Unicode character as a fixed two or four byte value can be inefficient and awkward in many applications. UTF-8 is an alternative coding scheme which maps each Unicode character to a sequence of one to four bytes (the original design allowed longer sequences, but the current standard stops at four), with the ASCII characters keeping their single byte values. These sequences have the property that not all possible sequences of 8-bit bytes are valid. Hence if a CDDB record contains characters with a value greater than decimal 127, but they do not form a valid UTF-8 string, then the record must be encoded as Latin-1. Otherwise it is assumed to be encoded as UTF-8. Once again, turn to Wikipedia for more information: http://en.wikipedia.org/wiki/Unicode.
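The three-way test described above can be sketched in C as follows. This is only an illustration of the idea and not the code from unicode-archive.c: the names guess_encoding, utf8_sequence_length and the enum values are hypothetical, and the sketch assumes the current definition of UTF-8, which limits sequences to four bytes and rejects overlong forms and surrogates.

```c
#include <stddef.h>

/* Hypothetical result codes for the three encodings freedb permits. */
enum encoding { ENC_ASCII = 'A', ENC_LATIN1 = 'L', ENC_UTF8 = 'U' };

/* Return the length (1 to 4) of the well-formed UTF-8 sequence that
   starts at s, or 0 if the bytes there are not valid UTF-8.
   n is the number of bytes remaining in the buffer. */
static size_t utf8_sequence_length(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;                      /* plain ASCII byte */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; cp = s[0] & 0x1F; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { len = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; }
    else return 0;       /* stray continuation byte or invalid lead byte */

    if (n < len) return 0;                          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;        /* not a continuation */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (len == 3 && cp < 0x800) return 0;           /* overlong form */
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     /* surrogate range */
    return len;
}

/* Apply the test from the text: no byte above 127 means ASCII, high
   bytes that form valid UTF-8 mean UTF-8, anything else is assumed
   to be Latin-1. */
enum encoding guess_encoding(const unsigned char *s, size_t n)
{
    size_t i = 0;
    int saw_multibyte = 0;

    while (i < n) {
        size_t len = utf8_sequence_length(s + i, n - i);
        if (len == 0) return ENC_LATIN1;
        if (len > 1) saw_multibyte = 1;
        i += len;
    }
    return saw_multibyte ? ENC_UTF8 : ENC_ASCII;
}
```

Under these assumptions the five bytes "Bj\xF6rk" would be classified as Latin-1, while the six bytes "Bj\xC3\xB6rk" would be classified as UTF-8.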

It was not difficult to further enhance the C program which was introduced earlier. The new version is called unicode-archive.c. It is capable of detecting the encoding used in each CDDB record and converting Latin-1 characters to the equivalent UTF-8 sequences, but unfortunately there is one small problem.
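Converting Latin-1 to UTF-8 is mechanical because the 256 Latin-1 values coincide with the first 256 Unicode code points. The following is a minimal sketch of that step, not the routine from unicode-archive.c; the function name latin1_to_utf8 and the requirement that the caller supplies the output buffer are assumptions made for illustration.

```c
#include <stddef.h>

/* Convert n bytes of Latin-1 text at src into UTF-8 at dst.
   The caller must provide room for 2 * n + 1 bytes, since every
   Latin-1 byte becomes at most two UTF-8 bytes plus a terminator.
   Returns the number of bytes written, excluding the terminator. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i, out = 0;

    for (i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                           /* ASCII is unchanged */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    dst[out] = '\0';
    return out;
}
```

For example, the Latin-1 byte 0xF6 (o with diaeresis) becomes the two bytes 0xC3 0xB6.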

Although all records in freedb are supposed to be one of ASCII, Latin-1 or UTF-8, over the years many records have been accepted into the database which were actually encoded using some other 8-bit scheme. Internally, some of these records "look" like they are Latin-1 and some of them "look" like they are UTF-8. There is no deterministic way to tell them apart. However, when they are displayed on screen using a Unicode-aware browser or other utility, they look like gibberish.

To assist the process of identifying these rogue encodings, the current version of unicode-archive.c prepends every piece of text with a letter to indicate what it guessed the original encoding to be, before it was converted to UTF-8. These prefixes are A for ASCII, L for Latin-1 and U for UTF-8. When displaying text from a database containing data converted by unicode-archive.c you must remember to drop the first letter of every string for now. Alternatively, you can modify unicode-archive.c to avoid importing the tagged text in the first place.
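Dropping the prefix at display time might look like the sketch below. The helper name split_encoding_tag is hypothetical, and the sketch assumes every string it is given was tagged by the conversion program; untagged text that happens to begin with A, L or U would lose its first letter.

```c
#include <stddef.h>

/* Hypothetical helper: separate the one-letter encoding tag from the
   text that follows it. Returns a pointer to the text after the tag
   and stores the tag in *tag. If the string does not start with one
   of the three known tags, it is returned unchanged and *tag is set
   to '\0'. */
const char *split_encoding_tag(const char *tagged, char *tag)
{
    if (tagged != NULL &&
        (tagged[0] == 'A' || tagged[0] == 'L' || tagged[0] == 'U')) {
        if (tag != NULL) *tag = tagged[0];
        return tagged + 1;                /* skip the tag letter */
    }
    if (tag != NULL) *tag = '\0';
    return tagged;
}
```

Keeping the tag available to the caller, rather than simply discarding it, makes it possible to list all the records that were guessed to be Latin-1 when hunting for rogue encodings.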

As before, the conversion program outputs a mixture of disc, track and link records, which can be separated before importing them into the database. The following table definitions describe the format of each of these record types.

CREATE TABLE iDisc
(
    mDiscNumber INTEGER,
    fLength INTEGER,
    fRevision INTEGER,
    fProcessor TEXT,
    fSubmitter TEXT,
    fLinks TEXT,
    fArtist TEXT,
    fTitle TEXT,
    fYear INTEGER,
    fGenre TEXT,
    fComment TEXT,
    fCategory CHARACTER,
    fFileName INTEGER,
    fDateTime INTEGER,
    fFileSize INTEGER,
    sTracks INTEGER,
    sVarious BOOLEAN,
    cDirty BOOLEAN,
    cEncoding CHARACTER,
    cNumbered BOOLEAN
);

CREATE TABLE iLink
(
    mDiscNumber INTEGER,
    fLinkCategory CHARACTER,
    fLinkName TEXT,
    fDateTime INTEGER,
    fFileCategory CHARACTER,
    fFileName INTEGER,
    fSource CHARACTER
);

CREATE TABLE iTrack
(
    mDiscNumber INTEGER,
    fNumber INTEGER,
    fFrame INTEGER,
    fLength INTEGER,
    fArtist TEXT,
    fTitle TEXT,
    fComment TEXT
);

Though it has been a long time since this site was updated, I have been doing a great deal of development work on freedb in the interim. It is my intention to document more of that effort here as soon as possible.
