It is desirable to put a database into optimal normal form to reduce the level of redundancy in the stored information. This helps to avoid the risk of corrupting the database by introducing inconsistent or ambiguous information; however, the process may or may not make the database more efficient to use. By storing a particular set of values only once, the storage and input/output requirements may be reduced. On the other hand, considerably more computation may be required to reconstruct derivable values.

The first step is to identify sets of common values in the database and move them into a common table, creating master/detail associations. This has already been done in the data conversion, where groups of track records refer to a single disc record containing the information about each CD that is common to all the tracks on it.

The DTitle and TTitle fields should also be split into separate Artist and Title fields. The convention in freedb is that Artist and Title are separated by <space><slash><space>. If the separator is not present then, in the case of an album, the album is self-titled, and, in the case of a track, the track is by the artist who created the album.
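
The split itself is done in the C extraction program described below, but for illustration the rule can be expressed in SQL along these lines (a hypothetical sketch with made-up sample values):


-- Hypothetical illustration of the DTitle splitting rule; the real work
-- is done by the C extraction program.
SELECT
    CASE WHEN POSITION(' / ' IN fDTitle) > 0
         THEN SUBSTRING(fDTitle FROM 1 FOR POSITION(' / ' IN fDTitle) - 1)
         ELSE fDTitle END AS fArtist,          -- no separator: self-titled album
    CASE WHEN POSITION(' / ' IN fDTitle) > 0
         THEN SUBSTRING(fDTitle FROM POSITION(' / ' IN fDTitle) + 3)
         ELSE fDTitle END AS fTitle
FROM (VALUES ('Pink Floyd / The Wall'), ('Bad Company')) AS t(fDTitle);
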

At this point there are a number of text fields whose values are shared by many records. The unique text values can be moved out to separate control tables, with each occurrence replaced by a reference to the corresponding control record. This will create separate record types for:

  • the submitting software of an album,
  • the server software that accepted it,
  • the genre description,
  • the artist and
  • the album or track title.

Since over the years there has been some confusion among contributors as to the expected ordering of artist and title, these values will be stored in a single generic Text record type. It is to be hoped that at some point later in the processing, something can be done to sort these out.

The original file path of each record should also be separated into two fields:

  • a category code for the directory and
  • the discid that is used as the file name.

To save space a single character will be used to define the category, and the eight-digit hexadecimal file name will be converted to a 32-bit integer for efficient handling in the database. Although in most cases the discid field is also a single eight-digit hexadecimal value, some of the values are lists of discids which partially reproduce the link information. They will be handled specially at a later date.
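
For illustration, the file-name conversion performed by the C program is equivalent to the following SQL on reasonably recent PostgreSQL versions (a hypothetical sketch with a made-up discid; values with the high bit set come out negative as a signed 32-bit integer):


-- Hypothetical sketch: eight hexadecimal digits interpreted as a signed 32-bit integer.
SELECT ('x' || '940aee0b')::BIT(32)::INTEGER AS fFile;
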

To save time, the length in frames of each track is calculated and stored now, as well as a flag to indicate when there is suspicious-looking content in artist and title values. It would also be desirable to do the following (sketched briefly after the list):

  • replace control characters with spaces,
  • replace runs of spaces with a single space and
  • remove redundant track numbers from the beginning of track titles.
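
For illustration only, these clean-ups could be expressed in SQL along the following lines (hypothetical sample values; assumes a PostgreSQL version that provides regexp_replace):


-- Hypothetical illustration of the desired clean-ups.
SELECT REGEXP_REPLACE(
         REGEXP_REPLACE(
           REGEXP_REPLACE(fTitle, '[[:cntrl:]]', ' ', 'g'),  -- control characters become spaces
           '  +', ' ', 'g'),                                 -- runs of spaces collapse to one
         '^[0-9]+[ .)-]+', '') AS cleaned                    -- leading track numbers removed
FROM (VALUES ('03 - Money'), ('Brain' || CHR(9) || 'Damage')) AS t(fTitle);
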

Although all of the above processing steps could be performed using SQL, they are more efficiently carried out by further extending the C program which has been used previously for extracting the data. The new version is called analyze-archive.c. Note that this program also takes the liberty of producing some extra fields that will greatly improve the efficiency of certain processing operations later on. These values are redundant from the point of view of a database design in optimal normal form, but it takes only a few more seconds of computation to obtain them at this stage.

The following Linux commands extract and summarise specific columns to be loaded into the control tables. They produce tab-separated files containing a unique value and the number of times that it occurs in the database. In the case of artists and titles, separate counts are calculated for each possible context: disc-artist, disc-title, track-artist and track-title. These four files are then combined into one. To ensure consistency across all of the following sorting operations, be sure to set the locale to a known value. Using C as the locale performs a purely binary sort and is much faster than language-specific locales. Note that it will be necessary to substitute a literal tab for <backslash>t in the commands below. You may need to type <control-V><TAB> to get your shell to accept a tab character.


make analyze-archive

bunzip2 < freedb-complete-20050104.tar.bz2 |\
./analyze-archive > freedb-complete-20050104.tab

grep '^[^0-9A-Z]' freedb-complete-20050104.tab > discs.tab

grep '^[0-9]' freedb-complete-20050104.tab > tracks.tab

grep '^[A-Z]' freedb-complete-20050104.tab > links.tab

export LC_ALL=C
export LC_COLLATE=C

cut --fields=5 < discs.tab |\
sort --stable --field-separator="\t" --key=1 |\
awk --file group-lines.awk > processor.tab

cut --fields=6 < discs.tab |\
sort --stable --field-separator="\t" --key=1 |\
awk --file group-lines.awk > submitter.tab

cut --fields=8 < discs.tab |\
sort --stable --field-separator="\t" --key=1 |\
awk --file group-lines.awk > disc-artists.tab

cut --fields=9 < discs.tab |\
sort --stable --field-separator="\t" --key=1 |\
awk --file group-lines.awk > disc-titles.tab

cut --fields=11 < discs.tab |\
sort --stable --field-separator="\t" --key=1 |\
awk --file group-lines.awk > genre.tab

cut --fields=5 < tracks.tab |\
sort --stable --field-separator="\t" --key=1 |\
awk --file group-lines.awk > track-artists.tab

cut --fields=6 < tracks.tab |\
sort --stable --field-separator="\t" --key=1 |\
awk --file group-lines.awk > track-titles.tab

join -a1 -a2 -e0 -o0,1.2,2.2 -t"\t" disc-artists.tab disc-titles.tab |\
sed 1s/^0// |\
join -a1 -a2 -e0 -o0,1.2,1.3,2.2 -t"\t" - track-artists.tab |\
sed 1s/^0// |\
join -a1 -a2 -e0 -o0,1.2,1.3,1.4,2.2 -t"\t" - track-titles.tab |\
sed 1s/^0// > text.tab


The awk script group-lines.awk is listed below. Unless you have a large amount of memory, you will find that the traditional awk approach of using associative arrays does not cope well with the amount of data that has just been plucked from more than 20 million track records. By pre-sorting the data and then passing it through a stream processor, the required work can be done very fast with a minimum amount of memory.


# group-lines.awk: count runs of identical lines in pre-sorted input,
# emitting each distinct value followed by a tab and its count.

BEGIN { T = ""; C = 0; }                        # T = current value, C = its running count

{
    if ($0 "" == T)                             # appending "" forces a string comparison
        C++;
    else {
        if (C > 0)
            printf("%s\t%d\n",T,C);             # flush the previous group
        T = $0;
        C = 1;
    }
}

END { if (C > 0) printf("%s\t%d\n",T,C); }      # flush the final group


Now here are the revised and additional table definitions that are needed for importing the data.


CREATE TABLE iDisc
(
    fDummy TEXT NOT NULL,
    fDiscNumber INTEGER NOT NULL,
    fDiscLength INTEGER NOT NULL,
    fRevision INTEGER NOT NULL,
    fProcessor TEXT NOT NULL,
    fSubmitter TEXT NOT NULL,
    fDiscId TEXT NOT NULL,
    fArtist TEXT NOT NULL,
    fTitle TEXT NOT NULL,
    fDYear TEXT NOT NULL,
    fDGenre TEXT NOT NULL,
    fExtD TEXT NOT NULL,
    fCategory CHAR NOT NULL,
    fFile INTEGER NOT NULL,
    fTime INTEGER NOT NULL,
    fSize INTEGER NOT NULL,
    sTracks INTEGER NOT NULL,
    sVarious BOOLEAN NOT NULL,
    cDirty BOOLEAN NOT NULL
);

CREATE TABLE iGenre
(
    fGenre TEXT NOT NULL,
    sDiscs INTEGER NOT NULL
);

CREATE TABLE iLink
(
    fLinkCat CHAR NOT NULL,
    fLinkId INTEGER NOT NULL,
    fTime INTEGER NOT NULL,
    fFileCat CHAR NOT NULL,
    fFileId INTEGER NOT NULL
);

CREATE TABLE iProcessor
(
    fProcessor TEXT NOT NULL,
    sDiscs INTEGER NOT NULL
);

CREATE TABLE iSubmitter
(
    fSubmitter TEXT NOT NULL,
    sDiscs INTEGER NOT NULL
);

CREATE TABLE iText
(
    fText TEXT NOT NULL,
    sDiscArtist INTEGER NOT NULL,
    sDiscTitle INTEGER NOT NULL,
    sTrackArtist INTEGER NOT NULL,
    sTrackTitle INTEGER NOT NULL
);

CREATE TABLE iTrack
(
    fTrack INTEGER NOT NULL,
    fDiscNumber INTEGER NOT NULL,
    fFrame INTEGER NOT NULL,
    fLength INTEGER NOT NULL,
    fArtist TEXT NOT NULL,
    fTitle TEXT NOT NULL,
    fExtT TEXT NOT NULL,
    cDirty BOOLEAN NOT NULL
);


Import the data with the following commands:


psql --command "COPY iDisc FROM STDIN" freedb < discs.tab

psql --command "COPY iGenre FROM STDIN" freedb < genre.tab

psql --command "COPY iLink FROM STDIN" freedb < links.tab

psql --command "COPY iProcessor FROM STDIN" freedb < processor.tab

psql --command "COPY iSubmitter FROM STDIN" freedb < submitter.tab

psql --command "COPY iText FROM STDIN" freedb < text.tab

psql --command "COPY iTrack FROM STDIN" freedb < tracks.tab


The imported data is to be merged with a number of permanent tables. If the database is to remain useful in the long run, it must be possible to update it with new records without having to rebuild it from scratch every time. Hence the following table definitions, which allow this to happen:


CREATE TABLE tGenre
(
    kGenre SERIAL PRIMARY KEY,
    fDescription TEXT NOT NULL UNIQUE,
    sDiscs INTEGER NOT NULL
);

CREATE TABLE tProcessor
(
    kProcessor SERIAL PRIMARY KEY,
    fDescription TEXT NOT NULL UNIQUE,
    sDiscs INTEGER NOT NULL
);

CREATE TABLE tSubmitter
(
    kSubmitter SERIAL PRIMARY KEY,
    fDescription TEXT NOT NULL UNIQUE,
    sDiscs INTEGER NOT NULL
);

CREATE TABLE tText
(
    kText SERIAL NOT NULL,
    fText TEXT NOT NULL,
    sDiscArtist INTEGER DEFAULT 0 NOT NULL,
    sDiscTitle INTEGER DEFAULT 0 NOT NULL,
    sTrackArtist INTEGER DEFAULT 0 NOT NULL,
    sTrackTitle INTEGER DEFAULT 0 NOT NULL,
    pName INTEGER DEFAULT 0 NOT NULL
);

INSERT INTO tText(kText,fText) VALUES (0,'');
INSERT INTO tText(fText) VALUES ('Various');

CREATE TABLE tDisc
(
    kDisc SERIAL PRIMARY KEY,
    fDiscNumber INTEGER NOT NULL,
    fLength INTEGER NOT NULL,
    fRevision INTEGER NOT NULL,
    pSubmitter INTEGER NOT NULL,
    pProcessor INTEGER NOT NULL,
    pArtistText INTEGER NOT NULL,
    pTitleText INTEGER NOT NULL,
    fYear INTEGER,
    pGenre INTEGER NOT NULL,
    fComment TEXT NOT NULL,
    pCategory INTEGER NOT NULL,
    fFile INTEGER NOT NULL,
    fTime INTEGER NOT NULL,
    fSize INTEGER NOT NULL,
    sTracks INTEGER NOT NULL,
    sVarious BOOLEAN NOT NULL,
    cDirty BOOLEAN NOT NULL
);

CREATE TABLE tLink
(
    kLink SERIAL,
    pDisc INTEGER NOT NULL,
    pCategory INTEGER NOT NULL,
    fDiscId TEXT NOT NULL,
    fSource CHARACTER NOT NULL
);

CREATE TABLE tTrack
(
    kTrack SERIAL,
    pDisc INTEGER NOT NULL,
    fTrack INTEGER NOT NULL,
    fFrame INTEGER NOT NULL,
    fLength INTEGER NOT NULL,
    pArtistText INTEGER NOT NULL,
    pTitleText INTEGER NOT NULL,
    fComment TEXT NOT NULL,
    cDirty BOOLEAN NOT NULL
);


After the initial bulk load, a number of constraints and indices will be added to the database to ensure its efficiency and integrity. However, when loading very large amounts of data, they would only slow the process down unnecessarily.
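
To give an idea of what is involved, the statements will be of the following kind (illustrative examples only, with hypothetical index names, not the definitive set):


-- Illustrative examples only, not the definitive set of constraints and indices.
ALTER TABLE tText ADD PRIMARY KEY (kText);
ALTER TABLE tTrack ADD PRIMARY KEY (kTrack);
CREATE INDEX ixTrackDisc ON tTrack (pDisc);
CREATE INDEX ixTextValue ON tText (fText);
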

One more control table is needed before the imported data can be processed into the permanent tables. The category table can be created and loaded thus:


CREATE TABLE tCategory
(
    kCategory SERIAL PRIMARY KEY,
    fCode CHAR NOT NULL UNIQUE,
    fName TEXT NOT NULL UNIQUE,
    fDescription TEXT NOT NULL
);

COPY tCategory(fCode,fName,fDescription) FROM STDIN WITH DELIMITER '|';
A|blues|self explanatory
B|classical|self explanatory
C|country|self explanatory
D|data|ISO9660 and other data CDs
E|folk|self explanatory
F|jazz|self explanatory
G|misc|discs that do not fit in the other categories
H|newage|self explanatory
I|reggae|self explanatory
J|rock|including funk, soul, rap, pop, industrial, metal, etc.
K|soundtrack|movies, shows
\.


For each control table, the totals of pre-existing values are updated with the new counts and new values are appended to those already there, before the imported records are deleted ready for the next batch.

These steps are encapsulated in a single transaction to ensure that if processing is interrupted for any reason, it can be resumed easily without inadvertently repeating any update operations and causing the counters to become inaccurate. Ideally these totals would not be stored and would be calculated on demand, but some of the potential users of this database do not have access to a mainframe computer.

Before commencing the processing, it is a good idea to temporarily increase the amount of memory that PostgreSQL is allowed to use for building hash tables and sorting. There is no point in creating indices at this stage, as the merge operations are so large that the database manager is going to perform a sequential scan on all the tables anyway. Be careful not to increase this value too much, or your system may not have enough memory to finish some of the larger steps.

I found 150MB to be a good value for my system, which has 1GB of memory. You can determine a good value for your own system by trying different values for SORT_MEM and using the EXPLAIN command to predict a cost for each query at that value. When increasing SORT_MEM no longer decreases the expected cost of the query, you have a good value to work with.
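
For example, one of the merge queries shown below can be explained at two different settings and the estimated costs compared (a hypothetical session):


SET SORT_MEM = 50000;
EXPLAIN UPDATE tGenre SET sDiscs = tGenre.sDiscs + a.sDiscs
FROM iGenre a WHERE tGenre.fDescription = a.fGenre;

SET SORT_MEM = 150000;
EXPLAIN UPDATE tGenre SET sDiscs = tGenre.sDiscs + a.sDiscs
FROM iGenre a WHERE tGenre.fDescription = a.fGenre;
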


SET SORT_MEM = 150000;


BEGIN;

UPDATE tGenre SET sDiscs = tGenre.sDiscs + a.sDiscs
FROM iGenre a WHERE tGenre.fDescription = a.fGenre;

DELETE FROM iGenre WHERE EXISTS
(SELECT * FROM tGenre a WHERE iGenre.fGenre = a.fDescription);

INSERT INTO tGenre(fDescription,sDiscs)
SELECT fGenre,sDiscs FROM iGenre;

TRUNCATE TABLE iGenre;

COMMIT;

VACUUM FULL ANALYZE tGenre;
VACUUM ANALYZE iGenre;


BEGIN;

UPDATE tProcessor SET sDiscs = tProcessor.sDiscs + a.sDiscs
FROM iProcessor a WHERE tProcessor.fDescription = a.fProcessor;

DELETE FROM iProcessor WHERE EXISTS
(SELECT * FROM tProcessor a WHERE iProcessor.fProcessor = a.fDescription);

INSERT INTO tProcessor(fDescription,sDiscs)
SELECT fProcessor,sDiscs FROM iProcessor;

TRUNCATE TABLE iProcessor;

COMMIT;

VACUUM FULL ANALYZE tProcessor;
VACUUM ANALYZE iProcessor;


BEGIN;

UPDATE tSubmitter SET sDiscs = tSubmitter.sDiscs + a.sDiscs
FROM iSubmitter a WHERE tSubmitter.fDescription = a.fSubmitter;

DELETE FROM iSubmitter WHERE EXISTS
(SELECT * FROM tSubmitter a WHERE iSubmitter.fSubmitter = a.fDescription);

INSERT INTO tSubmitter(fDescription,sDiscs)
SELECT fSubmitter,sDiscs FROM iSubmitter;

TRUNCATE TABLE iSubmitter;

COMMIT;

VACUUM FULL ANALYZE tSubmitter;
VACUUM ANALYZE iSubmitter;


BEGIN;

UPDATE tText SET
    sDiscArtist = tText.sDiscArtist + a.sDiscArtist,
    sDiscTitle = tText.sDiscTitle + a.sDiscTitle,
    sTrackArtist = tText.sTrackArtist + a.sTrackArtist,
    sTrackTitle = tText.sTrackTitle + a.sTrackTitle
FROM
    iText a
WHERE
    tText.fText = a.fText;

DELETE FROM iText WHERE EXISTS
(SELECT * FROM tText a WHERE iText.fText = a.fText);

INSERT INTO tText(fText,sDiscArtist,sDiscTitle,sTrackArtist,sTrackTitle)
SELECT fText,sDiscArtist,sDiscTitle,sTrackArtist,sTrackTitle FROM iText;

TRUNCATE TABLE iText;

COMMIT;

VACUUM FULL ANALYZE tText;
VACUUM ANALYZE iText;


The next step will create disc records in which all common values have been replaced with references to the surrogate keys of the relevant control tables.

Note that in the event of a missing value, it is better to use an impossible value like -1 than to let the missing value just be NULL. If you omit the COALESCE expressions and there is a missing value, the transaction will be aborted and you will be none the wiser. But if you substitute -1 for those values you can easily find them afterwards and determine what is missing, and why.
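
For example, once the load below has completed, the substituted values can be located with simple queries like these (hypothetical follow-up queries against the tables defined above):


SELECT COUNT(*) FROM tDisc WHERE pGenre = -1;
SELECT * FROM tDisc WHERE pArtistText = -1 OR pTitleText = -1 LIMIT 10;
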


SET SORT_MEM = 150000;


BEGIN;

INSERT INTO tDisc
(
    fDiscNumber,
    fLength,
    fRevision,
    pSubmitter,
    pProcessor,
    pArtistText,
    pTitleText,
    fYear,
    pGenre,
    fComment,
    pCategory,
    fFile,
    fTime,
    fSize,
    sTracks,
    sVarious,
    cDirty
)
SELECT
    a.fDiscNumber,
    a.fDiscLength,
    a.fRevision,
    COALESCE(c.kSubmitter,-1),
    COALESCE(d.kProcessor,-1),
    COALESCE(g.kText,-1),
    COALESCE(h.kText,-1),
    CASE WHEN LENGTH(a.fDYear) BETWEEN 1 AND 8 THEN a.fDYear::INTEGER END,
    COALESCE(e.kGenre,-1),
    a.fExtD,
    COALESCE(f.kCategory,-1),
    a.fFile,
    a.fTime,
    a.fSize,
    a.sTracks,
    a.sVarious,
    a.cDirty
FROM
    iDisc a
    LEFT JOIN tSubmitter c ON a.fSubmitter = c.fDescription
    LEFT JOIN tProcessor d ON a.fProcessor = d.fDescription
    LEFT JOIN tGenre e ON a.fDGenre = e.fDescription
    LEFT JOIN tCategory f ON a.fCategory = f.fCode
    LEFT JOIN tText g ON a.fArtist = g.fText
    LEFT JOIN tText h ON a.fTitle = h.fText;

TRUNCATE TABLE iDisc;

COMMIT;

VACUUM ANALYZE tDisc;
VACUUM ANALYZE iDisc;


This one is going to take a while. You should maybe go watch a movie.


SET SORT_MEM = 150000;


BEGIN;

INSERT INTO tTrack
(
    pDisc,
    fTrack,
    fFrame,
    fLength,
    pArtistText,
    pTitleText,
    fComment,
    cDirty
)
SELECT
    COALESCE(c.kDisc,-1),
    a.fTrack,
    a.fFrame,
    a.fLength,
    COALESCE(d.kText,-1),
    COALESCE(e.kText,-1),
    a.fExtT,
    a.cDirty
FROM
    iTrack a
    LEFT JOIN tDisc c ON a.fDiscNumber = c.fDiscNumber
    LEFT JOIN tText d ON a.fArtist = d.fText
    LEFT JOIN tText e ON a.fTitle = e.fText;

TRUNCATE TABLE iTrack;

COMMIT;

VACUUM ANALYZE tTrack;
VACUUM ANALYZE iTrack;


That is quite a lot of code to be run, and it really needs to be deployed in such a way that it can be run automatically. It also needs to be able to recover efficiently from the kinds of things that inevitably go wrong in a production environment. Depending on the power of the system on which it is hosted, it could take many hours to build the initial system, and a significant amount of time to update it regularly.

All the code that has been presented so far has been carefully designed and tested to run as quickly as possible on a wide variety of systems without sacrificing the kind of robustness that is required on a critical system. It has been tailored to use a predictable amount of memory, regardless of the size of the dataset, and to be able to quickly resume processing from where it was interrupted should that happen. Computation time can be valuable when schedules are tight, so it had better not be wasted.

However, to automate all this requires a fairly sophisticated program. In a later instalment I will present a system which can automatically handle all of these problems with ease.