It is one thing to extract enough data from the freedb database to do some searches using SQL, but quite another to capture all the information required to reconstruct the complete database in a more efficient format. In particular, the files that record discs and tracks are stored in 11 different directories, and there are numerous hard links which connect records to "near miss" disc-ids. It would also be handy to retain the modification timestamps of records in the database, and for statistics junkies like me, the file sizes too.

The additional information can be obtained from the tar archive using the --list and --verbose options and then merged with the data extracted in the previous steps, but that requires a second pass and a great deal of additional processing. Instead, an enhanced version of the extraction program could read the meta information directly from the tar achive, especially as it is a simple file format and easy to decode. Here's a suitably enhanced version of the program that I prepared earlier: process-archive.c

This version of the program no longer needs tar to prepare the data for processing but it still must be decompressed.


make process-archive

bunzip2 < freedb-complete-20040501.tar.bz2 |\
./process-archive > freedb-complete-20040501.tab


The additional archive fields are added to the end of the tab delimited disc records, but a new record type is needed to hold the information about linked files in the archive. Note that the raw unix time (seconds since 1970) is extracted straight from the tar file and stored as an integer, rather than wasting processing time and storage converting it into a unix timestamp before it is needed.


CREATE TABLE iDisc
(
    fTrack TEXT NOT NULL, -- always blank for discs
    fDiscNumber INTEGER NOT NULL,
    fDiscLength INTEGER NOT NULL,
    fRevision INTEGER NOT NULL,
    fProcessor TEXT NOT NULL,
    fSubmitter TEXT NOT NULL,
    fDiscId TEXT NOT NULL,
    fDTitle TEXT NOT NULL,
    fDYear TEXT NOT NULL,
    fDGenre TEXT NOT NULL,
    fExtD TEXT NOT NULL,
    fFilePath TEXT NOT NULL,
    fSeconds INTEGER NOT NULL,
    fFileSize INTEGER NOT NULL
);

CREATE TABLE iLink
(
    fLinkPath TEXT NOT NULL,
    fTime INTEGER NOT NULL,
    fFilePath TEXT NOT NULL
);

CREATE TABLE iTrack
(
    fTrack INTEGER NOT NULL,
    fDiscNumber INTEGER NOT NULL,
    fFrame INTEGER NOT NULL,
    fTTitle TEXT NOT NULL,
    fExtT TEXT NOT NULL
);


As before, the three record types can be separated using grep and loaded directly into the import tables.


grep '^[^0-9a-z]' < freedb-complete-20040501.tab |\
psql --command "COPY iDisc FROM STDIN" freedb

grep '^[a-z]' < freedb-complete-20040501.tab |\
psql --command "COPY iLink FROM STDIN" freedb

grep '^[0-9]' < freedb-complete-20040501.tab |\
psql --command "COPY iTrack FROM STDIN" freedb

vacuumdb --analyze freedb


Now we have all the useful information that can possibly be obtained from the freedb tar archives. The next step is to break it down into optimal normal form and convert it into an efficient relational database design.