Millions of raw data files plus derived sequence/qual score files
Use of a cluster of multi-CPU Solaris servers, 4 GB RAM each, sharing 6 TB of disk (mirrored
or RAID5)
Originally organized by a suite of Perl scripts, calling in-house and public-domain (PD) s/w
Data integrity issues - silent corruption experienced. Hence an integrity checker,
run in batch mode weekly (DB vs file system vs ASCII index files)
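The weekly batch check above amounts to a three-way reconciliation. A minimal sketch (in Python rather than the original Perl; record layout, file extension, and use of MD5 checksums are assumptions, not details from the source):

```python
import hashlib
from pathlib import Path

def md5sum(path):
    # Checksum the file in chunks so silent corruption is detectable
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def cross_check(db_records, archive_root, index_entries):
    """Three-way reconciliation: DB vs file system vs ASCII index files.

    db_records:    {name: checksum} as recorded in the database (assumed layout)
    index_entries: {name: checksum} parsed from the ASCII index files
    Returns names missing from any source or with disagreeing checksums.
    """
    on_disk = {p.name: md5sum(p) for p in Path(archive_root).iterdir()
               if p.is_file()}
    names = set(db_records) | set(on_disk) | set(index_entries)
    problems = []
    for name in sorted(names):
        sums = {db_records.get(name), on_disk.get(name), index_entries.get(name)}
        if None in sums or len(sums) > 1:   # missing somewhere, or mismatch
            problems.append(name)
    return problems
```

In batch mode this would run over the whole archive weekly; the key design point is that no single copy (DB, disk, index) is trusted as the ground truth.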
Filters out unreadable files, sequences that are too short, names that violate conventions
(==> lab screw-up), etc.
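Those filters can be sketched as one gatekeeper function. This is a Python illustration, not the original code; the naming convention regex, the minimum length, and the return shape are all hypothetical:

```python
import re
from pathlib import Path

# Hypothetical naming convention: species_tissue_run_well, e.g. "hs_liver_0421_A01"
NAME_RE = re.compile(r"^[a-z]{2}_[a-z]+_\d{4}_[A-H]\d{2}$")
MIN_LEN = 50  # assumed minimum usable read length, in bases

def passes_filters(path, seq):
    """Reject unreadable files, too-short sequences, and nonconforming names."""
    p = Path(path)
    try:
        with p.open("rb"):
            pass                        # unreadable file (permissions, missing)
    except OSError:
        return False, "unreadable"
    if len(seq) < MIN_LEN:
        return False, "too short"
    if not NAME_RE.match(p.stem):       # name violation ==> lab screw-up
        return False, "bad name"
    return True, "ok"
```

Returning a reason string alongside the verdict makes it easy to report back which lab conventions were violated.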
In early 2002, about 14,000,000 chromatograms in Archive & DB; DB was ~100 Gb,
biggest table was 19,000,000 rows
Different species, tissues, processes, runs on the same original sample
Production doubled every year
Backup/restore issues. Company's "crown jewels"
Later, an Oracle DB to organize it all, with Pro*C s/w to enter/extract data
Extracts sequence and qual scores (a statistical measure of how good each base call is)
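The standard statistical measure here is the Phred quality score, Q = -10 log10(P), where P is the estimated probability the base call is wrong. A small illustration (the source does not name the scoring scheme, so Phred is an assumption, though it was the de facto standard for chromatogram base calling):

```python
import math

def phred_quality(p_error):
    # Q = -10 * log10(P): e.g. a 1-in-1000 error chance gives Q30
    return -10 * math.log10(p_error)

def expected_errors(quals):
    # Sum of per-base error probabilities implied by a read's qual scores;
    # a quick way to rank whole reads by overall reliability
    return sum(10 ** (-q / 10.0) for q in quals)
```

Under this scheme a qual file is just one integer per called base, which is what the extraction step would write alongside each sequence.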