The advent of low-cost mass-sequencing of genomes
presents significant data management difficulties.
These will grow worse as it becomes routine to
sequence the genomes of individual people and
organisms, because existing systems store and search
each genome separately. This approach is not
feasible for searching and
comparing the genomes of millions or billions of
individual organisms. This book seeks to solve this
problem by describing the DASH sequence alignment and
compression algorithms. DASH
exploits the overwhelming similarity among genomes
of a given species to reduce not only the database
size, but also the index size and search time. The
resulting novel approach to
database compression, index compression,
bioinformatics and information retrieval should be of
particular interest to anyone concerned with
the storage and efficient searching of large data
sets, whether DNA or any other data that offers
some degree of redundancy, such as natural-language
text or web pages.