Many types of sequential symbolic data possess
structure that is (i) hierarchical and (ii) context-
sensitive. Natural-language text and transcribed
speech are prime examples of such data: a corpus of
language consists of sentences, defined over a
finite lexicon of symbols such as words. Linguists
traditionally analyze the sentences into recursively
structured phrasal constituents; at the same time, a
distributional analysis of partially aligned
sentential contexts reveals in the lexicon clusters
that are said to correspond to various syntactic
categories (such as nouns or verbs). Such structure,
however, is not limited to natural languages:
recurring motifs are found, on a level of
description that is common to all life on earth, in
the base sequences of DNA that constitute the
genome. In this book, I address the problem of
extracting patterns from natural sequential data and
inferring the underlying rules that govern their
production. This problem is relevant to both
linguistics and bioinformatics, two fields that
investigate sequential symbolic data that are
hierarchical and context-sensitive.