Search engines have become so indispensable that
they rank second only to e-mail as the most popular
online activity. To respond to queries in a timely
fashion, search engines make use of large indices of
word occurrences on Web pages to cross-reference
websites to keywords. Such indices are maintained by
spiders, a special kind of computer program that
browses the Web autonomously. However, due to a
variety of technological limitations, a single
spider has proven insufficient to maintain a search
engine's index. Hence, in this book, we review
several alternatives for splitting a spider's work
across multiple processes, and define a methodology
for keeping an index of the Web up to date.
SharpSpider, our prototype spider, has been
evaluated using the resources of PlanetLab, a
globally distributed platform for developing and
deploying planetary-scale services. Despite using
very modest equipment, we have
performed large crawls of the Web, distributing the
workload amongst various computers spread across
different continents. The statistics derived from
our research offer valuable insight into the nature
of educational Web resources.