Spidering Hacks

Kevin Hemenway, Tara Calishain

Paperback

Other customers were also interested in

Bastian Ballmann
Network Hacks - Intensivkurs

64,99 €
Amir Shevat
Designing Bots

30,99 €

Product description

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:
* Aggregate and associate data from disparate locations, then store and manipulate the data as you like
* Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
* Integrate third-party data into your own applications or web sites
* Make your own site easier to scrape and more usable to others
* Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day

Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.

Product details

Publisher: O'Reilly Media
Pages: 402
Publication date: December 2, 2003
Language: English
Dimensions: 230mm x 154mm x 29mm
Weight: 590g
ISBN-13: 9780596005771
ISBN-10: 0596005776
Item no.: 12228416

Manufacturer information
Libri GmbH
Europaallee 1
36244 Bad Hersfeld
gpsr@libri.de

About the authors

Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love. Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.

Table of contents

Credits
About the Authors
Contributors
Preface
Why Spidering Hacks?
How This Book Is Organized
How to Use This Book
Conventions Used in This Book
How to Contact Us
Got a Hack?
Chapter 1: Walking Softly
1.1 Hacks #1-7
1 A Crash Course in Spidering and Scraping
2 Best Practices for You and Your Spider
3 Anatomy of an HTML Page
4 Registering Your Spider
5 Preempting Discovery
6 Keeping Your Spider Out of Sticky Situations
7 Finding the Patterns of Identifiers
Chapter 2: Assembling a Toolbox
2.1 Hacks #8-32
2.2 Perl Modules
2.3 Resources You May Find Helpful
8 Installing Perl Modules
9 Simply Fetching with LWP::Simple
10 More Involved Requests with LWP::UserAgent
11 Adding HTTP Headers to Your Request
12 Posting Form Data with LWP
13 Authentication, Cookies, and Proxies
14 Handling Relative and Absolute URLs
15 Secured Access and Browser Attributes
16 Respecting Your Scrapee's Bandwidth
17 Respecting robots.txt
18 Adding Progress Bars to Your Scripts
19 Scraping with HTML::TreeBuilder
20 Parsing with HTML::TokeParser
21 WWW::Mechanize 101
22 Scraping with WWW::Mechanize
23 In Praise of Regular Expressions
24 Painless RSS with Template::Extract
25 A Quick Introduction to XPath
26 Downloading with curl and wget
27 More Advanced wget Techniques
28 Using Pipes to Chain Commands
29 Running Multiple Utilities at Once
30 Utilizing the Web Scraping Proxy
31 Being Warned When Things Go Wrong
32 Being Adaptive to Site Redesigns
Chapter 3: Collecting Media Files
3.1 Hacks #33-42
33 Detective Case Study: Newgrounds
34 Detective Case Study: iFilm
35 Downloading Movies from the Library of Congress
36 Downloading Images from Webshots
37 Downloading Comics with dailystrips
38 Archiving Your Favorite Webcams
39 News Wallpaper for Your Site
40 Saving Only POP3 Email Attachments
41 Downloading MP3s from a Playlist
42 Downloading from Usenet with nget
Chapter 4: Gleaning Data from Databases
4.1 Hacks #43-89
43 Archiving Yahoo! Groups Messages with yahoo2mbox
44 Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
45 Gleaning Buzz from Yahoo!
46 Spidering the Yahoo! Catalog
47 Tracking Additions to Yahoo!
48 Scattersearch with Yahoo! and Google
49 Yahoo! Directory Mindshare in Google
50 Weblog-Free Google Results
51 Spidering, Google, and Multiple Domains
52 Scraping Amazon.com Product Reviews
53 Receive an Email Alert for Newly Added Amazon.com Reviews
54 Scraping Amazon.com Customer Advice
55 Publishing Amazon.com Associates Statistics
56 Sorting Amazon.com Recommendations by Rating
57 Related Amazon.com Products with Alexa
58 Scraping Alexa's Competitive Data with Java
59 Finding Album Information with FreeDB and Amazon.com
60 Expanding Your Musical Tastes
61 Saving Daily Horoscopes to Your iPod
62 Graphing Data with RRDTOOL
63 Stocking Up on Financial Quotes
64 Super Author Searching
65 Mapping O'Reilly Best Sellers to Library Popularity
66 Using All Consuming to Get Book Lists
67 Tracking Packages with FedEx
68 Checking Blogs for New Comments
69 Aggregating RSS and Posting Changes
70 Using the Link Cosmos of Technorati
71 Finding Related RSS Feeds
72 Automatically Finding Blogs of Interest
73 Scraping TV Listings
74 What's Your Visitor's Weather Like?
75 Trendspotting with Geotargeting
76 Getting the Best Travel Route by Train
77 Geographic Distance and Back Again
78 Super Word Lookup
79 Word Associations with Lexical Freenet
80 Reformatting Bugtraq Reports
81 Keeping Tabs on the Web via Email
82 Publish IE's Favorites to Your Web Site
83 Spidering GameStop.com Game Prices
84 Bargain Hunting with PHP
85 Aggregating Multiple Search Engine Results
86 Robot Karaoke
87 Searching the Better Business Bureau
88 Searching for Health Inspections
89 Filtering for the Naughties
Chapter 5: Maintaining Your Collections
5.1 Hacks #90-93
90 Using cron to Automate Tasks
91 Scheduling Tasks Without cron
92 Mirroring Web Sites with wget and rsync
93 Accumulating Search Results Over Time
Chapter 6: Giving Back to the World
6.1 Hacks #94-100
94 Using XML::RSS to Repurpose Data
95 Placing RSS Headlines on Your Site
96 Making Your Resources Scrapable with Regular Expressions
97 Making Your Resources Scrapable with a REST Interface
98 Making Your Resources Scrapable with XML-RPC
99 Creating an IM Interface
100 Going Beyond the Book
Colophon

Reviews

"Spidering and scraping? In the age of Web 2.0, RSS feeds, and web services, isn't that so very Web 1.0 (beta)? Do we really need this book for that? [...] I found it important, exciting (yes, really), and informative. [...] First of all, it introduces [...] the basics and reminds us at length that we have to teach our spiders 'good manners', which includes respecting robots.txt as well as the obligation not to cripple the server being spidered with overly frequent requests. And our spider should be cleanly identifiable, too. After that, it gets down to business. A (Perl) toolbox is assembled that makes spidering and scraping easier. But other Unix tools, such as wget or lynx, are covered as well. And then it gets really interesting: examples are presented that show how to build mashups with simple means [...]. Almost all of the examples are in Perl and fully documented, and the explanations are wittily written, in an English that even I can read with confidence. So, to come back to the opening question: yes, you do need a book like this, and if it is as well written as Spidering Hacks, all the better." - Schockwellenreiter.de, July 2006. Read the full review at: http://www.schockwellenreiter.de/2006/07/26.html#ichHabeGelesenSpideringHacks
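
The three "good manners" the review mentions (respecting robots.txt, not flooding the server with requests, and being identifiable) can be illustrated with a minimal sketch. It is not taken from the book, whose examples are almost all in Perl; it uses Python's standard library instead. The class name `PoliteGate`, the user-agent string, the `example.com` URLs, and the two-second delay are all hypothetical choices made for illustration.

```python
import time
from urllib import robotparser

# Hypothetical identity: a name plus a URL where site owners can learn more.
USER_AGENT = "ExampleSpider/0.1 (+http://example.com/spider-info)"


class PoliteGate:
    """Tracks whether a URL may be fetched and how long to wait first."""

    def __init__(self, robots_lines, user_agent=USER_AGENT,
                 min_delay=2.0, clock=time.monotonic):
        # Parse robots.txt rules that have already been downloaded.
        self.parser = robotparser.RobotFileParser()
        self.parser.parse(robots_lines)
        self.user_agent = user_agent
        self.min_delay = min_delay      # assumed minimum gap between requests
        self.clock = clock              # injectable so the logic is testable
        self.last_request = None

    def allowed(self, url):
        """True if the parsed robots.txt rules permit fetching this URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self):
        """Seconds to pause so requests stay at least min_delay apart."""
        if self.last_request is None:
            return 0.0
        elapsed = self.clock() - self.last_request
        return max(0.0, self.min_delay - elapsed)

    def record_request(self):
        """Call right after each request is sent."""
        self.last_request = self.clock()


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    gate = PoliteGate(rules)
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        if gate.allowed(url):
            time.sleep(gate.seconds_to_wait())
            gate.record_request()
            print("would fetch", url, "as", gate.user_agent)
        else:
            print("skipping", url)
```

In a real spider, the rules would be fetched from the site's own robots.txt (for example with `RobotFileParser.set_url` and `read`), and the same user-agent string would be sent in the `User-Agent` header of every request so that site owners can identify the spider.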

Condition	Price	Shipping	Payment	Seller	Rating
used; very good	21,87 €	5,90 €	bank transfer, PayPal	NEPO UG	98,1%
light signs of use	10,00 €	2,55 €	bank transfer, PayPal	SadikMejid	99,9%