Introduction
A web crawler, also known as a robot or spider, is typically used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.
Purposes :
Search engine indexing (SEI) : create a local index for search engines.
Web archiving (WA) : collect information to preserve data for future use.
Web mining (WMI) : extract useful knowledge from the internet (data mining).
Web monitoring (WMO) : monitor copyright and trademark infringements over the internet.
STEP 1 - Understand the problem and establish the scope
| Question | Answer |
|---|---|
| Main purpose: SEI, WA, … ? | Web crawler for SEI |
| How many pages to collect per month? | 1 billion pages |
| What content types? HTML, PDF, etc. | HTML only |
| Target: new content or updated content? | New and updated content |
| Store HTML pages? How long? | Yes, up to 5 years |
| How should duplicate content be handled? | Duplicate content is ignored |

Back-of-the-envelope estimation: see the page References & Glossary.
STEP 2 - High-level design
A glossary explaining the boxes (components) is available on the page References & Glossary.
Web Crawler Workflow
STEP 3 - Design deep dive
Starting point : Choose the right search algorithm
DFS (depth-first search) or BFS (breadth-first search)? BFS is commonly used by web crawlers and relies on a FIFO (first-in, first-out) queue.
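As an illustration, here is a minimal BFS crawl loop in Python. `fetch_links` is a hypothetical helper that downloads a page and returns its outgoing URLs, and `max_pages` is an arbitrary bound; this is a sketch, not a production crawler.

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=1000):
    """Traverse the web graph breadth-first using a FIFO queue.

    `fetch_links(url)` is a hypothetical helper that downloads a page
    and returns the URLs it links to.
    """
    queue = deque(seed_urls)          # FIFO: first-in, first-out
    visited = set(seed_urls)
    crawled = []

    while queue and len(crawled) < max_pages:
        url = queue.popleft()         # take the oldest URL first (BFS)
        crawled.append(url)
        for link in fetch_links(url):
            if link not in visited:   # avoid re-crawling the same URL
                visited.add(link)
                queue.append(link)
    return crawled
```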
Handle too many requests (URL Frontier)
The purpose is to regulate the rate of visits to each host within a short period (politeness).
The Queue Router (QR) ensures that each queue (Q1, Q2, .., Qn) only contains URLs from the same host.
The QR uses a host-to-queue mapping table.
Each queue holds URLs from a single host and follows the FIFO pattern.
The Queue Selector (QS) maps each worker to a FIFO queue; the worker thread downloads web pages one by one from the same host, as sketched below.
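A minimal sketch of this politeness mechanism, assuming a simple hash-based host-to-queue assignment (the real mapping table may be maintained differently):

```python
from collections import deque
from urllib.parse import urlparse

class UrlFrontier:
    """Sketch of the politeness layer: the queue router keeps a host-to-queue
    mapping table so each FIFO queue only holds URLs from one host, and each
    worker (via the queue selector) drains a single queue, one URL at a time."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}        # mapping table used by the queue router

    def add_url(self, url):
        host = urlparse(url).netloc
        if host not in self.host_to_queue:
            # naive assignment: hash the host onto one of the queues
            self.host_to_queue[host] = hash(host) % len(self.queues)
        self.queues[self.host_to_queue[host]].append(url)

    def next_url(self, worker_id):
        """Queue selector: worker `worker_id` is bound to one queue (FIFO)."""
        queue = self.queues[worker_id % len(self.queues)]
        return queue.popleft() if queue else None
```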
Scalability to handle billions of web pages (URL Frontier)
The purpose is to prioritize URLs and to stay extremely efficient through parallelization.
Front queues manage prioritization;
Back queues manage the volume of requests sent to each host (see the sketch below).
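A rough sketch of the prioritization layer. The priority levels and the power-of-two weighting are illustrative assumptions; in practice the priority could be derived from signals such as PageRank, traffic, or update frequency.

```python
import random
from collections import deque

class FrontQueues:
    """Sketch of the front queues: a prioritizer assigns each URL a priority
    and places it in the matching queue; the selector picks high-priority
    queues more often (biased random choice)."""

    def __init__(self, num_priorities=3):
        self.queues = [deque() for _ in range(num_priorities)]
        # higher weight for higher-priority queues (queue 0 = top priority)
        self.weights = [2 ** (num_priorities - i) for i in range(num_priorities)]

    def add_url(self, url, priority):
        self.queues[priority].append(url)

    def next_url(self):
        # pick a non-empty queue, biased toward high priority
        candidates = [i for i, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        chosen = random.choices(candidates,
                                weights=[self.weights[i] for i in candidates])[0]
        return self.queues[chosen].popleft()
```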
Performance optimization (HTML Downloader & DNS Resolver)
The purpose is to make the HTML Downloader and DNS Resolver performant and robust, able to handle failures and edge cases (crashes, malicious links, etc.).
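As an example of one such optimization, a small DNS cache can avoid repeated synchronous lookups for hosts that are crawled often; the 300-second TTL below is an assumption.

```python
import socket
import time

class CachingDnsResolver:
    """Sketch of a DNS cache: synchronous DNS lookups can become a bottleneck
    for the HTML downloader, so resolved addresses are kept in memory and
    refreshed only after a TTL."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.cache = {}                      # host -> (ip_address, expiry_time)

    def resolve(self, host):
        entry = self.cache.get(host)
        if entry and entry[1] > time.time():
            return entry[0]                  # cache hit, skip the network call
        ip_address = socket.gethostbyname(host)
        self.cache[host] = (ip_address, time.time() + self.ttl)
        return ip_address
```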
STEP 4 - Pros & Cons
Pros
To distribute the load, consistent hashing can be used; it makes it easy to add or remove a downloader server (see the sketch after this list).
Further optimization is possible; the design behaves like a throttling pattern (limiting the request rate per host).
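A minimal sketch of a consistent hash ring for assigning URLs (or hosts) to downloader servers; the number of virtual nodes and the MD5 hash are illustrative choices, not requirements of the design.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of consistent hashing: adding or removing a downloader server
    only remaps the keys that fall on its portion of the ring, not the whole
    key space."""

    def __init__(self, servers=(), replicas=100):
        self.replicas = replicas             # virtual nodes per server
        self.ring = []                       # sorted list of (hash, server)
        for server in servers:
            self.add_server(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def remove_server(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get_server(self, url):
        if not self.ring:
            return None
        idx = bisect.bisect(self.ring, (self._hash(url), ""))
        return self.ring[idx % len(self.ring)][1]
```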
Cons
Redundant content (nearly 30% of web pages are duplicates).
A web page could trap the crawler in an infinite loop (a spider trap).
Some content has no value (spam URLs, advertisements, etc.).
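These drawbacks are typically mitigated with content fingerprinting and spider-trap heuristics. The sketch below uses an exact SHA-256 hash and simple URL-shape checks; both are simplifying assumptions (real systems often use similarity hashes and richer heuristics).

```python
import hashlib

def is_duplicate(page_html, seen_hashes):
    """Skip content whose fingerprint was already seen (exact-hash sketch)."""
    fingerprint = hashlib.sha256(page_html.encode("utf-8")).hexdigest()
    if fingerprint in seen_hashes:
        return True
    seen_hashes.add(fingerprint)
    return False

def looks_like_spider_trap(url, max_length=200, max_depth=15):
    """Very rough heuristic against infinite loops: reject URLs that are
    unusually long or deeply nested (thresholds are illustrative)."""
    return len(url) > max_length or url.count("/") > max_depth
```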