How to design a Web Crawler

STEP 1 - Understand the problem and establish the scope

Main purpose: SEI, WA, …?	Web crawler for SEI.
How many pages to collect per month?	1 billion pages.
What content types? HTML, PDF, etc.	HTML only.
Target: new content or updated content?	New and updated content.
Store HTML pages? How long?	Yes, up to 5 years.
Duplicate content to handle?	Duplicate content is ignored.
Back-of-the-envelope estimation: see the page "References & Glossary".
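
As a rough sketch of such an estimation, the figures from the scope above (1 billion pages per month, kept for up to 5 years) can be turned into a download rate and a storage volume. The average page size of 500 KB and the 30-day month are assumptions made for illustration; they are not part of the stated requirements.

```python
# Back-of-the-envelope estimation for the crawler (illustrative only).
PAGES_PER_MONTH = 1_000_000_000      # from the scope: 1 billion pages per month
RETENTION_YEARS = 5                  # from the scope: stored up to 5 years
AVG_PAGE_SIZE_KB = 500               # assumption: average HTML page size
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day month


def estimate():
    """Return (pages per second, TB stored per month, PB stored in total)."""
    pages_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
    # 1 TB = 1,000,000,000 KB (decimal units)
    tb_per_month = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1_000_000_000
    pb_total = tb_per_month * 12 * RETENTION_YEARS / 1000
    return pages_per_second, tb_per_month, pb_total


pages_per_second, tb_per_month, pb_total = estimate()
print(f"Average download rate: {pages_per_second:.0f} pages/s")
print(f"Storage per month:     {tb_per_month:.0f} TB")
print(f"Storage over 5 years:  {pb_total:.0f} PB")
```

Under these assumptions the crawler needs to sustain roughly 386 pages per second on average and about 30 PB of storage over the retention period.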

DFS (depth-first search) or BFS (breadth-first search): BFS is commonly used by web crawlers and relies on a FIFO (first-in, first-out) queue.
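
A minimal sketch of a BFS traversal driven by a FIFO queue is shown here. The `LINKS` dictionary is a hypothetical in-memory link graph standing in for real page downloads, and `bfs_crawl` and `get_links` are illustrative names.

```python
from collections import deque

# Hypothetical link graph: page URL -> URLs found on that page.
LINKS = {
    "a.com/": ["a.com/1", "b.com/"],
    "a.com/1": ["a.com/2", "a.com/"],
    "b.com/": ["b.com/1"],
    "a.com/2": [],
    "b.com/1": ["a.com/"],
}


def bfs_crawl(seed, get_links):
    """Visit pages in breadth-first order, starting from the seed URL."""
    frontier = deque([seed])   # FIFO queue of URLs waiting to be visited
    seen = {seed}              # URLs already queued, so duplicates are skipped
    order = []
    while frontier:
        url = frontier.popleft()       # first in, first out
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # new links go to the back of the queue
    return order


order = bfs_crawl("a.com/", lambda url: LINKS.get(url, []))
print(order)
```

Because new links are appended to the back of the queue, every page at one link distance from the seed is visited before any page at the next distance.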

The purpose is to regulate the rate of visits to the same server within a short period.

Queue Router (QR): ensures that each queue (Q1, Q2, ..., Qn) only contains URLs from the same host.
QR uses a mapping table that maps each host to its queue.
Each queue is a FIFO queue holding URLs from a single host.
Queue Selector (QS): each worker thread is mapped to a FIFO queue and downloads web pages one by one from that queue, and therefore from the same host.
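
The routing described above can be sketched as follows. This is a simplified, single-threaded illustration: the class name `QueueRouter`, the `worker` function, the `delay` parameter and the example URLs are all hypothetical, and a real crawler would run one worker thread per queue.

```python
import time
from collections import deque
from urllib.parse import urlsplit


class QueueRouter:
    """Routes each URL to the FIFO queue dedicated to its host."""

    def __init__(self):
        self.mapping = {}   # mapping table: host -> queue index
        self.queues = []    # Q1, Q2, ..., Qn (one FIFO queue per host)

    def route(self, url):
        host = urlsplit(url).netloc
        if host not in self.mapping:
            # First URL seen for this host: create its queue.
            self.mapping[host] = len(self.queues)
            self.queues.append(deque())
        self.queues[self.mapping[host]].append(url)


def worker(queue, download, delay=0.0):
    """Download the URLs of one queue one by one, pausing between visits."""
    while queue:
        download(queue.popleft())
        time.sleep(delay)   # assumption: a fixed pause spaces out visits to the host


router = QueueRouter()
for url in ["https://a.example/1", "https://b.example/1", "https://a.example/2"]:
    router.route(url)

downloaded = []
for queue in router.queues:   # stand-in for the queue selector assigning workers
    worker(queue, downloaded.append)
print(router.mapping)
print(downloaded)
```

Since a queue only ever holds URLs of one host and is drained by a single worker, two pages of the same host are never downloaded at the same time.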

The purpose is to prioritize and to be highly efficient through parallelization.

The purpose is to make the system robust and able to handle common failure cases (crashes, malicious links, etc.).