Introduction
A web crawler is also known as a robot or spider. It is usually used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.
...
Search engine indexing (SEI): create a local index for search engines.
Web archiving (WA): collect information to preserve data for future use.
Web mining (WMI): extract useful knowledge (data mining) from the internet.
Web monitoring (WMO): monitor copyright and trademark infringements over the internet.
STEP 1 - Understand the problem and establish the scope
| Question | Answer |
|---|---|
| Main purpose: SEI, WA, …? | Web crawler for SEI |
| How many pages to collect per month? | 1 billion pages |
| What content types? HTML, PDF, etc. | HTML only |
| Target: new content or updated content? | New and updated content |
| Store HTML pages? How long? | Yes, up to 5 years |
| How is duplicate content handled? | Duplicate content is ignored |
Back-of-the-envelope estimation: see the page References & Glossary.
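Even without that page, a quick sanity check follows from the requirements above (the ~500 KB average page size is an assumption, not a stated requirement):

1,000,000,000 pages / (30 days × 24 h × 3,600 s) ≈ 400 pages per second on average; a common rule of thumb puts peak QPS at about 2 × 400 = 800.
1,000,000,000 pages × 500 KB ≈ 500 TB of new content per month, i.e. roughly 30 PB over the 5-year retention period.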
STEP 2 - High-level design
...
For a glossary explaining the boxes, see the page References & Glossary.
Web Crawler Workflow
...
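To make the workflow concrete, here is a minimal single-worker sketch. It is only a sketch under assumptions: fetch_html, extract_links, and store_page are hypothetical helpers standing in for the HTML Downloader, Content Parser/Link Extractor, and Content Storage boxes.

```python
import hashlib
from collections import deque

def crawl(seed_urls, fetch_html, extract_links, store_page, max_pages=1000):
    """Single-worker crawl loop over the high-level components."""
    frontier = deque(seed_urls)      # URL Frontier (FIFO => BFS order)
    seen_urls = set(seed_urls)       # "URL Seen?" check
    seen_content = set()             # "Content Seen?" check (dedup)
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        html = fetch_html(url)       # HTML Downloader (incl. DNS resolution)
        if html is None:
            continue                 # download failed: skip the URL
        digest = hashlib.sha256(html.encode()).hexdigest()
        if digest in seen_content:
            continue                 # duplicate content is ignored
        seen_content.add(digest)
        store_page(url, html)        # Content Storage
        pages += 1
        for link in extract_links(url, html):  # Parser + Link Extractor
            if link not in seen_urls:          # URL Filter / "URL Seen?"
                seen_urls.add(link)
                frontier.append(link)
    return pages
```

Hashing the page body covers the "duplicate content is ignored" requirement, and the FIFO frontier gives BFS ordering, which Step 3 refines into per-host queues.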
STEP 3 - Design deep dive
Starting point: choose the right search algorithm
...
The Queue Router (QR) ensures that each queue (Q1, Q2, …, Qn) only contains URLs from the same host.
QR uses a host-to-queue mapping table.
Each queue is FIFO, so URLs from one host are crawled in arrival order.
Queue Selector (QS): each worker thread is mapped to one FIFO queue and downloads web pages one by one from the same host.
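Here is a hedged sketch of the QR/QS politeness scheme, assuming a simple hash-based host-to-queue mapping (the mapping table could equally be an explicit lookup); NUM_QUEUES and the delay are illustrative values:

```python
import time
import zlib
from queue import Queue
from urllib.parse import urlparse

NUM_QUEUES = 8  # number of politeness queues (Q1..Qn); an assumption

host_queues = [Queue() for _ in range(NUM_QUEUES)]

def route(url):
    """Queue Router (QR): the same host always maps to the same queue,
    so a given host's URLs never spread across queues."""
    host = urlparse(url).netloc
    host_queues[zlib.crc32(host.encode()) % NUM_QUEUES].put(url)

def worker(queue_id, download, delay_s=1.0):
    """Queue Selector (QS) side: one worker per FIFO queue downloads
    pages one by one, pausing between requests to stay polite."""
    q = host_queues[queue_id]
    while True:
        url = q.get()
        download(url)        # assumed HTML-downloader helper
        q.task_done()
        time.sleep(delay_s)  # politeness delay before the next request
```

Because the mapping is per host, a worker never hits the same host twice without the politeness delay in between, even when several hosts share one queue.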
...
The purpose is to make the system robust and able to handle the basic failure cases (crashes, malicious links, etc.).
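For instance, defensive exception handling around the downloader keeps a slow, oversized, or malicious URL from taking a worker down. A minimal sketch assuming the requests library; the retry count, timeout, and size cap are illustrative assumptions:

```python
import requests

def safe_download(url, retries=3, timeout_s=5, max_bytes=2_000_000):
    """Download a page defensively: bound time and size, retry transient
    failures, and never let one bad URL crash the worker."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            if len(resp.content) > max_bytes:
                return None  # oversized/suspicious page: skip it
            return resp.text
        except requests.RequestException:
            continue  # network error or bad status: retry, then give up
    return None
```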
...
STEP 4 - Pros & Cons
Pros
To distribute the load, we can use consistent hashing, which lets us add or remove downloader servers with minimal reassignment of work.
Further optimization is possible: rate-limiting the crawl (as the per-host queues already do) is essentially a throttling pattern.
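As a sketch of that idea, a consistent-hash ring with virtual nodes lets us add or remove a downloader server while remapping only the URLs adjacent to its ring positions; the class and parameter names here are assumptions:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; adding or removing a downloader
    server only remaps the keys nearest its ring positions."""

    def __init__(self, replicas=100):
        self.replicas = replicas   # virtual nodes per server (assumption)
        self.ring = []             # sorted list of (hash, server) points

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def server_for(self, url):
        """Pick the first ring point clockwise from the URL's hash."""
        if not self.ring:
            raise ValueError("no servers in the ring")
        i = bisect.bisect_left(self.ring, (self._hash(url),))
        if i == len(self.ring):
            i = 0                  # wrap around the ring
        return self.ring[i][1]
```

With this scheme, removing "downloader-2" only remaps the URLs that were assigned to its virtual nodes; every other URL keeps its server, which is exactly the property that makes scaling the downloader fleet cheap.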
...