Tip: The framework proposed in this space (Alex Xu) is applied to propose this design: Getting started - a framework to propose...
Introduction
A web crawler is also known as a robot or spider. It is typically used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.
Purposes:
- Search engine indexing (SEI): create a local index for search engines.
- Web archiving (WA): collect information to preserve data for future use.
- Web mining (WMI): mine data and extract useful knowledge from the internet.
- Web monitoring (WMO): monitor copyright and trademark infringements across the internet.
STEP 1 - Understand the problem and establish the scope
Question | Answer |
---|---|
Main purpose: SEI, WA, …? | Web crawler for SEI |
How many pages to collect per month? | 1 billion pages |
What content types? HTML, PDF, etc. | HTML only |
Target: new content or updated content? | New and updated content |
Store HTML pages? How long? | Yes, up to 5 years |
Duplicate content to handle? | Duplicate content is ignored |
STEP 2 - High-level design

Glossary explaining the boxes
See the page: References & Glossary.
Web Crawler Workflow

STEP 3 - Design deep dive
Starting point : Choose the right search algorithm
DFS (depth-first search) or BFS (breadth-first search)? BFS is commonly used by web crawlers and relies on a FIFO (first-in-first-out) queue, as sketched below.

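A minimal sketch of BFS crawling with a FIFO queue, assuming a hypothetical `fetch_links(url)` helper that downloads a page and returns its outgoing links; the names and structure are illustrative, not part of the original design.

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl: URLs are processed in FIFO order.

    `fetch_links(url)` is an assumed helper that downloads the page
    and returns the URLs it links to.
    """
    queue = deque(seed_urls)          # FIFO: first-in, first-out
    visited = set(seed_urls)
    crawled = []

    while queue and len(crawled) < max_pages:
        url = queue.popleft()         # take the oldest URL first (BFS)
        crawled.append(url)
        for link in fetch_links(url):
            if link not in visited:   # skip URLs already seen
                visited.add(link)
                queue.append(link)    # enqueue at the tail of the queue
    return crawled
```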
Handle too many requests (Url Frontier)
The purpose is to regulate the rate of visits to the same server within a short period (politeness), so no host is overwhelmed by too many requests; see the sketch below.

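A minimal sketch of this politeness idea, assuming each host gets its own FIFO queue and a minimum delay between downloads from that host; the names (`PolitenessScheduler`, `min_delay`) are illustrative assumptions, not part of the original design.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PolitenessScheduler:
    """One FIFO queue per host, plus a minimum delay between visits to that host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay            # seconds between hits on the same host
        self.queues = defaultdict(deque)      # host -> FIFO queue of URLs
        self.last_visit = defaultdict(float)  # host -> timestamp of the last download

    def add(self, url):
        host = urlparse(url).netloc
        self.queues[host].append(url)

    def next_url(self):
        """Return a URL whose host has not been visited too recently, or None."""
        now = time.time()
        for host, queue in self.queues.items():
            if queue and now - self.last_visit[host] >= self.min_delay:
                self.last_visit[host] = now
                return queue.popleft()
        return None                           # every ready host is still cooling down
```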
Scalability to handle billions of web pages (Url Frontier)
The purpose is to prioritize URLs and to stay extremely efficient by parallelizing the crawl across workers; see the sketch below.
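A minimal sketch of prioritization plus parallelization, assuming a hypothetical `usefulness(url)` scoring function and a hash-based sharding of hosts across workers; both are illustrative assumptions, not part of the original design.

```python
import hashlib
import heapq
from urllib.parse import urlparse

class PrioritizedFrontier:
    """URL frontier that pops the most useful URL first.

    `usefulness(url)` is an assumed scoring function (higher = more important);
    heapq is a min-heap, so scores are negated to pop the highest score first.
    """

    def __init__(self, usefulness):
        self.usefulness = usefulness
        self.heap = []
        self.counter = 0              # tie-breaker keeps insertion order stable

    def add(self, url):
        score = self.usefulness(url)
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1

    def pop(self):
        return heapq.heappop(self.heap)[2] if self.heap else None

def worker_for(url, num_workers):
    """Shard URLs across parallel workers by host, so one host stays on one worker."""
    host = urlparse(url).netloc.encode()
    return int(hashlib.md5(host).hexdigest(), 16) % num_workers
```

Keeping all URLs of a host on the same worker lets the crawl run in parallel while still respecting the per-host politeness constraint described above.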
Performance optimization (HTML Downloader)
The purpose is to provide system robustness and to handle the bad cases (crashes, hangs, malicious links, etc.), for example with short timeouts and bounded retries; see the sketch below.
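A minimal sketch of a robust download step, assuming a short timeout and a bounded number of retries; the function name and parameter values are illustrative assumptions.

```python
import urllib.request
from urllib.error import URLError

def download(url, timeout=5, retries=2):
    """Fetch a page with a short timeout and a few retries.

    A short timeout avoids waiting on slow or unresponsive servers;
    bounded retries keep one bad link from stalling the crawler.
    """
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (URLError, TimeoutError):
            if attempt == retries:
                return None           # give up: record the failure and move on
    return None
```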