Tip

The framework proposed in this space (Alex Xu) is applied here to propose a design: see Getting started - a framework to propose...

Introduction

A web crawler is also known as a robot or spider. It is usually used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.

Purposes:

  • Search engine indexing (SEI): build a local index for search engines.

  • Web archiving (WA): collect information from the web to preserve data for future use.

  • Web mining (WMI): extract useful knowledge from the internet (data mining).

  • Web monitoring (WMO): monitor copyright and trademark infringements across the internet.


STEP 1 - Understand the problem and establish the scope

  • Main purpose: SEI, WA, …? → A web crawler for SEI (search engine indexing).

  • How many pages should be collected per month? → 1 billion pages.

  • Which content types? HTML, PDF, etc. → HTML only.

  • Target: new content or updated content? → Both new and updated content.

  • Should HTML pages be stored? For how long? → Yes, up to 5 years.

  • How should duplicate content be handled? → Duplicate content is ignored.
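From these numbers, a rough back-of-envelope estimation can be derived. The sketch below assumes an average HTML page size of 500 KB, which is an assumption and not part of the requirements above.

```python
# Back-of-envelope estimation (sketch); the 500 KB average page size is an assumption.
PAGES_PER_MONTH = 1_000_000_000        # 1 billion pages per month (requirement)
SECONDS_PER_MONTH = 30 * 24 * 3600     # ~2.6 million seconds in a month
AVG_PAGE_SIZE_BYTES = 500 * 1024       # assumed average HTML page size: 500 KB

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH   # ~385 pages/second on average
peak_qps = 2 * qps                          # rule of thumb: peak is about twice the average

storage_per_month_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_BYTES / 1e12  # ~500 TB per month
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1000            # ~30 PB over 5 years

print(f"Average QPS: {qps:.0f}, peak QPS: {peak_qps:.0f}")
print(f"Storage: ~{storage_per_month_tb:.0f} TB/month, ~{storage_5_years_pb:.0f} PB over 5 years")
```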

STEP 2 - High-level design

Glossary explaining the boxes

See the page: References & Glossary.

Web Crawler Workflow

STEP 3 - Design deep dive

Starting point: Choose the right search algorithm

DFS (depth-first search) or BFS (breadth-first search)? BFS is the approach commonly used by web crawlers; it relies on a FIFO (first-in, first-out) queue.
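To illustrate, here is a minimal BFS crawl loop built on a FIFO queue; the download and extract_links helpers are hypothetical placeholders for the HTML Downloader and link extractor components.

```python
from collections import deque

def bfs_crawl(seed_urls, download, extract_links, max_pages=1000):
    """Breadth-first crawl: URLs are processed in FIFO order (oldest first)."""
    frontier = deque(seed_urls)   # FIFO queue = BFS; a LIFO stack here would give DFS
    visited = set(seed_urls)
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()             # take the oldest URL first (FIFO)
        html = download(url)                 # HTML Downloader (placeholder)
        pages.append((url, html))
        for link in extract_links(html):     # link extractor (placeholder)
            if link not in visited:
                visited.add(link)
                frontier.append(link)        # newly discovered URLs go to the back
    return pages
```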

Handle too many requests (URL Frontier)

The purpose is to regulate the rate of visits to the same server, so that a host does not receive too many requests within a short period (politeness).

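A minimal sketch of this politeness constraint, assuming a per-host delay of one second (an arbitrary value chosen for illustration): each host keeps a last-visit timestamp, and a request is postponed until the delay has elapsed.

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Regulates the visit rate per host: at most one request every `delay_seconds`."""
    def __init__(self, delay_seconds=1.0):      # 1 second is an illustrative value
        self.delay = delay_seconds
        self.last_visit = {}                    # host -> timestamp of the last request

    def wait_before(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        next_allowed = self.last_visit.get(host, 0.0) + self.delay
        if now < next_allowed:
            time.sleep(next_allowed - now)      # throttle requests to the same host
        self.last_visit[host] = time.monotonic()
```

In the full design, the URL frontier typically achieves the same effect with one FIFO queue per host and a mapping table from host to queue, so a single worker never hammers the same server.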

Scalability to handle billions of web pages (URL Frontier)

The purpose is to prioritize URLs and to remain highly efficient at scale by distributing the crawl across many workers (parallelization).
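One common way to combine prioritization with parallel workers is to keep several FIFO queues, one per priority level, and let each worker pull from the highest-priority non-empty queue. The sketch below assumes three priority levels; the class and parameter names are illustrative, not a definitive implementation.

```python
from collections import deque

class PrioritizedFrontier:
    """URL frontier with a fixed number of priority levels (0 = highest)."""
    def __init__(self, levels=3):                # 3 levels is an illustrative choice
        self.queues = [deque() for _ in range(levels)]

    def push(self, url, priority):
        # Clamp unknown priorities into the lowest queue.
        self.queues[min(priority, len(self.queues) - 1)].append(url)

    def pop(self):
        for queue in self.queues:                # serve the highest priority first
            if queue:
                return queue.popleft()
        return None                              # frontier is empty

# For scale, URLs can be partitioned across crawler workers (e.g., by hashing the host),
# so that billions of pages are spread over many frontier instances running in parallel.
```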

Performance optimization (HTML Downloader)

The purpose is to optimize performance and to make the system robust, so that it can handle failure cases gracefully (crashes, malicious links, etc.).
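As an illustration of robustness on the downloader side, the sketch below assumes the Python requests library and shows a short timeout, limited retries, and exception handling so that one slow or malicious link cannot stall a worker; the timeout and retry values are assumptions.

```python
import requests

def download_html(url, timeout_seconds=3, max_retries=2):
    """Fetch a page defensively: short timeout, limited retries, never raise to the caller."""
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=timeout_seconds)
            response.raise_for_status()
            if "text/html" in response.headers.get("Content-Type", ""):
                return response.text            # requirement: HTML content only
            return None                         # skip non-HTML content
        except requests.RequestException:
            if attempt == max_retries:
                return None                     # give up on this URL instead of crashing
    return None
```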

STEP 4 - Pros & Cons