Introduction
A web crawler is also known as a robot or spider. It is usually used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.
...
Search engine indexing (SEI): create a local index for search engines.
Web archiving (WA): collect information to preserve data for future use.
Web mining (WMI): extract useful knowledge (data mining) from the internet.
Web monitoring (WMO): monitor copyright and trademark infringements over the internet.
STEP 1 - Understand the problem and establish the scope
| Question | Answer |
|---|---|
| Main purpose: SEI, WA, …? | Web crawler for SEI |
| How many pages to collect per month? | 1 billion pages |
| What content types? HTML, PDF, etc. | HTML only |
| Target: new content or updated content? | New and updated content |
| Store HTML pages? How long? | Yes, up to 5 years |
| How is duplicate content handled? | Duplicate content is ignored |
Back-of-the-envelope estimation: see the page References & Glossary.
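Even without that page, a quick sanity check follows from the requirements above (the ~500 KB average page size is an assumption, not a stated requirement):

1,000,000,000 pages / (30 days × 24 h × 3,600 s) ≈ 400 pages per second on average; a common rule of thumb puts peak QPS at about 2 × 400 = 800.
1,000,000,000 pages × 500 KB ≈ 500 TB of new content per month, i.e. roughly 30 PB over the 5-year retention period.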
STEP 2 - High-level design
...
For a glossary explaining the boxes, see the page References & Glossary.
Web Crawler Workflow
...
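To make the workflow concrete, here is a minimal single-worker sketch. It is only a sketch under assumptions: fetch_html, extract_links, and store_page are hypothetical helpers standing in for the HTML Downloader, Content Parser/Link Extractor, and Content Storage boxes.

```python
import hashlib
from collections import deque

def crawl(seed_urls, fetch_html, extract_links, store_page, max_pages=1000):
    """Single-worker crawl loop over the high-level components."""
    frontier = deque(seed_urls)      # URL Frontier (FIFO => BFS order)
    seen_urls = set(seed_urls)       # "URL Seen?" check
    seen_content = set()             # "Content Seen?" check (dedup)
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        html = fetch_html(url)       # HTML Downloader (incl. DNS resolution)
        if html is None:
            continue                 # download failed: skip the URL
        digest = hashlib.sha256(html.encode()).hexdigest()
        if digest in seen_content:
            continue                 # duplicate content is ignored
        seen_content.add(digest)
        store_page(url, html)        # Content Storage
        pages += 1
        for link in extract_links(url, html):  # Parser + Link Extractor
            if link not in seen_urls:          # URL Filter / "URL Seen?"
                seen_urls.add(link)
                frontier.append(link)
    return pages
```

Hashing the page body covers the "duplicate content is ignored" requirement, and the FIFO frontier gives BFS ordering, which Step 3 refines into per-host queues.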
STEP 3 - Design deep dive
Starting point: choose the right search algorithm
...
The Queue Router (QR) ensures that each queue (Q1, Q2, …, Qn) only contains URLs from the same host.
QR uses a host-to-queue mapping table.
Each queue is FIFO, so URLs from one host are crawled in arrival order.
Queue Selector (QS): each worker thread is mapped to one FIFO queue and downloads web pages one by one from the same host.
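Here is a hedged sketch of the QR/QS politeness scheme, assuming a simple hash-based host-to-queue mapping (the mapping table could equally be an explicit lookup); NUM_QUEUES and the delay are illustrative values:

```python
import time
import zlib
from queue import Queue
from urllib.parse import urlparse

NUM_QUEUES = 8  # number of politeness queues (Q1..Qn); an assumption

host_queues = [Queue() for _ in range(NUM_QUEUES)]

def route(url):
    """Queue Router (QR): the same host always maps to the same queue,
    so a given host's URLs never spread across queues."""
    host = urlparse(url).netloc
    host_queues[zlib.crc32(host.encode()) % NUM_QUEUES].put(url)

def worker(queue_id, download, delay_s=1.0):
    """Queue Selector (QS) side: one worker per FIFO queue downloads
    pages one by one, pausing between requests to stay polite."""
    q = host_queues[queue_id]
    while True:
        url = q.get()
        download(url)        # assumed HTML-downloader helper
        q.task_done()
        time.sleep(delay_s)  # politeness delay before the next request
```

Because the mapping is per host, a worker never hits the same host twice without the politeness delay in between, even when several hosts share one queue.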
...
The purpose is to make the system robust and able to handle the basic failure cases (crashes, malicious links, etc.).
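For instance, defensive exception handling around the downloader keeps a slow, oversized, or malicious URL from taking a worker down. A minimal sketch assuming the requests library; the retry count, timeout, and size cap are illustrative assumptions:

```python
import requests

def safe_download(url, retries=3, timeout_s=5, max_bytes=2_000_000):
    """Download a page defensively: bound time and size, retry transient
    failures, and never let one bad URL crash the worker."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            if len(resp.content) > max_bytes:
                return None  # oversized/suspicious page: skip it
            return resp.text
        except requests.RequestException:
            continue  # network error or bad status: retry, then give up
    return None
```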
...
STEP 4 - Pros & Cons
Pros
To distribute the load, we can use consistent hashing, which lets us add or remove downloader servers with minimal reassignment of work.
Further optimization is possible: rate-limiting the crawl (as the per-host queues already do) is essentially a throttling pattern.
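As a sketch of that idea, a consistent-hash ring with virtual nodes lets us add or remove a downloader server while remapping only the URLs adjacent to its ring positions; the class and parameter names here are assumptions:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; adding or removing a downloader
    server only remaps the keys nearest its ring positions."""

    def __init__(self, replicas=100):
        self.replicas = replicas   # virtual nodes per server (assumption)
        self.ring = []             # sorted list of (hash, server) points

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def server_for(self, url):
        """Pick the first ring point clockwise from the URL's hash."""
        if not self.ring:
            raise ValueError("no servers in the ring")
        i = bisect.bisect_left(self.ring, (self._hash(url),))
        if i == len(self.ring):
            i = 0                  # wrap around the ring
        return self.ring[i][1]
```

With this scheme, removing "downloader-2" only remaps the URLs that were assigned to its virtual nodes; every other URL keeps its server, which is exactly the property that makes scaling the downloader fleet cheap.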
...