Tip: The framework proposed in this space (Alex Xu) is applied here to propose the design: Getting started - a framework to propose...
Introduction
A web crawler, also known as a robot or spider, is typically used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc. A minimal sketch of the basic crawl loop follows the list of purposes below.
Purposes:
- Search engine indexing (SEI): build a local index for search engines.
- Web archiving (WA): collect information from the web to preserve data for future use.
- Web mining (WMI): mine data or useful knowledge from the internet.
- Web monitoring (WMO): monitor copyright and trademark infringements across the internet.
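To make this concrete, here is a minimal single-threaded sketch of the basic crawl loop: fetch a page, extract its links, and enqueue unseen URLs. This is an illustrative sketch only; the function and variable names, and the use of the `requests` and `BeautifulSoup` libraries, are assumptions rather than part of the design that follows.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                    # assumed dependency: pip install requests
from bs4 import BeautifulSoup      # assumed dependency: pip install beautifulsoup4


def crawl(seed_urls, max_pages=100):
    """BFS over the web graph: fetch a page, extract links, enqueue unseen URLs."""
    frontier = deque(seed_urls)    # URLs waiting to be downloaded
    seen = set(seed_urls)          # avoids re-crawling the same URL
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue               # skip unreachable pages
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue               # Step 1 requirement: HTML only
        pages[url] = resp.text
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

A real crawler replaces the in-memory queue and set with distributed components (the "boxes" of the high-level design in Step 2), but the fetch-parse-enqueue cycle stays the same.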
STEP 1 - Understand the problem and establish the scope
Question | Answer
---|---
Main purpose: SEI, WA, …? | Web crawler for search engine indexing (SEI)
How many pages to collect per month? | 1 billion pages
What content types? HTML, PDF, etc. | HTML only
Target: new content or updated content? | Both new and updated content
Store HTML pages? How long? | Yes, for up to 5 years
Duplicate content to handle? | Duplicate content is ignored
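These answers support a rough back-of-the-envelope estimation. The 1 billion pages/month and 5-year retention figures come from the table above; the 500 KB average page size and the 2x peak factor are assumptions introduced here for illustration only.

```python
# Back-of-the-envelope estimation from the Step 1 answers.
PAGES_PER_MONTH = 1_000_000_000        # from the requirements table
SECONDS_PER_MONTH = 30 * 24 * 3600
AVG_PAGE_SIZE_KB = 500                 # ASSUMPTION: average HTML page size
PEAK_FACTOR = 2                        # ASSUMPTION: peak load is 2x average

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH                     # ≈ 386 pages/second
peak_qps = PEAK_FACTOR * qps                                  # ≈ 772 pages/second
storage_month_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1e9   # ≈ 500 TB/month
storage_5y_pb = storage_month_tb * 12 * 5 / 1000              # ≈ 30 PB over 5 years

print(f"average ≈ {qps:.0f} QPS, peak ≈ {peak_qps:.0f} QPS")
print(f"storage ≈ {storage_month_tb:.0f} TB/month, ≈ {storage_5y_pb:.0f} PB over 5 years")
```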
STEP 2 - High-level design
[Figure: high-level design diagram]
Glossary explaining the boxes
See the page: References & Glossary.
Web Crawler Workflow
[Figure: web crawler workflow diagram]
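As one concrete illustration of the "duplicate content is ignored" requirement from Step 1, the sketch below shows a possible content-seen check that detects duplicates by hashing page bodies. Whether the workflow deduplicates exactly this way is an assumption; a production crawler would also keep the digests in a shared store rather than an in-memory set.

```python
import hashlib

# Sketch of a "content seen?" check: a page body is considered a duplicate
# if an identical body has already been crawled (ASSUMPTION: exact-match
# hashing; near-duplicate detection would need a different technique).
_seen_digests = set()

def is_duplicate(html: str) -> bool:
    """Return True if an identical page body was already crawled."""
    digest = hashlib.sha256(html.encode("utf-8")).digest()
    if digest in _seen_digests:
        return True
    _seen_digests.add(digest)
    return False
```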