Tip

The framework proposed in this space (Alex Xu) is applied here to propose a design: see Getting started - a framework to propose...

Introduction

A web crawler is also known as a robot or spider. It is usually used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.

Purposes:

  • Search engine indexing (SEI): build a local index for search engines.

  • Web archiving (WA): collect information from the web to preserve data for future use.

  • Web mining (WMI): extract useful knowledge from the internet (data mining).

  • Web monitoring (WMO): monitor copyright and trademark infringements across the internet.


STEP 1 - Understand the problem and establish the scope

  • Main purpose: SEI, WA, …? → A web crawler for SEI (search engine indexing).

  • How many pages should be collected per month? → 1 billion pages.

  • Which content types? HTML, PDF, etc. → HTML only.

  • Target: new content or updated content? → Both new and updated content.

  • Should HTML pages be stored? For how long? → Yes, up to 5 years.

  • How should duplicate content be handled? → Duplicate content is ignored.
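From these numbers, a rough back-of-envelope estimation can be derived. The sketch below assumes an average HTML page size of 500 KB, which is an assumption and not part of the requirements above.

```python
# Back-of-envelope estimation (sketch); the 500 KB average page size is an assumption.
PAGES_PER_MONTH = 1_000_000_000        # 1 billion pages per month (requirement)
SECONDS_PER_MONTH = 30 * 24 * 3600     # ~2.6 million seconds in a month
AVG_PAGE_SIZE_BYTES = 500 * 1024       # assumed average HTML page size: 500 KB

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH   # ~385 pages/second on average
peak_qps = 2 * qps                          # rule of thumb: peak is about twice the average

storage_per_month_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_BYTES / 1e12  # ~500 TB per month
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1000            # ~30 PB over 5 years

print(f"Average QPS: {qps:.0f}, peak QPS: {peak_qps:.0f}")
print(f"Storage: ~{storage_per_month_tb:.0f} TB/month, ~{storage_5_years_pb:.0f} PB over 5 years")
```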

STEP 2 - High-level design

Glossary explaining the boxes

See the page: References & Glossary.

Web Crawler Workflow

STEP 3 - Design deep dive

Starting point: Choose the right search algorithm

DFS (depth-first search) or BFS (breadth-first search)? BFS is the approach commonly used by web crawlers; it relies on a FIFO (first-in, first-out) queue.
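To illustrate, here is a minimal BFS crawl loop built on a FIFO queue; the download and extract_links helpers are hypothetical placeholders for the HTML Downloader and link extractor components.

```python
from collections import deque

def bfs_crawl(seed_urls, download, extract_links, max_pages=1000):
    """Breadth-first crawl: URLs are processed in FIFO order (oldest first)."""
    frontier = deque(seed_urls)   # FIFO queue = BFS; a LIFO stack here would give DFS
    visited = set(seed_urls)
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()             # take the oldest URL first (FIFO)
        html = download(url)                 # HTML Downloader (placeholder)
        pages.append((url, html))
        for link in extract_links(html):     # link extractor (placeholder)
            if link not in visited:
                visited.add(link)
                frontier.append(link)        # newly discovered URLs go to the back
    return pages
```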

Handle too many requests (URL Frontier)

The purpose is to regulate the rate of visits to the same server, so that a host does not receive too many requests within a short period (politeness).

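A minimal sketch of this politeness constraint, assuming a per-host delay of one second (an arbitrary value chosen for illustration): each host keeps a last-visit timestamp, and a request is postponed until the delay has elapsed.

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Regulates the visit rate per host: at most one request every `delay_seconds`."""
    def __init__(self, delay_seconds=1.0):      # 1 second is an illustrative value
        self.delay = delay_seconds
        self.last_visit = {}                    # host -> timestamp of the last request

    def wait_before(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        next_allowed = self.last_visit.get(host, 0.0) + self.delay
        if now < next_allowed:
            time.sleep(next_allowed - now)      # throttle requests to the same host
        self.last_visit[host] = time.monotonic()
```

In the full design, the URL frontier typically achieves the same effect with one FIFO queue per host and a mapping table from host to queue, so a single worker never hammers the same server.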

Scalability to handle billions of web pages (URL Frontier)

The purpose is to prioritize URLs and to remain highly efficient at scale by distributing the crawl across many workers (parallelization).
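One common way to combine prioritization with parallel workers is to keep several FIFO queues, one per priority level, and let each worker pull from the highest-priority non-empty queue. The sketch below assumes three priority levels; the class and parameter names are illustrative, not a definitive implementation.

```python
from collections import deque

class PrioritizedFrontier:
    """URL frontier with a fixed number of priority levels (0 = highest)."""
    def __init__(self, levels=3):                # 3 levels is an illustrative choice
        self.queues = [deque() for _ in range(levels)]

    def push(self, url, priority):
        # Clamp unknown priorities into the lowest queue.
        self.queues[min(priority, len(self.queues) - 1)].append(url)

    def pop(self):
        for queue in self.queues:                # serve the highest priority first
            if queue:
                return queue.popleft()
        return None                              # frontier is empty

# For scale, URLs can be partitioned across crawler workers (e.g., by hashing the host),
# so that billions of pages are spread over many frontier instances running in parallel.
```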

Performance optimization (HTML Downloader)

The purpose is to optimize performance and to make the system robust, so that it can handle failure cases gracefully (crashes, malicious links, etc.).
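As an illustration of robustness on the downloader side, the sketch below assumes the Python requests library and shows a short timeout, limited retries, and exception handling so that one slow or malicious link cannot stall a worker; the timeout and retry values are assumptions.

```python
import requests

def download_html(url, timeout_seconds=3, max_retries=2):
    """Fetch a page defensively: short timeout, limited retries, never raise to the caller."""
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=timeout_seconds)
            response.raise_for_status()
            if "text/html" in response.headers.get("Content-Type", ""):
                return response.text            # requirement: HTML content only
            return None                         # skip non-HTML content
        except requests.RequestException:
            if attempt == max_retries:
                return None                     # give up on this URL instead of crashing
    return None
```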

STEP 4 - Pros & Cons