
The framework proposed in this space (Alex Xu) is applied to this design: Getting started - a framework to propose...

Introduction

A web crawler, also known as a robot or spider, is typically used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.

Purposes:

  • Search engine indexing (SEI): create a local index for search engines.

  • Web archiving (WA): collect information from the web to preserve data for future use.

  • Web mining (WMI): mine data and useful knowledge from the Internet.

  • Web monitoring (WMO): monitor the Internet for copyright and trademark infringements.

STEP 1 - Understand the problem and establish the scope

  • Main purpose: SEI, WA, …? → A web crawler for SEI.

  • How many pages to collect per month? → 1 billion pages.

  • Which content types (HTML, PDF, etc.)? → HTML only.

  • Target: new content or updated content? → Both new and updated content.

  • Store the HTML pages? For how long? → Yes, up to 5 years.

  • How to handle duplicate content? → Duplicate content is ignored.
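
These figures allow a rough back-of-the-envelope estimate. The sketch below assumes an average page size of 500 KB, which is an assumption and not part of the requirements above:

    # Back-of-the-envelope estimate (500 KB average page size is an assumption).
    PAGES_PER_MONTH = 1_000_000_000
    SECONDS_PER_MONTH = 30 * 24 * 3600            # ~2.6 million seconds

    qps = PAGES_PER_MONTH / SECONDS_PER_MONTH     # ~385 pages per second on average
    peak_qps = 2 * qps                            # rule of thumb: peak = 2x average

    AVG_PAGE_SIZE = 500 * 1024                    # 500 KB per HTML page (assumption)
    storage_per_month = PAGES_PER_MONTH * AVG_PAGE_SIZE   # ~512 TB per month
    storage_5_years = storage_per_month * 12 * 5          # ~31 PB over 5 years

    print(f"average QPS: {qps:.0f}, peak QPS: {peak_qps:.0f}")
    print(f"storage: {storage_per_month / 1e12:.0f} TB/month, "
          f"{storage_5_years / 1e15:.1f} PB over 5 years")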

STEP 2 - High-level design

Glossary explaining the boxes

See the page: References & Glossary.

Web Crawler Workflow
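
Since the workflow diagram is not reproduced here, the sketch below walks through one plausible version of the loop between the main boxes (URL Frontier, HTML Downloader, parser, URL extractor). All stubs and names are placeholders for illustration, not the actual components:

    import re
    from collections import deque

    # In-memory stand-ins for the real components (all names are placeholders).
    frontier = deque(["https://example.com"])       # URL Frontier, FIFO here
    content_seen, url_seen, storage = set(), set(), {}

    def download(url):                              # HTML Downloader stub
        return "<html><a href='https://example.com/a'>a</a></html>"

    def extract_links(html):                        # URL Extractor stub
        return re.findall(r"href='([^']+)'", html)

    while frontier:
        url = frontier.popleft()                    # 1. next URL from the frontier
        html = download(url)                        # 2. fetch the page
        fingerprint = hash(html)                    # 3. content fingerprint:
        if fingerprint in content_seen:             #    skip duplicate content
            continue
        content_seen.add(fingerprint)
        storage[url] = html                         # 4. store the HTML page
        for link in extract_links(html):            # 5. extract outgoing links
            if link not in url_seen:                # 6. enqueue unseen URLs only
                url_seen.add(link)
                frontier.append(link)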

STEP 3 - Design deep dive

Starting point: choose the right search algorithm

DFS (depth-first search) or BFS (breadth-first search)? BFS is commonly used by web crawlers and relies on a FIFO (first-in, first-out) queue.
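
A minimal sketch of BFS over a toy link graph (the graph and ordering function are made up for illustration); swapping popleft() for pop() would turn the FIFO queue into a LIFO stack, i.e. DFS:

    from collections import deque

    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}   # toy link graph

    def bfs(seed):
        queue, seen, order = deque([seed]), {seed}, []
        while queue:
            url = queue.popleft()            # FIFO: oldest discovered URL first
            order.append(url)
            for link in graph[url]:          # links found on the page
                if link not in seen:         # visit each URL only once
                    seen.add(link)
                    queue.append(link)
        return order

    print(bfs("A"))                          # ['A', 'B', 'C', 'D']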

Handle too many requests (URL Frontier)

The purpose is to regulate the rate of visits to a given host, so that no server receives too many requests within a short period (politeness).

  • The Queue Router (QR) ensures that each queue (Q1, Q2, …, Qn) only contains URLs from the same host; see the sketch after this list.

  • The QR uses a mapping table (see it here).
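
A minimal sketch of how such a mapping table could work; the names (route, mapping_table) are assumptions, not the actual design. A download worker would then consume one queue at a time, pausing between requests to the same host:

    from collections import deque
    from urllib.parse import urlparse

    # Queue Router sketch: the mapping table assigns each host its own queue,
    # so a queue never mixes URLs from different hosts.
    mapping_table = {}                          # host -> dedicated FIFO queue

    def route(url):
        host = urlparse(url).netloc
        queue = mapping_table.setdefault(host, deque())
        queue.append(url)

    route("https://en.wikipedia.org/wiki/A")
    route("https://en.wikipedia.org/wiki/B")    # same host -> same queue
    route("https://www.apache.org/")            # new host -> new queue
    print({host: list(q) for host, q in mapping_table.items()})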

Scalability to handle billions of web pages (URL Frontier)

The purpose is to prioritize URLs and to be extremely efficient by parallelizing the crawl.
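
A minimal sketch of a prioritized frontier; the scoring function is a made-up placeholder, and in the full design many such queues would be consumed by parallel download workers rather than a single heap:

    import heapq

    frontier = []                                # min-heap of (-priority, url)

    def priority(url):                           # placeholder scoring function,
        return 10 if "wikipedia" in url else 1   # e.g. favour high-value sites

    def push(url):
        heapq.heappush(frontier, (-priority(url), url))

    def pop():
        return heapq.heappop(frontier)[1]        # highest-priority URL first

    push("https://example.com/blog")
    push("https://en.wikipedia.org/wiki/Web_crawler")
    print(pop())                                 # the wikipedia URL comes out first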

Performance optimization (HTML Downloader)

The purpose is to make the system robust and able to handle bad cases (crashes, malicious links, etc.).
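
A minimal sketch of a defensive downloader, assuming a short timeout, a bounded retry count, and a response size cap; all constants and the function name are assumptions for illustration:

    import urllib.request
    from urllib.error import URLError

    MAX_RETRIES = 3                                # bounded retries (assumption)
    TIMEOUT_S = 5                                  # short timeout (assumption)
    MAX_BYTES = 2 * 1024 * 1024                    # 2 MB size cap (assumption)

    def download(url):
        for attempt in range(MAX_RETRIES):
            try:
                with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
                    return resp.read(MAX_BYTES)    # never read more than the cap
            except URLError:                       # DNS failure, refused connection...
                continue                           # retry a limited number of times
            except Exception:                      # malformed or malicious response
                return None                        # give up on clearly broken pages
        return None                                # exhausted retries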

STEP 4 - Pros & Cons
