
The framework proposed in this space (Alex Xu) is applied here to build the design: Getting started - a framework to propose...

Introduction

A web crawler, also known as a robot or spider, is typically used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc.

Purposes:

  • Search engine indexing (SEI): create a local index for search engines.

  • Web archiving (WA): collect information to preserve data for future use.

  • Web mining (WMI): mine data or extract useful knowledge from the internet.

  • Web monitoring (WMO): monitor copyright and trademark infringements over the internet.


STEP 1 - Understand the problem and establish the scope

Main purpose: SEI, WA, … ?

A web crawler for SEI.

How many pages to collect per month?

1 billion pages.

What content types? HTML, PDF, etc.

HTML only.

Target: new content or updated content?

Both new and updated content.

Store HTML pages? For how long?

Yes, for up to 5 years.

How should duplicate content be handled?

Duplicate content is to be ignored.

The back-of-the-envelope estimation is detailed in the page References & Glossary.
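As a rough sketch of that estimation from the requirements above (the 500 KB average page size is an assumption, not stated on this page):

```python
# Back-of-the-envelope sketch: 1 billion pages per month, stored for 5 years.
SECONDS_PER_MONTH = 30 * 24 * 3600          # ~2.6 million seconds
pages_per_month = 1_000_000_000

qps = pages_per_month / SECONDS_PER_MONTH   # ~385 pages per second on average
peak_qps = 2 * qps                          # rule of thumb: peak is ~2x the average

avg_page_size_kb = 500                      # assumption, not given in the requirements
storage_per_month_tb = pages_per_month * avg_page_size_kb / 1_000_000_000  # ~500 TB
storage_5_years_pb = storage_per_month_tb * 12 * 5 / 1_000                 # ~30 PB

print(f"Average QPS: {qps:.0f}, peak QPS: {peak_qps:.0f}")
print(f"Storage: {storage_per_month_tb:.0f} TB/month, {storage_5_years_pb:.0f} PB over 5 years")
```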

STEP 2 - High-level design

Glossary explaining the boxes of the diagram

See the page: References & Glossary.

Web Crawler Workflow

STEP 3 - Design deep dive

Starting point : Choose the right search algorithm

DFS (depth-first search) or BFS (breadth-first search)? BFS is commonly used by web crawlers and relies on a FIFO (first-in, first-out) queue.
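A minimal sketch of BFS crawling with a FIFO queue (the `fetch_page` and `extract_links` helpers are hypothetical and would be provided by the HTML Downloader and the content parser):

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    """Breadth-first crawl: URLs are processed in FIFO order."""
    queue = deque(seed_urls)      # FIFO frontier
    visited = set(seed_urls)
    pages = []

    while queue and len(pages) < max_pages:
        url = queue.popleft()                 # first in, first out
        html = fetch_page(url)                # download the page (injected dependency)
        if html is None:
            continue
        pages.append((url, html))
        for link in extract_links(html):      # discover outgoing links
            if link not in visited:
                visited.add(link)
                queue.append(link)            # enqueue at the back -> BFS order
    return pages
```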

Handle too many requests (Url Frontier)

The purpose is to regulate the rate of visits to any single server within a short period (politeness).

  • The Queue Router (QR) ensures that each queue (Q1, Q2, …, Qn) only contains URLs from the same host.

  • The QR uses a mapping table (see it here).

  • Each queue holds the URLs of a single host (FIFO pattern).

  • The Queue Selector (QS) maps each worker to a FIFO queue; the worker thread downloads web pages one by one from that host. A minimal sketch follows this list.
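A minimal sketch of this politeness mechanism (class and method names are hypothetical; a real implementation would use persistent queues and dedicated worker threads):

```python
from collections import deque
from urllib.parse import urlparse

class UrlFrontier:
    """Politeness sketch: one FIFO queue per host, one worker mapped to each queue."""

    def __init__(self):
        self.host_to_queue = {}   # mapping table: host -> queue id (role of the Queue Router)
        self.queues = []          # each queue holds URLs of a single host

    def add_url(self, url):
        host = urlparse(url).netloc
        if host not in self.host_to_queue:            # new host -> new queue
            self.host_to_queue[host] = len(self.queues)
            self.queues.append(deque())
        self.queues[self.host_to_queue[host]].append(url)   # FIFO within the host

    def next_url(self, queue_id):
        """Queue Selector side: the worker mapped to queue_id takes URLs one by one."""
        q = self.queues[queue_id]
        return q.popleft() if q else None

# Usage sketch
frontier = UrlFrontier()
frontier.add_url("https://example.com/a")
frontier.add_url("https://example.com/b")
frontier.add_url("https://example.org/x")
print(frontier.next_url(0))   # https://example.com/a (same-host URLs stay in FIFO order)
```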

Scalability to handle billions of web pages (Url Frontier)

The purpose is to prioritize URLs and to remain highly efficient at scale through parallelization.

  • Front queues manage prioritization (see the sketch after this list);

  • Back queues manage the quantity of requests sent to each host.
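A hedged sketch of the front-queue prioritization under these assumptions: the number of priority levels and the weighting scheme are illustrative, and the priority value itself would come from a separate prioritizer (e.g. based on PageRank, traffic, or update frequency):

```python
import random
from collections import deque

class FrontQueues:
    """Prioritization sketch: higher-priority queues are polled more often."""

    def __init__(self, num_priorities=3):
        self.queues = [deque() for _ in range(num_priorities)]

    def add_url(self, url, priority):
        """priority 0 = highest; computed upstream by a prioritizer (assumption)."""
        self.queues[min(priority, len(self.queues) - 1)].append(url)

    def next_url(self):
        """Biased random selection: high-priority queues are chosen with higher probability."""
        n = len(self.queues)
        weights = [2 ** (n - i) for i in range(n)]
        order = random.choices(range(n), weights=weights, k=n)
        for i in order:
            if self.queues[i]:
                return self.queues[i].popleft()
        for q in self.queues:          # fall back to any non-empty queue
            if q:
                return q.popleft()
        return None
```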

Performance optimization (HTML Downloader & DNS Resolver)

The purpose is to make the system robust and able to handle failure cases (crashes, malicious links, etc.).
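A hedged sketch of two common optimizations for the HTML Downloader and the DNS Resolver: caching DNS answers to avoid repeated synchronous lookups, and downloading with a short timeout so one bad page cannot stall a worker. The cache TTL and timeout values are assumptions; only Python standard-library calls are used:

```python
import socket
import time
import urllib.request

_dns_cache = {}            # host -> (ip, expiry timestamp)
DNS_TTL_SECONDS = 300      # assumption: cache DNS answers for 5 minutes

def resolve(host):
    """DNS Resolver with a simple in-process cache."""
    entry = _dns_cache.get(host)
    if entry and entry[1] > time.time():
        return entry[0]
    ip = socket.gethostbyname(host)            # blocking DNS lookup
    _dns_cache[host] = (ip, time.time() + DNS_TTL_SECONDS)
    return ip

def download(url, timeout=5):
    """HTML Downloader: short timeout so a slow or malicious server cannot block a worker."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return None        # a failure on one page must not crash the crawler
```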

STEP 4 - Pros & Cons

Pros

  • To distribute the load, we can use consistent hashing, which makes it easy to add or remove a downloader server (see the sketch after this list).

  • Further optimization is possible; the rate regulation works like a throttling pattern.
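A minimal consistent-hashing sketch showing how downloader servers can be added or removed while only re-mapping nearby keys (a simplified ring without virtual nodes; server names are hypothetical):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps each host/URL to a downloader server on a hash ring."""

    def __init__(self, servers):
        self.ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        bisect.insort(self.ring, (self._hash(server), server))

    def remove_server(self, server):
        self.ring.remove((self._hash(server), server))

    def get_server(self, key):
        """Walk clockwise to the first server at or after the key's hash."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

# Usage sketch
ring = ConsistentHashRing(["downloader-1", "downloader-2", "downloader-3"])
print(ring.get_server("example.com"))
ring.add_server("downloader-4")   # only the keys between two ring points move
```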

Cons

  • Redundant content: nearly 30% of web pages are duplicates (see the deduplication sketch after this list).

  • A web page could trap the crawler in an infinite loop (spider trap).

  • Some content has no value (spam URLs, advertisements, etc.).
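A hedged sketch of one way to mitigate the redundant-content problem by comparing content hashes. Using MD5 over the raw HTML is an assumption for illustration; production crawlers often use fingerprints such as SimHash to catch near-duplicates, and the set would live in a shared store rather than in memory:

```python
import hashlib

seen_hashes = set()   # assumption: in practice a distributed store, not an in-memory set

def is_duplicate(html: str) -> bool:
    """Return True if an identical page body has already been crawled."""
    digest = hashlib.md5(html.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

# Usage sketch: skip storage and indexing when the content was already seen
assert is_duplicate("<html>same body</html>") is False
assert is_duplicate("<html>same body</html>") is True
```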
