Tip: The framework proposed in this space (Alex Xu) is applied here to propose the design: Getting started - a framework to propose...
Introduction
A web crawler, also known as a robot or spider, is typically used by search engines to discover new or updated content on the web (Alex Xu): web pages, videos, PDF files, etc. A minimal sketch of the basic crawl loop follows the list of purposes below.
Purposes:
- Search engine indexing (SEI): build a local index for search engines.
- Web archiving (WA): collect information from the web to preserve data for future use.
- Web mining (WMI): mine data or useful knowledge from the internet.
- Web monitoring (WMO): monitor copyright and trademark infringements across the internet.
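To make this concrete, here is a minimal single-threaded sketch of the basic crawl loop: fetch a page, extract its links, and enqueue unseen URLs. This is an illustrative sketch only; the function and variable names, and the use of the `requests` and `BeautifulSoup` libraries, are assumptions rather than part of the design that follows.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                    # assumed dependency: pip install requests
from bs4 import BeautifulSoup      # assumed dependency: pip install beautifulsoup4


def crawl(seed_urls, max_pages=100):
    """BFS over the web graph: fetch a page, extract links, enqueue unseen URLs."""
    frontier = deque(seed_urls)    # URLs waiting to be downloaded
    seen = set(seed_urls)          # avoids re-crawling the same URL
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue               # skip unreachable pages
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue               # Step 1 requirement: HTML only
        pages[url] = resp.text
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

A real crawler replaces the in-memory queue and set with distributed components (the "boxes" of the high-level design in Step 2), but the fetch-parse-enqueue cycle stays the same.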
STEP 1 - Understand the problem and establish the scope
Question | Answer
---|---
Main purpose: SEI, WA, …? | Web crawler for search engine indexing (SEI)
How many pages to collect per month? | 1 billion pages
What content types? HTML, PDF, etc. | HTML only
Target: new content or updated content? | Both new and updated content
Store HTML pages? How long? | Yes, for up to 5 years
Duplicate content to handle? | Duplicate content is ignored
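These answers support a rough back-of-the-envelope estimation. The 1 billion pages/month and 5-year retention figures come from the table above; the 500 KB average page size and the 2x peak factor are assumptions introduced here for illustration only.

```python
# Back-of-the-envelope estimation from the Step 1 answers.
PAGES_PER_MONTH = 1_000_000_000        # from the requirements table
SECONDS_PER_MONTH = 30 * 24 * 3600
AVG_PAGE_SIZE_KB = 500                 # ASSUMPTION: average HTML page size
PEAK_FACTOR = 2                        # ASSUMPTION: peak load is 2x average

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH                     # ≈ 386 pages/second
peak_qps = PEAK_FACTOR * qps                                  # ≈ 772 pages/second
storage_month_tb = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1e9   # ≈ 500 TB/month
storage_5y_pb = storage_month_tb * 12 * 5 / 1000              # ≈ 30 PB over 5 years

print(f"average ≈ {qps:.0f} QPS, peak ≈ {peak_qps:.0f} QPS")
print(f"storage ≈ {storage_month_tb:.0f} TB/month, ≈ {storage_5y_pb:.0f} PB over 5 years")
```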
STEP 2 - High-level design
[Figure: high-level design diagram]
Glossary explaining the boxes
See the page: References & Glossary.
Web Crawler Workflow
[Figure: web crawler workflow diagram]
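As one concrete illustration of the "duplicate content is ignored" requirement from Step 1, the sketch below shows a possible content-seen check that detects duplicates by hashing page bodies. Whether the workflow deduplicates exactly this way is an assumption; a production crawler would also keep the digests in a shared store rather than an in-memory set.

```python
import hashlib

# Sketch of a "content seen?" check: a page body is considered a duplicate
# if an identical body has already been crawled (ASSUMPTION: exact-match
# hashing; near-duplicate detection would need a different technique).
_seen_digests = set()

def is_duplicate(html: str) -> bool:
    """Return True if an identical page body was already crawled."""
    digest = hashlib.sha256(html.encode("utf-8")).digest()
    if digest in _seen_digests:
        return True
    _seen_digests.add(digest)
    return False
```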