References & Glossary for Web Crawler

  1. Seed URLs : Starting point for the crawl. Target a domain name and split the entire URL space into smaller ones.

  2. URL Frontier : Stores the URLs that are still to be downloaded.

  3. DNS Resolver : Translates the hostname of each URL taken from the URL Frontier into an IP address.

  4. Content Parser : Parses and validates each downloaded web page, because pages can be malformed.

  5. Content Seen ? : Detects whether newly downloaded content has already been stored in the system.

  6. Link Extractor (URL Extractor) : Parses and extracts links from HTML pages.

  7. URL Filter : Excludes certain content types, error links, etc.

  8. URL Seen ? : Data structure that keeps track of URLs that have already been visited or are already in the URL Frontier.

  9. Extracted links are filtered.

  10. The URL Seen ? component checks whether a URL is already in storage. If it is, nothing needs to be done.

  11. If URL has not been processed before, it is added to the URL Frontier.
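
Items 6 to 11 above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the names `LinkExtractor`, `url_filter`, `process_page` and `BLOCKED_EXTENSIONS`, the blocked-extension rule, and the example.com URLs are all hypothetical. An in-memory set stands in for the URL Seen ? store and a deque for the URL Frontier.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (item 6)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical filter rule: keep only http(s) links whose path does not
# end in one of these extensions.
BLOCKED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip")


def url_filter(url):
    """Returns True if the URL should be kept (item 7)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return not parsed.path.lower().endswith(BLOCKED_EXTENSIONS)


def process_page(page_url, html, url_seen, frontier):
    """Extracts links, filters them, and enqueues URLs not seen before."""
    extractor = LinkExtractor(page_url)
    extractor.feed(html)
    for link in extractor.links:
        if not url_filter(link):
            continue            # item 9: link is filtered out
        if link in url_seen:
            continue            # item 10: already known, nothing to do
        url_seen.add(link)      # item 11: new URL goes to the frontier
        frontier.append(link)


url_seen = {"https://example.com/"}
frontier = deque()
html = (
    '<a href="/about">About</a> <a href="/logo.png">Logo</a> '
    '<a href="/">Home</a> <a href="/about">Again</a>'
)
process_page("https://example.com/", html, url_seen, frontier)
print(list(frontier))  # ['https://example.com/about']
```

A set is enough for a small example. At the scale estimated in these notes, the URL Seen ? store would more likely be a Bloom filter or a hash table backed by disk.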

 

Back-of-the-envelope

  • 1 Billion web pages downloaded per month

  • QPS: 1 000 000 000 / 30 days / 24 hours / 3600 sec = approx. 400 pages / second.

  • Peak QPS = 2 * QPS = 800 (pages are both newly created and updated)

  • Average web page size = 500 KB

  • 1 billion pages * 500 KB = 500 TB storage per month.

  • Data stored for 5 years : 500 TB * 12 months * 5 years = 30 PB.
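
The arithmetic above can be checked with a few lines of Python. The constant names are chosen here for illustration, and decimal units are assumed (1 TB = 10^9 KB, 1 PB = 1000 TB).

```python
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 seconds in a 30-day month
AVG_PAGE_SIZE_KB = 500
RETENTION_YEARS = 5

# Average and peak download rate in pages per second.
qps = PAGES_PER_MONTH / SECONDS_PER_MONTH
peak_qps = 2 * qps

# Storage, assuming decimal units: 1 TB = 10**9 KB, 1 PB = 1000 TB.
storage_tb_per_month = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 10**9
storage_pb_total = storage_tb_per_month * 12 * RETENTION_YEARS / 1000

print(f"QPS: {qps:.0f}")                                    # QPS: 386
print(f"Peak QPS: {peak_qps:.0f}")                          # Peak QPS: 772
print(f"Storage per month: {storage_tb_per_month:.0f} TB")  # Storage per month: 500 TB
print(f"Storage over 5 years: {storage_pb_total:.0f} PB")   # Storage over 5 years: 30 PB
```

The exact rate is about 386 pages per second. The notes round it up to 400, which is why the peak figure is quoted as 800 rather than 772.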

 

 

Mapping table

Host              Queue
mywebsite1.com    Q1
mywebsite2.com    Q2
mywebsiten.com    Qn
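
One possible reading of the mapping table is that every URL is routed to a queue dedicated to its host, so that all URLs of one host end up in the same queue. The sketch below assumes that reading. The `HostQueueRouter` class, its method names, and the rule of naming queues Q1, Q2, ... in order of first appearance are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


class HostQueueRouter:
    """Keeps a host -> queue mapping table and one FIFO queue per host."""

    def __init__(self):
        self.mapping = {}   # host -> queue name, e.g. "mywebsite1.com" -> "Q1"
        self.queues = {}    # queue name -> deque of URLs waiting to be fetched

    def enqueue(self, url):
        """Puts the URL in its host's queue and returns the queue name."""
        host = urlparse(url).hostname
        if host not in self.mapping:
            # First URL for this host: assign the next free queue.
            name = f"Q{len(self.mapping) + 1}"
            self.mapping[host] = name
            self.queues[name] = deque()
        name = self.mapping[host]
        self.queues[name].append(url)
        return name


router = HostQueueRouter()
print(router.enqueue("https://mywebsite1.com/a"))  # Q1
print(router.enqueue("https://mywebsite2.com/b"))  # Q2
print(router.enqueue("https://mywebsite1.com/c"))  # Q1
print(router.mapping)
```

Keeping one queue per host makes it straightforward to have a single worker drain each queue with a delay between requests, which avoids sending many parallel requests to the same site.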