Crawling & Scraping
A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies the hyperlinks on each page and adds them to the list of URLs still to visit, called the crawl frontier. URLs from the frontier are visited recursively according to a set of policies. If the crawler is archiving websites, it copies and saves each page as it goes. The archives are usually stored so that they can be viewed, read, and navigated as they appeared on the live web, preserved as 'snapshots'. Because the Web is so large, a crawler can download only a limited number of pages in a given time, so it must prioritize its downloads. And because the Web changes so quickly, pages may already have been updated, or even deleted, by the time the crawler reaches them.
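The seed-and-frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `LINK_GRAPH` dictionary and the `fetch_links` helper are hypothetical stand-ins for live HTTP fetches and HTML link extraction.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for live HTTP fetches;
# a real crawler would download each page and parse its <a href> tags.
LINK_GRAPH = {
    "https://example.com/":  ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}

def fetch_links(url):
    """Stand-in for fetching a page and extracting its hyperlinks."""
    return LINK_GRAPH.get(url, [])

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)          # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)                # everything ever queued, to avoid re-crawling
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()     # FIFO order gives a breadth-first crawl
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:     # skip URLs already queued or visited
                seen.add(link)
                frontier.append(link)
    return visited
```

The `max_pages` cap reflects the prioritization point above: since the crawler cannot download everything, it stops after a budget of pages, and the order in which the frontier is drained decides which pages make the cut.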
Vonec Technologies provides custom Crawling services to businesses of various sizes.
Features of Crawling
- Selection policy, which states which pages to download.
- Re-visit policy, which states when to check for changes to pages.
- Politeness policy, which states how to avoid overloading websites.
- Parallelization policy, which states how to coordinate distributed web crawlers.
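As an illustration of the politeness policy, the sketch below rate-limits requests per host. The `PolitenessGate` class name, the delay value, and the injectable clock are all assumptions made for this example; real crawlers typically combine such a delay with robots.txt rules.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock        # injectable clock, so behavior can be tested
        self.last_hit = {}        # host -> timestamp of the last request

    def wait_time(self, url):
        """Seconds the crawler should still wait before fetching this URL."""
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        if last is None:          # host never contacted: no need to wait
            return 0.0
        return max(0.0, self.delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to this URL's host was just made."""
        self.last_hit[urlparse(url).netloc] = self.clock()
```

Keying the delay on the host rather than the full URL is the essential design choice: two URLs on the same site share one budget, while an unrelated site can be fetched immediately.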