Web Crawler

A Web crawler is an Internet bot that systematically browses the World Wide Web by fetching pages and following the links between them.
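The core primitive behind "following links" is extracting the anchor targets from a fetched page and resolving them against the page's URL. As a minimal sketch (using only Python's standard library, and a hardcoded HTML snippet in place of a real network fetch):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets of anchor tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# Stand-in for the body of a fetched page.
html = '<a href="/about">About</a> <a href="https://example.com/blog">Blog</a>'
parser = LinkExtractor("https://example.com")
parser.feed(html)
print(parser.links)
# ['https://example.com/about', 'https://example.com/blog']
```

A real crawler would repeat this in a loop: fetch a URL, extract its links, and queue any not-yet-visited links for fetching.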

Web crawlers are useful in machine learning for collecting data that can later be used in modeling tasks such as training and prediction.
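Crawled records are often stored in a line-oriented format so they can be streamed into a training pipeline later. A minimal sketch, using hypothetical items of the shape the spider below yields and an illustrative filename:

```python
import json

# Hypothetical items, shaped like a crawler's yielded dictionaries.
items = [
    {"title": "Intro to Scrapy"},
    {"title": "Crawling at scale"},
]

# Write one JSON record per line (JSON Lines), a common dataset format.
with open("titles.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")

# Reading the file back gives one record per line.
with open("titles.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(len(records))
# 2
```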

Python Example

This example uses the Scrapy library.


Run the Scrapy web crawler with the following terminal command:

scrapy runspider web_crawler.py
# Import the Scrapy library, which must be installed (e.g. pip install scrapy).
import scrapy
 
# Define a spider class holding the data and methods for the scraping process.
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    
    # Define the URL(s) where the crawl starts.
    start_urls = ['https://blog.scrapinghub.com']
  
    # Define the method used for parsing the scrape responses.
    def parse(self, response):
  
        # Extract the title text from each post header matched by the CSS selector.
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}
  
        # Follow the link to the next page of posts and parse it the same way.
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)