Learn Python - Day 02-03 - Building a Search Engine with a Python Distributed Crawler (Scrapy)

Section 347, Python Distributed Crawler Building a Search Engine, Scrapy Essentials - Randomly Changing the User-Agent Request Header via Downloader Middleware

Introduction to downloader middleware
Downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering the requests and responses that Scrapy sends and receives. In other words, middleware sits between the Request and the Response and can modify both of them globally.
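
For reference, a downloader middleware is just a class that implements some of the hook methods Scrapy calls during download. A minimal skeleton (the method names are the standard Scrapy hooks; the class itself is only illustrative) looks like this:

class ExampleDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Called for each request passing through the downloader middleware.
        # Return None to continue, or a Response/Request object to short-circuit
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader
        return response

    def process_exception(self, request, exception, spider):
        # Called when downloading or process_request() raises an exception
        pass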

UserAgentMiddleware, the default middleware

UserAgentMiddleware is defined in useragent.py under scrapy/downloadermiddlewares in the Scrapy source code; it is enabled as a default middleware.

We can see from the source that, by default, requests are sent with the User-Agent "Scrapy", which is easily identified by websites, and the crawler then gets blocked.
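
For reference, the default middleware looks roughly like this in the Scrapy source (simplified here; the exact code differs between Scrapy versions):

class UserAgentMiddleware(object):
    """Allows spiders to override the default user agent."""

    def __init__(self, user_agent='Scrapy'):
        # 'Scrapy' is the fallback when no USER_AGENT setting is given
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENT value from settings.py
        return cls(crawler.settings['USER_AGENT'])

    def process_request(self, request, spider):
        if self.user_agent:
            # Only sets the header if the request does not already have one
            request.headers.setdefault(b'User-Agent', self.user_agent)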

We can replace the default UserAgentMiddleware with our own middleware that randomly changes the User-Agent browser string in the request headers.

Step 1, Enable middleware registration in the settings.py configuration file

DOWNLOADER_MIDDLEWARES={ }

Set the default UserAgentMiddleware to None to disable it (or give it the largest order value so that it executes last), so that our custom middleware, which replaces the default user agent, executes first.

settings.py configuration file

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {              # Enable middleware registration
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, # Disable the default UserAgentMiddleware
}

Step 2, Install the browser user agent module fake-useragent 0.1.7

fake-useragent is a module designed for crawlers to disguise the browser User-Agent request header. The module maintains an online database of user-agent strings for various browsers and versions and gives us access to them.

Online browser information: http://fake-useragent.herokuapp.com/browsers/0.1.7 , which is where fake-useragent randomly picks browser user agents from.

Install this module first

pip install fake-useragent

Usage:

#!/usr/bin/env python
# -*- coding:utf8 -*-

from fake_useragent import UserAgent  # Import the browser user-agent module
ua = UserAgent()                      # Instantiate the UserAgent class

ua.ie                                 # Get a random IE user agent
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);
ua.msie                               # Get a random MSIE user agent, same pattern below
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)'
ua['Internet Explorer']
# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
ua.chrome
# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'
ua.google
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
ua['google chrome']
# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
ua.firefox
# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
ua.ff
# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1
ua.safari
# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25

# and the best one, random via real-world browser usage statistics
ua.random                               # Get a random user agent weighted by real-world usage

For more usage details, see https://pypi.python.org/pypi/fake-useragent/0.1.7
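
Note that fake-useragent downloads its browser data over the network, so instantiating UserAgent can occasionally fail when the data source is unreachable. A common workaround is to supply a fallback string (the fallback argument exists in recent 0.1.x releases; treat this as something to verify against your installed version):

from fake_useragent import UserAgent

# If the online database cannot be fetched, fall back to a fixed User-Agent
# string instead of raising an error (check that your fake-useragent version
# supports the fallback argument)
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
print(ua.random)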

Step 3, Write a custom middleware that globally and randomly changes the User-Agent in the request headers

Define the custom middleware in the middlewares.py file

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from fake_useragent import UserAgent    # Import the browser user-agent module

class RequestsUserAgentmiddware(object):                                    # Custom user-agent middleware
    # Randomly changes the User-Agent in the request headers
    def __init__(self, crawler):
        super(RequestsUserAgentmiddware, self).__init__()                   # Call the parent class's __init__ method
        self.ua = UserAgent()                                               # Instantiate the browser user-agent class
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')     # Read RANDOM_UA_TYPE from settings.py; defaults to 'random', which picks any browser type

    @classmethod                                                            # @classmethod passes the class itself in as the first parameter, cls
    def from_crawler(cls, crawler):                                         # Scrapy calls from_crawler to build the middleware instance
        return cls(crawler)                                                 # Pass the crawler object into the constructor

    def process_request(self, request, spider):                             # Called for every outgoing request
        def get_ua():                                                       # Helper that returns a user agent of the configured type from the UserAgent object
            return getattr(self.ua, self.ua_type)
        request.headers.setdefault('User-Agent', get_ua())                  # Set the User-Agent header on the request
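
The middleware reads an optional RANDOM_UA_TYPE setting to decide which kind of browser user agent to return. RANDOM_UA_TYPE is our own setting name, not a built-in Scrapy setting, so to pin the crawler to one browser family we can add something like this to settings.py:

# settings.py
# Which fake-useragent attribute the middleware uses:
# 'random' picks any browser; 'chrome', 'firefox', 'ie', ... pin one family
RANDOM_UA_TYPE = 'random'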

Step 4, Register our custom middleware in DOWNLOADER_MIDDLEWARES in the settings.py configuration file

Note that the default UserAgentMiddleware must be set to None so that our custom middleware takes effect.

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {              # Enable middleware registration
   'adc.middlewares.RequestsUserAgentmiddware': 543,                   # Register our custom middleware
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, # Disable the default UserAgentMiddleware
}

We can set a breakpoint in process_request and debug to verify that it works.
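
Another quick check is to crawl an endpoint that echoes the request headers and look at what the server received. The throwaway spider below is only illustrative and uses http://httpbin.org/user-agent, which returns the User-Agent header it was sent:

import scrapy

class UACheckSpider(scrapy.Spider):
    # One-off spider: httpbin.org/user-agent echoes back the User-Agent header,
    # so each request should log a different browser string if the middleware works
    name = 'ua_check'
    start_urls = ['http://httpbin.org/user-agent'] * 3

    def parse(self, response):
        self.logger.info(response.text)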

Schematic diagram



Posted on Mon, 17 Feb 2020 17:14:55 -0800 by bp90210