Python Learning: Crawlers and Network Data Collection

Python makes it very convenient to fetch web pages, and it provides this capability mainly through two libraries:
urllib and requests.

urllib for network data collection

urllib Library
Official Document Address: https://docs.python.org/3/library/urllib.html
The urllib library is Python's built-in HTTP request library. It contains the following modules:
(1) urllib.request: request module
(2) urllib.error: exception handling module
(3) urllib.parse: URL parsing module
(4) urllib.robotparser: robots.txt parsing module
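As a quick illustration of the parsing and robots.txt modules, here is a minimal sketch (the docs.python.org URLs are just example targets):

from urllib.parse import urlparse, urlencode
from urllib.robotparser import RobotFileParser

# Split a URL into its components
parts = urlparse('https://docs.python.org/3/library/urllib.html')
print(parts.scheme, parts.netloc, parts.path)

# Encode a dict as a query string
print(urlencode({'wd': 'python', 'page': 1}))   # wd=python&page=1

# Check the site's robots.txt rules before crawling
rp = RobotFileParser('https://docs.python.org/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://docs.python.org/3/'))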

urllib library: urlopen
urlopen makes simple site requests and does not support advanced HTTP features such as authentication and cookies.
To use these features, you must use the OpenerDirector object returned by the build_opener() function.
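Here is a minimal sketch of both styles, using httpbin.org as a public echo service (the URLs are only example targets):

from urllib.request import urlopen, build_opener, HTTPCookieProcessor
from http.cookiejar import CookieJar

# Simple request: urlopen is enough when no advanced features are needed
with urlopen('http://httpbin.org/get') as response:
    print(response.status)
    print(response.read().decode('utf-8'))

# Advanced request: build_opener() returns an OpenerDirector,
# here extended with cookie handling
opener = build_opener(HTTPCookieProcessor(CookieJar()))
response = opener.open('http://httpbin.org/cookies/set?name=value')
print(response.read().decode('utf-8'))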

urllib library: requesting a website with a disguised User-Agent
To keep crawlers from overwhelming them, many websites only serve requests that carry certain header information.
We can specify the request headers through the urllib.request.Request object.
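For example, a minimal sketch that sends the same Firefox User-Agent string used later in this article (httpbin.org is just an example target):

from urllib.request import Request, urlopen

url = 'http://httpbin.org/get'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

# Wrap the URL and headers in a Request object, then open it as usual
request = Request(url, headers=headers)
with urlopen(request) as response:
    print(response.read().decode('utf-8'))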

requests library for network data collection

requests Library
Official web address for requests: https://requests.readthedocs.io/en/master/
Requests is an elegant and simple HTTP library for Python, built for human beings.
requests method summary

The Response object contains all the information returned by the server as well as the requested Request information.
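The main helper functions map one-to-one onto the HTTP verbs, and each call returns a Response object. A minimal sketch, again using httpbin.org as an example target:

import requests

# One helper per HTTP verb; post/put/delete/head/options work the same way
response = requests.get('http://httpbin.org/get')

# Commonly used Response attributes
print(response.status_code)      # HTTP status code, e.g. 200
print(response.url)              # final URL after any redirects
print(response.encoding)         # guessed text encoding
print(response.text)             # body decoded as str
print(response.content)          # raw body as bytes
print(response.headers)          # response headers
print(response.request.headers)  # the Request information that was sent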

reqursts.py

import requests

def get():
    # The get method can either get page data or submit non-sensitive data
    #url = 'http://127.0.0.1:5000/'
    #url = 'http://127.0.0.1:5000/?username=fentiao&page=1&per_page=5'
    url = 'http://127.0.0.1:5000/'
    try:
        params = {
            'username': 'fentiao',
            'page': 1,
            'per_page': 5
        }
        response = requests.get(url, params=params)
        print(response.text, response.url)
        #print(response)
        #print(response.status_code)
        #print(response.text)
        #print(response.content)
        #print(response.encoding)
    except requests.exceptions.RequestException as e:
        print("Failed to crawl %s: %s" % (url, e))

def post():
    url = 'http://127.0.0.1:5000/post'
    try:
        data = {
            'username': 'admin',
            'password': 'westos12'
        }
        response = requests.post(url, data=data)
        print(response.text)
    except requests.exceptions.RequestException as e:
        print("Failed to crawl %s: %s" % (url, e))

if __name__ == '__main__':
    get()
    #post()

Advanced Application 1: Add headers

Some websites can only be accessed with browser-like header information; if no headers are passed in, the request fails or is rejected.
headers = { 'User-Agent': useragent}
response = requests.get(url, headers=headers)
The User-Agent is a string that identifies the browser, rather like the browser's ID card. When crawling website data,
changing the User-Agent frequently helps avoid triggering a site's anti-crawl mechanisms. The fake-useragent library
provides good support for rotating User-Agents:
user_agent = UserAgent().random

import requests
from fake_useragent import UserAgent

def add_headers():
    # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
    # UserAgent essentially obtains a pool of real user agents from the network and randomly selects one.
    # https://fake-useragent.herokuapp.com/browsers/0.1.11
    ua = UserAgent()
    # By default, the User-Agent sent by a requests-based crawler is python-requests/2.22.0.
    response = requests.get('http://127.0.0.1:5000', headers={'User-Agent': ua.random})
    print(response)

if __name__ == '__main__':
    add_headers()

Advanced Application 2: IP Proxy Settings

When crawling, the crawler is sometimes blocked by the server. The main countermeasures are to increase the interval
between requests and to access the site through proxy IPs. Proxy IPs can be scraped from the Internet or purchased online (e.g. on Taobao).
proxies = { "http": "http://127.0.0.1:9743", "https": "https://127.0.0.1:9743",}
response = requests.get(url, proxies=proxies)
Baidu keyword interface:
https://www.baidu.com/baidu?wd=xxx&tn=monline_4_dg
360 keyword interface: http://www.so.com/s?q=keyword
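A minimal sketch of querying the Baidu keyword interface above with requests (the keyword 'python' is just an example, and Baidu may still serve a verification page to obvious crawlers):

import requests
from fake_useragent import UserAgent

def search_baidu(keyword):
    # Build the query string through params instead of concatenating it by hand
    url = 'https://www.baidu.com/baidu'
    params = {'wd': keyword, 'tn': 'monline_4_dg'}
    headers = {'User-Agent': UserAgent().random}
    response = requests.get(url, params=params, headers=headers)
    print(response.url, response.status_code)
    return response.text

if __name__ == '__main__':
    search_baidu('python')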

import requests
from fake_useragent import UserAgent

ua = UserAgent()
proxies = {
    'http': 'http://222.95.144.65:3000',
    'https': 'https://182.92.220.212:8080'
}
response = requests.get('http://47.92.255.98:8000',
                        headers={'User-Agent': ua.random},
                        proxies=proxies
                        )

print(response)
# The server at this address echoes back the submitted data and the requesting client IP.
# How do you determine success? If the client IP that comes back is the proxy IP, the proxy is working.
print(response.text)
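Another way to confirm that the proxy is in effect is to ask a service that simply echoes the client IP, for example httpbin.org/ip (a minimal sketch; the proxy addresses are the same sample ones used above and may no longer be live):

import requests

proxies = {
    'http': 'http://222.95.144.65:3000',
    'https': 'https://182.92.220.212:8080'
}

# httpbin.org/ip returns JSON such as {"origin": "x.x.x.x"};
# if "origin" matches the proxy address, the proxy is being used
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
print(response.json()['origin'])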
