Using the Python network request module requests

Network request module

requests

Introduction

The requests module can imitate a browser to send requests and receive responses
The requests module works in both Python 2 and Python 3
The requests module helps us fetch web content automatically

Installation of requests module

pip install requests

If you have both Python 2 and Python 3 environments on your machine and want to install the module for Python 3, it is recommended to install it as follows

pip3 install requests

Use of requests module

Basic use

  • Usage
# Import module
import requests
# Define request address
url = 'http://www.baidu.com'
# Send GET request and get the response
response = requests.get(url)
# Get the html content of the response
html = response.text
  • Code explanation
  • Common response properties (see the example after this list)
    • response.text returns the response content as a str
    • response.content returns the response content as bytes
    • response.status_code returns the response status code
    • response.request.headers returns the request headers
    • response.headers returns the response headers
    • response.cookies returns the RequestsCookieJar object of the response
  • Converting response.content to str
# Get byte data
content = response.content
# Convert to string type
html = content.decode('utf-8')
  • response.cookies operations
# Return the RequestsCookieJar object
cookies = response.cookies
# Convert a RequestsCookieJar to a dict
cookie_dict = requests.utils.dict_from_cookiejar(cookies)
# Convert a dict to a RequestsCookieJar
cookie_jar = requests.utils.cookiejar_from_dict(cookie_dict)
# Add a dictionary of cookies to an existing cookie jar
requests.utils.add_dict_to_cookiejar(cookies, {"key": "value"})
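As a quick check of the properties listed above, the following sketch (assuming http://www.baidu.com is reachable) prints a few of them and round-trips the cookie jar through a dict:

# Import module
import requests

response = requests.get('http://www.baidu.com')
# Status code, request headers that were actually sent, and response headers
print(response.status_code)
print(response.request.headers)
print(response.headers)
# Convert the RequestsCookieJar to a dict and back again
cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)
cookie_jar = requests.utils.cookiejar_from_dict(cookie_dict)
print(cookie_dict)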

Custom request header

  • Usage
# Import module
import requests
# Define request address
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Send the request with the custom request headers
response = requests.get(url,headers=headers)
# Get the html content of the response
html = response.text
  • Code explanation

Pass the headers parameter to send custom request headers with the request
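To confirm the custom header was actually sent, response.request.headers can be inspected; a minimal self-contained sketch (assuming http://www.baidu.com is reachable):

# Import module
import requests

url = 'http://www.baidu.com'
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
response = requests.get(url, headers=headers)
# The User-Agent we supplied shows up in the headers of the request that was sent
print(response.request.headers['User-Agent'])
# Without a headers argument, requests sends its own default User-Agent
print(requests.get(url).request.headers['User-Agent'])   # e.g. python-requests/2.x.x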

Send GET request

  • Usage
# Import module
import requests
# Define request address
url = 'http://www.baidu.com/s'
# Define custom request headers
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define GET request parameters
params = {
  "kw":"hello"
}
# Send request using GET request parameter
response = requests.get(url,headers=headers,params=params)
# Get the html content of the response
html = response.text
  • Code explanation

Pass the params parameter as the GET query parameters when sending the request
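The params dict is URL-encoded into the query string for you. One way to see the resulting URL without sending anything is to prepare the request first (requests.Request and .prepare() are part of the requests API); a minimal sketch:

# Import module
import requests

url = 'http://www.baidu.com/s'
params = {"kw": "hello world", "pn": 10}
# Build the request without sending it, then inspect the URL it would use
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)   # http://www.baidu.com/s?kw=hello+world&pn=10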

Send POST request

  • Usage
# Import module
import requests
# Define request address
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define post request parameters
data = {
  "kw":"hello"
}

# Send a request using the POST request parameter
response = requests.post(url,headers=headers,data=data)
# Get the html content of the response
html = response.text
  • Code explanation

Pass the data parameter as the POST request body (form data) when sending the request
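Note that baidu.com does not actually process this POST body; to see the form data being received, a hedged sketch against the public echo service httpbin.org (an assumption, not part of the original example) can be used:

# Import module
import requests

# httpbin.org/post simply echoes back the form data it receives
response = requests.post("https://httpbin.org/post", data={"kw": "hello"})
print(response.json()["form"])    # {'kw': 'hello'}
# To send a JSON body instead of form data, use the json parameter
response = requests.post("https://httpbin.org/post", json={"kw": "hello"})
print(response.json()["json"])    # {'kw': 'hello'}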

Save pictures

  • Usage
# Import module
import requests
# URL of the image to download
url = "http://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png"
# Send request to get response
response = requests.get(url)
# Save pictures
with open('image.png','wb') as f:
  f.write(response.content)

  • Code explanation

When saving the image, use the same file extension as the one in the requested URL

You must save the file with response.content (bytes), not response.text
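For large files the whole body does not have to be held in memory; a sketch using the stream parameter and iter_content (both part of requests):

# Import module
import requests

url = "http://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png"
# stream=True defers downloading the body until it is iterated over
response = requests.get(url, stream=True)
with open('image.png', 'wb') as f:
    # Write the file in small chunks instead of loading it all at once
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)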

Use proxy server

  • Purpose
    • Make the server think the requests do not come from the same client
    • Prevent our real IP address from being exposed and traced
  • The process of using a proxy

  • Proxy classification
  • Transparent proxy: although a transparent proxy appears to "hide" your IP address, the target server can still find out who you are.
  • Anonymous proxy: a step up from a transparent proxy; others can only tell that you are using a proxy, not who you are.
  • Distorting proxy: like an anonymous proxy, others can still tell that you are using a proxy, but they see a fake IP address, so the disguise is more convincing.
  • High anonymity proxy (elite proxy): others cannot detect that you are using a proxy at all, so it is the best choice.

In practice, the high anonymity proxy is without doubt the best choice

By protocol, proxy IPs can be divided into HTTP proxies, HTTPS proxies, SOCKS proxies and so on; choose according to the protocol of the site you are crawling

  • Usage
# Import module
import requests
# Define request address
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Define the proxy servers
proxies = {
  "http":"http://IP address:port",
  "https":"https://IP address:port"
}
# Send request using the proxies parameter
response = requests.get(url,headers=headers,proxies=proxies)
# Get the html content of the response
html = response.text
  • Code explanation

Pass the proxies parameter to use a proxy when sending the request
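To check that the proxy is actually in use, the public service https://httpbin.org/ip (an assumption, not part of the original example) returns the IP address the server sees; the proxy address below is only a placeholder:

# Import module
import requests

# Placeholder address: substitute a real, working proxy before running
proxies = {
  "http": "http://1.2.3.4:8080",
  "https": "http://1.2.3.4:8080"
}
# With a working proxy, the echoed IP should be the proxy's, not yours
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
print(response.json())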

Sending requests with cookies

  • Usage

Mode 1: carry the cookies directly in the custom request header

Mode 2: carry a cookies dict via the cookies request parameter

  • Code
# Import module
import requests
# Define request address
url = 'http://www.baidu.com'
# Define custom request headers
headers = {
  "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
  # Mode 1: carry the Cookie content directly in the request header
  "Cookie": "Cookie value"
}
# Mode 2: define cookies
cookies = {
  "xx":"yy"
}
# Send the request carrying the cookies
response = requests.get(url,headers=headers,cookies=cookies)
# Get the html content of the response
html = response.text
  • Code explanation

The cookies parameter carries the cookies when sending the request
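Instead of passing cookies by hand, a requests.Session keeps cookies between requests automatically; a minimal sketch (the login URL and form fields below are hypothetical placeholders):

# Import module
import requests

# A Session remembers cookies set by the server and sends them on later requests
session = requests.Session()
# Hypothetical login endpoint: the cookies it sets are stored in the session
session.post("http://example.com/login", data={"user": "xx", "password": "yy"})
# This follow-up request carries those cookies automatically
response = session.get("http://example.com/profile")
print(session.cookies.get_dict())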

Handling certificate errors

  • Problem description

When requesting some HTTPS sites whose certificate cannot be verified (https://www.12306.cn/mormhweb/ used to be such a site), requests raises an SSL certificate verification error (requests.exceptions.SSLError).

  • Usage

# Import module
import requests

url = "https://www.12306.cn/mormhweb/"
# Ignore certificate verification
response = requests.get(url,verify=False)
  • Code explanation

If the verify parameter is set to False when sending the request, the CA certificate will not be verified
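With verify=False, urllib3 prints an InsecureRequestWarning on every request; if you accept the risk, the warning can be silenced with urllib3 (which ships as a dependency of requests):

# Import modules
import requests
import urllib3

# Suppress the InsecureRequestWarning emitted when verify=False is used
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get("https://www.12306.cn/mormhweb/", verify=False)
print(response.status_code)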

Timeout handling

  • Usage
# Import module
import requests

url = "https://www.baidu.com"
# Set a 5 second timeout
response = requests.get(url,timeout=5)
  • Code explanation

Set the timeout parameter (in seconds) when sending the request
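If the server does not answer within the given number of seconds, requests raises requests.exceptions.Timeout; a minimal sketch of handling it:

# Import module
import requests

url = "https://www.baidu.com"
try:
    # Raise an exception if no response arrives within 3 seconds
    response = requests.get(url, timeout=3)
    print(response.status_code)
except requests.exceptions.Timeout:
    print("The request timed out")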

Retry processing

  • Usage
#!/usr/bin/python3
# -*- coding: utf-8 -*-
'''
You can use the third-party retrying module:
1. pip install retrying
'''
import requests
# 1. Import module
from retrying import retry

# 2. Use a decorator to configure retries
# stop_max_attempt_number sets the maximum number of attempts
@retry(stop_max_attempt_number=3)
def parse_url(url):
    print("Visit url:",url)
    headers = {
        "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
    }
    proxies = {
        "http":"http://124.235.135.210:80"
    }
    # Set timeout parameters
    response = requests.get(url,headers=headers,proxies=proxies,timeout=5)
    return response.text

if __name__ == '__main__':
    url = "http://www.baidu.com"
    try:
        html = parse_url(url)
        print(html)
    except Exception as e:
        # Record the url in a log file so it can be analyzed manually later and re-requested
        print(e)
        
  • Code explanation
    Install the retrying module

The retrying module monitors a function through a decorator; if the function raises an exception, a retry is triggered

pip install retrying

  • Decorator settings for functions that need to be retried

Set the number of retries with @retry(stop_max_attempt_number=<number of retries>); a variant that also waits between attempts is sketched after the code below

# 1. Import module
from retrying import retry
# 2. Decorator setting retry function
@retry(stop_max_attempt_number=3)
def exec_func():
    pass
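retrying also accepts a wait between attempts; a hedged sketch (wait_fixed is given in milliseconds) assuming the module is installed:

import requests
from retrying import retry

# Retry up to 3 times, waiting 2 seconds (2000 ms) between attempts
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def fetch(url):
    response = requests.get(url, timeout=5)
    # A non-2xx status raises an exception, so it also triggers a retry
    response.raise_for_status()
    return response.text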

urllib

Using the urllib network library in Python 3

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# 1. Import module
import urllib.request

# 2. Initiate network request
# 2.1. Define request address
url = "https://github.com"
# 2.2. Custom request header
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "Referer": "https://github.com/",
    "Host": "github.com"
}

# Define request object
req = urllib.request.Request(
    url=url,
    headers=headers
)

# Send request
resp = urllib.request.urlopen(req)

# Processing response
with open('github.txt', 'wb') as f:
    f.write(resp.read())

Precautions for using urllib

  • If the URL contains content that needs to be escaped (such as a Chinese query term), escape it with urllib.parse.quote (an alternative using urllib.parse.urlencode is sketched at the end of this section)
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-

 # 1. Import module
 import urllib.request
 import urllib.parse

 # 2. Initiate request to get response

 wd = input("Please enter the query content:")

 # 2.1 define request address
 url = "https://www.baidu.com/s?wd="
 # 2.2 define custom request header
 headers = {
     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
 }
 # 2.3 define request object
 request = urllib.request.Request(
     url=url + urllib.parse.quote(wd),
     headers=headers
 )
 # 2.4 send request
 response = urllib.request.urlopen(request)

 # 3. Handling response
 with open('02.html','wb') as f:
     f.write(response.read())
  • response.read() returns a byte string; to get the string content you must decode it
 html = response.read().decode('utf-8')
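When several query parameters are needed, urllib.parse.urlencode builds and escapes the whole query string in one step; a minimal sketch (the parameter names are only examples):

# Import modules
import urllib.parse
import urllib.request

params = {"wd": "hello world", "pn": 10}
# urlencode escapes each key/value pair and joins them with '&'
url = "https://www.baidu.com/s?" + urllib.parse.urlencode(params)
print(url)   # https://www.baidu.com/s?wd=hello+world&pn=10

request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response = urllib.request.urlopen(request)
print(response.status)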

Reprinted from https://github.com/Kr1s77/Python-crawler-tutorial-starts-from-zero
