Python crawler -- using a proxy with the requests library

Before reading this article, you should already have a grasp of:

  • Python fundamentals
  • HTML basics
  • HTTP status codes

 

Let's take a look at the knowledge points covered in this article:

  • the GET method
  • the POST method
  • the headers parameter: simulating a real user
  • the data parameter: submitting form data
  • the proxies parameter: using a proxy
  • advanced learning

Install the requests library:

pip install requests
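
To make sure the install worked, you can print the installed version (any reasonably recent release is fine for this article):

import requests
print(requests.__version__)  # e.g. '2.22.0'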

 

First, let's look at the built-in documentation to see how requests introduces itself, using Python's help() function:

import requests
help(requests)

 

Output:

Help on package requests:

NAME
    requests

DESCRIPTION
    Requests HTTP Library
    ~~~~~~~~~~~~~~~~~~~~~
    
    Requests is an HTTP library, written in Python, for human beings. Basic GET
    usage:
    
       >>> import requests
       >>> r = requests.get('https://www.python.org')
       >>> r.status_code
       200
       >>> b'Python is a programming language' in r.content
       True
    
    ... or POST:
    
       >>> payload = dict(key1='value1', key2='value2')
       >>> r = requests.post('https://httpbin.org/post', data=payload)
       >>> print(r.text)
       {
         ...
         "form": {
           "key2": "value2",
           "key1": "value1"
         },
         ...
       }
    
    The other HTTP methods are supported - see `requests.api`. Full documentation
    is at <http://python-requests.org>.
    
    :copyright: (c) 2017 by Kenneth Reitz.
    :license: Apache 2.0, see LICENSE for more details.

 

As the docstring says, requests is an HTTP library written in Python "for human beings", and it already shows basic examples of the GET and POST methods.

GET method

OK, let's take Baidu as an example and test it:

import requests
r = requests.get('https://www.baidu.com')
print(r.status_code)  # print the returned HTTP status code
print(r.text)         # print the response body as text

 

The status code comes back as 200, which means the request succeeded and a page was returned normally. But look at the returned text: something is off. Chunks of HTML are missing, and at the very least the word "Baidu" should appear in there somewhere. So what's going on?
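
Before guessing, it helps to check what we actually sent. A quick diagnostic (a minimal sketch; the version string will differ on your machine) shows that, without any headers, requests announces itself with its own default User-Agent:

import requests
r = requests.get('https://www.baidu.com')
print(len(r.text))                      # the page is suspiciously short
print(r.request.headers['User-Agent'])  # requests' default UA, e.g. 'python-requests/2.22.0'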

Some of you have probably figured it out already: we never really simulated a user's request. If you crawl data without simulating a real user, the site will certainly restrict you. The fix is to add a headers parameter containing, at minimum, a User-Agent (UA). OK, let's find a UA. Don't just copy one off Baidu; do it yourself and, as the saying goes, you'll never want for food or clothing. Here's how to grab one with the Chrome or Firefox developer tools.

Chrome developer tools

Open a new tab, press F12, visit Baidu, open the Network panel, click any request, scroll down through the request headers, find the User-Agent, and copy it.

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.get('https://www.baidu.com', headers=headers)
print(r.status_code)
print(r.text)

 

Well~~~ now there is plenty of data. Scroll through it: everything is normal, and the full page is there. PS: don't believe it? You can write the response out to an HTML file yourself and open it in a browser, like this:
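
Here is a minimal sketch of that check (the file name baidu.html is arbitrary; we also let requests re-detect the encoding so the Chinese characters don't come out garbled):

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.get('https://www.baidu.com', headers=headers)
r.encoding = r.apparent_encoding  # guess the real charset from the content
with open('baidu.html', 'w', encoding=r.encoding) as f:
    f.write(r.text)  # now open baidu.html in your browser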

 

POST method

Just change get to post

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.post('https://www.baidu.com', headers=headers)
print(r.status_code)
print(r.text)

Run it. Generally, POST is used to submit form information, so next let's find a URL that accepts submitted data and send it a POST.
It's convenient to use my own interface here (PS: Django); just copy it. Pay attention to the code: data is the payload to POST, and it's passed to the post method through a new data parameter.

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
# the data to POST
data = {"info": "biu~~~ send post request"}
r = requests.post('http://dev.kdlapi.com/testproxy', headers=headers, data=data)  # note the new data parameter
print(r.status_code)
print(r.text)

 

The result: HTTP code 200, the POST body went through successfully, and the endpoint echoes back my IP information together with the posted data.
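
If you don't have an interface of your own, httpbin.org (the service already mentioned in the requests docstring) does the same kind of echo, and its JSON response is easy to inspect in code. A minimal sketch:

import requests
data = {"info": "biu~~~ send post request"}
r = requests.post('https://httpbin.org/post', data=data)
result = r.json()        # parse the echoed JSON response
print(result['form'])    # {'info': 'biu~~~ send post request'}
print(result['origin'])  # the IP address the request came from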

Using a proxy

Why use a proxy? Websites generally have blocking policies: if you crawl with your own IP and it gets banned, you can no longer reach the site at all. That's when you need proxy IPs. If a proxy gets banned, so be it; it's the proxy's IP that is blocked, not your local one.
Since we're using a proxy, we first need to find a proxy IP. PS: writing a proxy server myself is too much trouble; the key point is that I couldn't write one anyway, hahaha.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
# the data to POST
data = {"info": "biu~~~ send post request"}

# proxy information (provided by the Kuaidaili proxy service)
proxy = '115.203.28.25:16584'
proxies = {
    "http": "http://%(proxy)s/" % {'proxy': proxy},
    "https": "http://%(proxy)s/" % {'proxy': proxy}
}

r = requests.post('http://dev.kdlapi.com/testproxy', headers=headers, data=data, proxies=proxies)  # note the new proxies parameter
print(r.status_code)
print(r.text)

 

The only change is the added proxies parameter, which routes the request through the proxy IP.
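
Free or shared proxy IPs fail often, so in practice it's worth adding a timeout and some exception handling around the request. A minimal sketch (the proxy address is a placeholder; substitute a live one):

import requests

proxy = '115.203.28.25:16584'  # placeholder; use a working proxy IP here
proxies = {
    "http": "http://%s/" % proxy,
    "https": "http://%s/" % proxy,
}

try:
    r = requests.get('http://dev.kdlapi.com/testproxy', proxies=proxies, timeout=5)  # give up after 5 seconds
    print(r.status_code, r.text)
except requests.exceptions.ProxyError:
    print('the proxy refused or dropped the connection')
except requests.exceptions.Timeout:
    print('the request timed out -- try another proxy')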

 

Advanced learning
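
This article stops at single requests, but a natural next step is requests.Session, which keeps cookies and reuses connections across requests -- handy once you crawl the same site many times. A minimal sketch:

import requests

s = requests.Session()  # a Session persists headers/cookies and reuses TCP connections
s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'})
r1 = s.get('https://www.baidu.com')  # cookies set here...
r2 = s.get('https://www.baidu.com')  # ...are sent automatically here
print(r1.status_code, r2.status_code)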
