Python learning - Crawler skills


requests module

Everything a crawler extracts comes from the source code of the web page you request.

In essence, a crawler simulates a client's request, receives the server's response, and extracts the information the programmer needs from it
(in principle, whatever a browser can do, a crawler can do too)

Crawler classification:
General-purpose crawlers:
Search engine crawlers

Focused crawlers:
Site-specific crawlers

How search engines work:
Crawling web pages – storing data – preprocessing – providing search and ranking services

Types of requests sent to a website
GET
What the browser sends when you type a URL into the address bar

POST
What the browser sends when you submit a form

Crawl Baidu's source code and save it locally.

import requests

def main():
    url = "https://www.baidu.com"
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    responses = requests.get(url, headers=headers)  # Request with headers
    #print(responses.content)  # Check the source code Baidu returned
    with open("baidu.txt", "wb") as f:
        f.write(responses.content)

if __name__ == '__main__':
    main()

responses.status_code gets the status code
responses.headers gets the response headers
responses.url gets the URL of the response
responses.request.headers gets the request headers
responses.content gets the page source (as bytes)
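
A minimal sketch that exercises these attributes, reusing the Baidu request from above:

import requests

def main():
    url = "https://www.baidu.com"
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    responses = requests.get(url, headers=headers)
    print(responses.status_code)       # HTTP status code, e.g. 200
    print(responses.headers)           # response headers sent by the server
    print(responses.url)               # final URL of the response (after redirects)
    print(responses.request.headers)   # headers that were actually sent
    print(responses.content[:100])     # first 100 bytes of the page source

if __name__ == '__main__':
    main()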

Passing request parameters in the URL to perform a specific query.

import requests

def main():
    url = "https://www.baidu.com/s"
    #url = "https://www.baidu.com/s?wd={}".format("12306 ") 񖓿 if format is used, params is not needed

    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.10 Safari/537.35"}
    params = {"wd":"12306"} #The search key is 12306
    responses = requests.get(url,headers=headers,params=params) #With request parameters
    print(responses.request.url)
    with open("baidu.txt","wb")as f:
        f.write(responses.content)
    pass

if __name__ == '__main__':
    main()

Baidu tieba crawler

import requests

def main():
    tieba_name = input("Please enter the tieba (forum) name: ")
    url_temp = "http://tieba.baidu.com/f?kw="+tieba_name+"&ie=utf-8&pn={}"
    url_list = []  # List of page URLs to crawl
    # Crawl the first 2 pages; each page holds 50 posts, so pn advances by 50
    for i in range(2):
        url_list.append(url_temp.format(i*50))
    # Define the headers
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    # Request each URL in the list and save the page
    for i, url in enumerate(url_list):
        resp = requests.get(url, headers=headers)
        # Write each page to its own file so later pages do not overwrite earlier ones
        with open("tieba_page_{}.html".format(i + 1), "wb") as f:
            f.write(resp.content)

if __name__ == '__main__':
    main()

Send a POST request

import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    data = {"uname":"admin","passwd":"admin"} #Define the data of post
    resp = requests.post(url, headers=header, data=data)  # Send the POST data
    print(resp.content)

if __name__ == '__main__':
    main()

Anti-crawler

Counter-measures a crawler can use
Use a proxy to prevent your real address from being tracked (note that the requests parameter is named proxies, not proxys):

proxies = {"http": "http://1.2.3.4:9999"}
resp = requests.post(url, headers=header, data=data, proxies=proxies)
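
A complete sketch of a request sent through a proxy; the address 1.2.3.4:9999 is a placeholder and must be replaced with a working proxy:

import requests

def main():
    url = "https://www.baidu.com"
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    # Placeholder proxy; requests routes http and https traffic through the matching entry
    proxies = {"http": "http://1.2.3.4:9999",
               "https": "http://1.2.3.4:9999"}
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    print(resp.status_code)

if __name__ == '__main__':
    main()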

Anonymous proxy
The server knows you are using a proxy, but it does not know who you are

Distorting proxy
The server knows you are using a proxy, but it receives a fake IP address

High-anonymity (elite) proxy
The server cannot tell that you are using a proxy at all

Signals websites use to detect crawlers (a sketch that varies these headers follows):
the volume of traffic from a single IP over a period of time
cookie, User-Agent, headers, Referer
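
One common counter-measure is to vary these request headers. A minimal sketch, assuming a small hand-picked list of User-Agent strings (the list itself is only illustrative):

import random
import requests

# Illustrative pool of User-Agent strings; extend it with real browser strings as needed
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
]

def fetch(url):
    # Pick a random User-Agent and set a Referer so consecutive requests look less uniform
    headers = {"User-Agent": random.choice(USER_AGENTS),
               "Referer": "https://www.baidu.com"}
    return requests.get(url, headers=headers, timeout=10)

if __name__ == '__main__':
    print(fetch("https://www.baidu.com").status_code)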

Handling cookies

import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36","Cookie":"uname=admin"}
    resp = requests.get(url,headers=header)
    print(resp.content)
    pass

if __name__ == '__main__':
    main()

Getting cookies after a POST login

cookies = requests.utils.dict_from_cookiejar(resp.cookies)
print(cookies)
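
A fuller sketch, assuming the same demo login form as in the POST example above: submit the credentials, convert the returned cookie jar into a dict, then reuse it on a later request:

import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    data = {"uname":"admin","passwd":"admin"}
    resp = requests.post(url, headers=header, data=data)  # Log in with a POST request
    # Convert the CookieJar the server returned into a plain dict
    cookies = requests.utils.dict_from_cookiejar(resp.cookies)
    print(cookies)
    # Reuse the cookies on a follow-up request
    resp2 = requests.get(url, headers=header, cookies=cookies)
    print(resp2.status_code)

if __name__ == '__main__':
    main()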

import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    #data = {"uname":"admin","passwd":"admin"} #Define the data of post
    cookie = {"uname":"admin"}
    resp = requests.get(url,headers=header,cookies=cookie)
    print(resp.content)
    pass

if __name__ == '__main__':
    main()