Tutorial for novice crawlers (Python proxy IPs)

 

Preface

A Python crawler typically goes through the cycle of crawling, getting blocked, and working around the block. Further down the road you also have to keep optimizing around a site's anti-crawler measures, and getting past those restrictions is a step-by-step process. In the early stages, adding request headers and an IP proxy already solves many problems.
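In practice, "adding headers" mostly means sending a browser-like User-Agent string with every request. A minimal sketch with the requests library (the URL is only a placeholder):

import requests

# A browser-like User-Agent makes the request look less like a script
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
response = requests.get('https://example.com', headers=header)
print(response.status_code)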


When I was crawling Douban book reviews, I made too many requests and my IP was blocked outright. That pushed me into looking at proxy IPs (I had no idea what was going on at the time and almost gave up...). Below is how I crawled data through proxy IPs; please point out any shortcomings.

 

Problem

This is what it looked like when my IP got blocked. At first I thought the problem was in my code.

 

Train of thought:

From around the Internet I gathered some information on crawler proxy IPs and ended up with the following plan:

  1. Crawl a batch of proxy IP addresses and filter out the ones that don't work
  2. Put the chosen IP into the proxies parameter of requests (see the short sketch after this list)
  3. Keep crawling
  4. Call it a day
  5. OK, that's all just talk; we all know the theory, so straight to the code
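A minimal sketch of step 2, assuming you already have a working proxy (the address below is made up; substitute one of the crawled IPs):

import requests

# requests routes the request through whatever proxy the proxies dict points at
proxies = {
    'http': 'http://123.45.67.89:8080',
    'https': 'https://123.45.67.89:8080',
}
response = requests.get('https://example.com', proxies=proxies, timeout=3)
print(response.status_code)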

With the plan in hand, on to the setup.

Running environment

Python 3.7, Pycharm

Get this environment set up first.

Preparation

  1. A website that lists proxy IPs to crawl (domestic high-anonymity proxies)
  2. A website for verifying your IP address
  3. The crawler script whose IP got blocked earlier

Pick the above websites according to your own situation.
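As a quick check that a proxy is actually being applied, any "what is my IP" page works. A small sketch using httpbin.org/ip (my own pick for illustration; the script below uses ip.cn as its check site):

import requests

# httpbin.org/ip returns, as JSON, the IP address the remote server saw;
# run it once without and once with proxies= to confirm the difference
print(requests.get('https://httpbin.org/ip', timeout=5).json())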

Complete code for crawling proxy IPs

PS: grabbing the IPs and port numbers with bs4 is straightforward; the only extra bit is some logic to filter out the proxies that don't work.

The key points are commented.

import requests
from bs4 import BeautifulSoup
import json


class GetIp(object):
    """Grab proxy IPs"""

    def __init__(self):
        """Initialize variables"""
        self.url = 'http://www.xicidaili.com/nn/'
        self.check_url = 'https://www.ip.cn/'
        self.ip_list = []

    @staticmethod
    def get_html(url):
        """Request the HTML of a page"""
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
        try:
            request = requests.get(url=url, headers=header)
            request.encoding = 'utf-8'
            html = request.text
            return html
        except Exception:
            return ''

    def get_available_ip(self, ip_address, ip_port):
        """Test whether a proxy IP is usable"""
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
        }
        ip_url_next = '://' + ip_address + ':' + ip_port
        proxies = {'http': 'http' + ip_url_next, 'https': 'https' + ip_url_next}
        try:
            r = requests.get(self.check_url, headers=header, proxies=proxies, timeout=3)
            html = r.text
        except Exception:
            print('fail-%s' % ip_address)
        else:
            print('success-%s' % ip_address)
            soup = BeautifulSoup(html, 'lxml')
            div = soup.find(class_='well')
            if div:
                print(div.text)
            ip_info = {'address': ip_address, 'port': ip_port}
            self.ip_list.append(ip_info)

    def main(self):
        """Main method"""
        web_html = self.get_html(self.url)
        soup = BeautifulSoup(web_html, 'lxml')
        ip_list = soup.find(id='ip_list').find_all('tr')
        for ip_info in ip_list:
            td_list = ip_info.find_all('td')
            if len(td_list) > 0:
                ip_address = td_list[1].text
                ip_port = td_list[2].text
                # Check whether the IP address is valid
                self.get_available_ip(ip_address, ip_port)
        # Write the valid proxies to a file
        with open('ip.txt', 'w') as file:
            json.dump(self.ip_list, file)
        print(self.ip_list)


# Program main entrance
if __name__ == '__main__':
    get_ip = GetIp()
    get_ip.main()
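After a run, ip.txt holds the surviving proxies as a JSON list of dicts with 'address' and 'port' keys. A quick sketch to inspect it (the actual entries depend on what the proxy site listed at the time):

import json

# Load the proxies saved by GetIp.main() and print them one per line
with open('ip.txt', 'r') as file:
    for ip_info in json.load(file):
        print(ip_info['address'], ip_info['port'])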

Complete code for using the proxies

PS: the idea is to crawl through a randomly chosen proxy IP and decide from the response status whether that IP is still usable. Run the GetIp script above first so that ip.txt exists.

Why is this extra check needed?

Mainly because even though the list has been filtered, a proxy that worked during filtering may no longer work when you actually crawl, so you have to check again at request time.

from bs4 import BeautifulSoup
import datetime
import requests
import json
import random

# Index of the proxy currently in use; -1 means "pick a new random one"
ip_random = -1
article_tag_list = []
article_type_list = []


def get_html(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
    }
    global ip_random
    ip_rand, proxies = get_proxie(ip_random)
    print(proxies)
    try:
        request = requests.get(url=url, headers=header, proxies=proxies, timeout=3)
    except Exception:
        request_status = 500
    else:
        request_status = request.status_code
    print(request_status)
    # Keep switching to a new random proxy until one request succeeds
    while request_status != 200:
        ip_random = -1
        ip_rand, proxies = get_proxie(ip_random)
        print(proxies)
        try:
            request = requests.get(url=url, headers=header, proxies=proxies, timeout=3)
        except Exception:
            request_status = 500
        else:
            request_status = request.status_code
        print(request_status)
    # Remember the proxy that worked and reuse it for the next request
    ip_random = ip_rand
    request.encoding = 'gbk'
    html = request.content
    print(html)
    return html


def get_proxie(random_number):
    """Read ip.txt and build a proxies dict for requests"""
    with open('ip.txt', 'r') as file:
        ip_list = json.load(file)
        if random_number == -1:
            random_number = random.randint(0, len(ip_list) - 1)
        ip_info = ip_list[random_number]
        ip_url_next = '://' + ip_info['address'] + ':' + ip_info['port']
        proxies = {'http': 'http' + ip_url_next, 'https': 'https' + ip_url_next}
        return random_number, proxies


# Program main entrance
if __name__ == '__main__':
    """Only crawls the first pages of each book tag, sorted by rating"""
    start_time = datetime.datetime.now()
    url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    base_url = 'https://book.douban.com/tag/'
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    article_tag_list = soup.find_all(class_='tag-content-wrapper')
    tagCol_list = soup.find_all(class_='tagCol')

    for table in tagCol_list:
        """Collect the tag names from each tag table"""
        sub_type_list = []
        a = table.find_all('a')
        for book_type in a:
            sub_type_list.append(book_type.text)
        article_type_list.append(sub_type_list)

    for sub in article_type_list:
        for sub1 in sub:
            title = '==============' + sub1 + '=============='
            print(title)
            print(base_url + sub1 + '?start=0' + '&type=S')
            with open('book.text', 'a', encoding='utf-8') as f:
                f.write('\n' + title + '\n')
                f.write(url + '\n')
            for start in range(0, 2):
                # start * 20 is the paging offset: 0, 20, 40, ...
                # type=S sorts the list by rating
                url = base_url + sub1 + '?start=%s' % (start * 20) + '&type=S'
                html = get_html(url)
                soup = BeautifulSoup(html, 'lxml')
                li = soup.find_all(class_='subject-item')
                for div in li:
                    info = div.find(class_='info').find('a')
                    img = div.find(class_='pic').find('img')
                    content = 'Title:<%s>' % info['title'] + ' Book picture:' + img['src'] + '\n'
                    print(content)
                    with open('book.text', 'a', encoding='utf-8') as f:
                        f.write(content)

    end_time = datetime.datetime.now()
    print('time consuming: ', (end_time - start_time).seconds)

Why choose domestic high-anonymity proxies?

A high-anonymity (elite) proxy does not forward your real IP to the target site, so the site only ever sees the proxy's address; and for a mainland-China target like Douban, domestic proxies tend to be faster and more stable than overseas ones.

 

Summary

With this simple use of proxy IPs you can basically cope with getting blocked while crawling, and since your own IP is never exposed, it is indirectly protected as well.

If you have other or faster methods, you are welcome to share and discuss them. Thank you.
