Python crawler tutorial: crawling paid materials for free from baotu.com [source code attached]

We all know that it is easy to find a large number of design materials, but buying them is expensive. Today I'll show you how to use a Python crawler to fetch these materials and save them locally!

To capture the content of a website, we need to consider the following aspects:

1 - How do we get the link to the next page of the website?

2 - Is the target resource (video, picture, etc.) static or dynamic?

3 - What is the data structure format of the website?
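
For question 2, one quick check is whether the resource URL already appears in the HTML the server returns, before any JavaScript runs. If it does, the resource can be treated as static and a plain HTTP request is enough. Here is a minimal sketch of that check using only the standard library. The names VideoSrcCollector and find_static_video_sources and the sample markup are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class VideoSrcCollector(HTMLParser):
    """Collect the src attribute of every <video> tag in raw HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "video":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_static_video_sources(raw_html):
    """Return video URLs present in the HTML as served (no JavaScript run)."""
    collector = VideoSrcCollector()
    collector.feed(raw_html)
    return collector.sources


# Hypothetical sample markup, not copied from the real site.
sample_html = '<div class="video-play"><video src="//example.com/a.mp4"></video></div>'
print(find_static_video_sources(sample_html))  # ['//example.com/a.mp4']
```

If the list comes back empty for a page that clearly plays a video in the browser, the resource is probably loaded dynamically and this simple approach will not be enough.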

The source code is as follows:


import requests
from lxml import etree
import threading
 
 
class Spider(object):
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
        self.offset = 1
 
    def start_work(self, url):
        print("Crawling page %d......" % self.offset)
        self.offset += 1
        response = requests.get(url=url, headers=self.headers)
        html = response.content.decode()
        html = etree.HTML(html)

        video_src = html.xpath('//div[@class="video-play"]/video/@src')
        video_title = html.xpath('//span[@class="video-title"]/text()')
        # Save this page's videos before looking for the next page,
        # so the last page is not skipped
        self.write_file(video_src, video_title)

        # The href is protocol-relative, so the scheme is added below
        next_href = html.xpath('//a[@class="next"]/@href')
        # Crawling is finished when there is no next-page link
        if not next_href or not next_href[0]:
            return
        self.start_work("http:" + next_href[0])
 
    def write_file(self, video_src, video_title):
        for src, title in zip(video_src, video_title):
            response = requests.get("http:"+ src, headers=self.headers)
            file_name = title + ".mp4"
            file_name = "".join(file_name.split("/"))
            print("Downloading %s" % file_name)
            with open('E://python//demo//mp4//'+file_name, "wb") as f:
                f.write(response.content)
 
if __name__ == "__main__":
    spider = Spider()
    for i in range(0,3):
        # spider.start_work(url="https://ibaotu.com/shipin/7-0-0-0-"+ str(i) +"-1.html")
        t = threading.Thread(target=spider.start_work, args=("https://ibaotu.com/shipin/7-0-0-0-"+ str(i) +"-1.html",))
        t.start()
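
The code above builds absolute URLs by prepending "http:" and cleans file names by dropping "/". As a hedged alternative, here is a small sketch of the same two steps using the standard library. The helper names absolute_url and safe_file_name are my own, the example.com URL is a placeholder, and replacing forbidden characters with "_" is a different choice from the original, which simply removes the slash.

```python
import re
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve a relative or protocol-relative href against the page URL."""
    return urljoin(page_url, href)


def safe_file_name(title, extension=".mp4"):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a fixed name if the title is empty after cleaning
    return (cleaned or "untitled") + extension


page = "https://ibaotu.com/shipin/7-0-0-0-0-1.html"
print(absolute_url(page, "//example.com/video/a.mp4"))  # https://example.com/video/a.mp4
print(safe_file_name("intro/outro: part 1"))            # intro_outro_ part 1.mp4
```

With urljoin, the scheme comes from the page being crawled, so the same code works whether the site is served over http or https.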

Tags: Programming Python Windows

Posted on Sat, 30 May 2020 07:38:42 -0700 by blogger3