Using a Python 3 crawler to grab web pages and download a novel

Often I want to read a novel but can't find it anywhere online, and even when I do find it, there is no way to download it. If I could download it, of course, I could read it on my phone!

So the programmer's instinct kicked in: if I can't download it, I'll use a crawler to pull down every chapter and store them in a txt file. That way the whole novel comes down.

The book I crawled this time is "Hacker", a web novel that I believe many people have read. Let's take a look at the code.

The code is as follows:

import re
import urllib.request
import time

# Table-of-contents page of the novel
root = 'http://www.biquge.com.tw/3_3542/'
# Fake a browser User-Agent header so the site does not reject the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/62.0.3202.62 Safari/537.36'}

req = urllib.request.Request(url=root, headers=headers)

with urllib.request.urlopen(req, timeout=1) as response:
    # The site serves its pages encoded as GBK, so decode the bytes with 'gbk'
    htmls = response.read().decode('gbk')

# Match every chapter link in the table of contents, e.g.
# <a href="/3_3542/2020025.html">HK002 God gave me a chance to be a good man</a>
dir_req = re.compile(r'<a href="/3_3542/(\d+?\.html)">')
dirs = dir_req.findall(htmls)

# Compile the extraction patterns once, outside the loop
# Match the chapter title
title_req = re.compile(r'<h1>(.+?)</h1>')
# Match the chapter body; it contains line breaks, so pass re.S
content_req = re.compile(r'<div id="content">(.+?)</div>', re.S)

# Open the output file (written as UTF-8 so the decoded text is portable)
# and fetch each chapter into memory
with open('hacker.txt', 'w', encoding='utf-8') as f:
    for page in dirs:
        # Build the full URL of this chapter
        url = root + page
        # Sometimes a request gets no response and the program would hang
        # there, so time out after 0.6 seconds and raise an exception instead
        while True:
            try:
                request = urllib.request.Request(url=url, headers=headers)
                with urllib.request.urlopen(request, timeout=0.6) as response:
                    html = response.read().decode('gbk')
                    break
            except Exception:
                # On failure, wait 1.1 seconds and loop around to request the
                # link again; the break above exits once a fetch succeeds
                time.sleep(1.1)
                
        # Extract the chapter title
        title = title_req.findall(html)[0]
        # Extract the chapter body
        content_test = content_req.findall(html)[0]
        # Clean up the HTML entities and tags left in the text
        strc = content_test.replace('&nbsp;', ' ')
        content = strc.replace('<br />', '\n')
        print('Grab chapter>' + title)
        f.write(title + '\n')
        f.write(content + '\n\n')

And just like that, the whole novel is downloaded!

When it runs, the program prints a "Grab chapter>" line for each chapter as it is fetched (the original post shows a screenshot of this output).

If you hit the server with a large number of requests in quick succession, it may decide you are a bot and block your IP. Adding a random delay makes the program pause for a different interval each round; a small sketch follows.
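
This is a minimal sketch of that idea, meant to sit inside the chapter loop of the script above (the 0.5 to 2.0 second bounds are an arbitrary example, not from the original post):

import random
import time

# Sleep a random amount between requests so the access pattern
# looks less mechanical; tune the bounds to suit the site
time.sleep(random.uniform(0.5, 2.0))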

If the download is too slow, you can use multiple threads to fetch several chapters at once, as in the sketch below.
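
Here is a minimal sketch of that approach, assuming the root, headers, and dirs variables from the script above; fetch_chapter is a hypothetical helper, not part of the original code:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_chapter(page):
    # Download and decode one chapter page (retry logic omitted for
    # brevity; reuse the while/try loop from the main script if needed)
    url = root + page
    request = urllib.request.Request(url=url, headers=headers)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode('gbk')

# Fetch up to 8 chapters concurrently; map() yields results in input order,
# so the chapters still come out in sequence
with ThreadPoolExecutor(max_workers=8) as pool:
    chapters = list(pool.map(fetch_chapter, dirs))

Each downloaded page can then go through the same title and content extraction as before, and because map() preserves order, the chapters are still written to the file in the right sequence.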
