Rookie crawler (2): Tired of typing code? Let's read a few jokes.

  • Hello, I'm the rookie again.
  • This time I read a case study by an expert on Zhihu who scraped Qiushibaike. I scraped the joke site waduanzi.com instead, and I'm sharing the code here.
  • Tired of coding? Scrape a few jokes: five minutes of jokes, two hours of code.
'''
    //Based on a tutorial by an expert on Zhihu

    //Code Author: Gao Jiale

'''
import re                         ##Import the re library
import requests                   ##Import requests
import time                       ##Import time
#from bs4 import BeautifulSoup    ##Import bs4 (not used here)
##Define a class that strips <br/> tags
class Tool():                     ##Helper class for cleaning up the HTML
    def replace(self,html):       ##Method with two parameters: self and the HTML text
        html = re.sub(re.compile('<br/>|<br />'),'',html)  ##re.sub replaces every match with ''; re.compile builds the regular expression
        return html

class Spider(object):                                      ##Define the crawler class
    ##Initialization method
    def __init__(self):                                     ##Initialization
        self.url = 'http://www.waduanzi.com/'               ##The base url is the home page; later pages are the home page plus a page number
        self.tool = Tool()                                  ##The helper that strips <br/> tags

    ##Define a method that sends a request
    def Request(self,getUrl):                               ##The parameter getUrl is the address to fetch
        html = requests.get(getUrl)                         ##html is the response for that address
        html_text = html.text                               ##html_text is the response body as text
        return html_text                                    ##Return the text

    ##Define a method that extracts the fields we want
    def Obtain(self,obtain):                                ##Two parameters: self and the url to fetch
        html_text = self.Request(obtain)                    ##html_text is the page text returned by the previous method
        ##The regular expression is defined below. OK, I admit I'm not fluent in regular expressions; it took me a while to get this one to match. I built it from the page source as shown in Google Chrome; the attributes may look different in other browsers.
        regular = re.compile('<div.*?post-box.*?post-author.*?<img target.*?>.*?<a.*?_blank".*?>(.*?)</a>.*?item-detail.*?item-title.*?item-content">(.*?)</div>.*?item-toolbar.*?fleft.*?<a.*?>(.*?)</a>.*?fleft.*?<a.*?>(.*?)</a>.*?</li>',re.S)
        items = re.findall(regular,html_text)               ##findall returns every match as a list, and each item is a tuple
        number = 1                                          ##number is the counter, starting at 1
        for item in items:                                  ##It's a list, so walk through it with for
            print('Joke #%d\nAuthor: %s\nText: %s\nLikes: %s\nDislikes: %s'%(number,item[0],item[1],item[2],item[3]))##Print one joke in this format
            print()                                         ##Print a blank line between jokes
            number+=1                                       ##Counter + 1
        return items                                        ##Return the list of tuples

    ##Save to a file
    def save(self,data,name):                               ##The save method takes the data and a name
        filName = 'page'+name+'.txt'                        ##Build the file name
        f = open(filName,'wb')                              ##Open the file in binary write mode
        f.write(data.encode('utf-8'))                       ##Write the data encoded as utf-8
        f.close()                                           ##Close the file; good kids close what they open

    ##Fetch a page, then save it
    def onesave(self,url,save):                             ##Fetch the url and save the result
        html = self.Obtain(url)                             ##html is the list returned by the previous method
        self.save(str(html),str(save))                      ##The first str turns the list into a string; the second is needed because save joins the name with +

    ##Which pages to fetch
    def page(self,star,end):                                ##Two parameters: the first page and the last page
        if star == 1:                                       ##If the start is 1
            print('Reading page 1')                         ##Reading the first page
            self.onesave(self.url,star)                     ##Fetch and save the first page; its url is the home page set in __init__
            print('Finished page 1')                        ##First page done
            number = 2                                      ##number is the counter; page 1 is already done, so start from 2
            for i in range(number,end+1):                   ##Loop from page 2 up to end+1, because range includes the start but excludes the stop
                print('Reading page %s'%i)                  ##Reading page i
                page = self.url+'page/'+str(i)              ##Later pages are home page + page/ + number (self.url already ends with /)
                self.onesave(page,i)                        ##Fetch and save
                print('Finished page %s'%i)                 ##Page i done
                time.sleep(2)                               ##Wait a bit. Crawlers should crawl politely
                number+=1                                   ##Counter + 1
            if number == end+1:                             ##After the loop the counter should equal end+1 if nothing went wrong
                print('All pages loaded')                   ##Done
                return False                                ##Return False
        elif star>1:                                        ##If the start is greater than 1
            number = star                                   ##Set the counter to the start page
            for i in range(star,end+1):                     ##Loop from the start page through the end page
                print('Reading page %s'%i)                  ##Reading page i
                page = self.url+'page/'+str(i)              ##Build the page url (self.url already ends with /)
                self.onesave(page,i)                        ##Fetch and save
                print('Finished page %s'%i)                 ##Page i done
                time.sleep(2)                               ##Politeness, politeness: a crawler should have manners
                number+=1                                   ##A counter that never increments is no counter
            if number == end+1:                             ##After the loop the counter should equal end+1
                print('All pages loaded')                   ##Done
                return False                                ##Return False


duqu = Spider()                                             ##Create an instance
duqu.page(star=int(input('Please enter the first page to fetch: ')),end=int(input('Please enter the last page to fetch: ')))  ##Ask for the start and end pages, then crawl them
  • I was afraid of messing up the indentation, so I just copied and pasted my code. Yes, I'm that lazy. Go ahead, hit me.
  • That's the complete code.
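If you're unsure what the `re.findall` call in `Obtain` hands back, here is a small offline sketch. The HTML snippet and the shortened pattern are made up for illustration and are not the real waduanzi.com markup. The only point is that with several capture groups and `re.S`, each match comes back as one tuple.

```python
import re

# Hypothetical, simplified markup; the real page is far more complex.
SAMPLE_HTML = '''
<div class="post"><a class="author">Alice</a>
<div class="item-content">First line<br/>second line</div>
<a class="up">12</a><a class="down">3</a></div>
<div class="post"><a class="author">Bob</a>
<div class="item-content">A one-liner</div>
<a class="up">7</a><a class="down">0</a></div>
'''

# re.S lets '.' match newlines, so one pattern can span several lines.
PATTERN = re.compile(
    r'class="author">(.*?)</a>.*?'
    r'item-content">(.*?)</div>.*?'
    r'class="up">(.*?)</a>.*?'
    r'class="down">(.*?)</a>',
    re.S,
)

def extract(html):
    """Return a list of (author, text, likes, dislikes) tuples."""
    rows = []
    for author, text, likes, dislikes in PATTERN.findall(html):
        # Replace <br/> tags with newlines (the post's Tool.replace deletes them instead).
        text = re.sub(r'<br\s*/?>', '\n', text)
        rows.append((author, text, likes, dislikes))
    return rows

items = extract(SAMPLE_HTML)
for number, item in enumerate(items, start=1):
    print('Joke #%d\nAuthor: %s\nText: %s\nLikes: %s\nDislikes: %s\n' % ((number,) + item))
```

Because every item is a plain tuple, it can be unpacked by name in the `for` loop instead of indexing with `[0]`, `[1]` and so on.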

  • OK, that's my code. I've only just learned classes, so I used one.
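One thing that could probably be nicer: the `save` method writes `str()` of the whole list, so the file ends up full of brackets and quotes. A possible alternative is sketched below, assuming each item is the four-field tuple that `Obtain` prints. The function name `save_items` and the sample data are invented for this example.

```python
import os
import tempfile

def save_items(items, path):
    """Write each (author, text, likes, dislikes) tuple as a readable block."""
    # 'with' closes the file even if a write fails part-way through.
    with open(path, 'w', encoding='utf-8') as f:
        for number, (author, text, likes, dislikes) in enumerate(items, start=1):
            f.write('Joke #%d\n' % number)
            f.write('Author: %s\n' % author)
            f.write('Text: %s\n' % text)
            f.write('Likes: %s  Dislikes: %s\n\n' % (likes, dislikes))
    return path

# Made-up sample data standing in for what Obtain returns.
sample = [('Alice', 'Why did the crawler sleep? To be polite.', '12', '3')]
out_path = save_items(sample, os.path.join(tempfile.mkdtemp(), 'page1.txt'))

# Read the file back to check what was written.
with open(out_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

Opening the file in text mode with `encoding='utf-8'` also removes the need to call `.encode('utf-8')` by hand.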

  • Maybe because I kept testing over and over, the site seems to have, emmm, blocked me. I didn't use a proxy; I'm not familiar with proxies yet, so I skipped that.
  • Crawlers, be polite, be polite.
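A minimal sketch of what "polite" could look like in code: send a User-Agent header, set a timeout, and pause after every request. I can't say whether a missing header is why the site blocked me, so treat this as something to try, not a fix. The User-Agent string and the function names are invented for the example, and it uses the standard library's `urllib.request` instead of `requests` so it runs without extra installs.

```python
import time
import urllib.request

BASE_URL = 'http://www.waduanzi.com/'
# Hypothetical User-Agent string, chosen only for illustration.
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; study-spider/0.1)'}

def page_url(page):
    """Page 1 is the home page; later pages are BASE_URL + 'page/<n>'."""
    if page == 1:
        return BASE_URL
    return BASE_URL + 'page/' + str(page)

def build_request(page):
    """Create a request object that carries the User-Agent header."""
    return urllib.request.Request(page_url(page), headers=HEADERS)

def fetch(page, delay=2, timeout=10):
    """Download one page as text, then pause so the site is not hammered."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    time.sleep(delay)
    return html

# Nothing is downloaded here; this only shows the urls that would be requested.
for n in range(1, 4):
    print(page_url(n))
```

With `requests`, the same idea is `requests.get(url, headers=HEADERS, timeout=10)`.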
Well, that's it for this time. Once I've learned something new I'll do the next case. See you next time, bye bye!

Tags: Google

Posted on Sun, 09 Feb 2020 10:33:46 -0800 by n000bie