Web crawler project practice: crawling all the articles on Qiushibaike (the Encyclopedia of Embarrassing Events)

Project analysis

1. First, the site we are going to crawl is Qiushibaike, the Encyclopedia of Embarrassing Events (http://www.qiushibaike.com/). View the page source of an article to find the pattern in the content we want to extract. Here is part of the markup I extracted:

<div class="content">
<span>
When the son was three years old, he often followed his grandmother, who opened a chicken farm and raised three or four thousand chickens. The happiest thing for the child every day was to go to the chicken house with his grandmother to pick up eggs, pick up the big wooden box, pull it out, and then put it into the egg holder for packing.<br/>Instead, I wiped and boxed the remaining good eggs one by one, and left all the worse eggs I could eat for one.<br/>
...
</span>
<span class="contentforall">View Full Text</span>
</div>

<a href="/article/122710686" target="_blank" class="contentHerf" onclick="_hmt.push(['_trackEvent','web-list-content','chick'])">
<div class="content">
<span>
The unit arranges some of our managers to work first!<br/>On this day, the leader suddenly checked the work and asked to wear a good mask to welcome!<br/>At that time, I was sorting out the documents. I couldn't find the mask I just removed in a hurry. In a hurry, I grabbed a bag of masks on the desk of my beautiful colleague and ran out!<br/>When I was about 20 meters away from the leaders, I quickly took out my mask and prepared to put it on. When I took it out, I found it was a sanitary napkin!<br/>Ouch, my face!
</span>
</div>

From the markup above and the way the page is laid out, we can conclude that the text of every article sits inside a <div class="content"> element, in its first <span> child.
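
As a minimal sketch of that conclusion, the snippet below applies a regular expression to a shortened stand-in for the markup above and pulls out the text of each content block. The `sample_html` string and the `extract_articles` helper are illustrative names, not part of the site or of the script later in this post.

```python
import re

# Shortened, hypothetical stand-in for the page source shown above.
sample_html = """
<div class="content">
<span>
First story.<br/>Second line.
</span>
<span class="contentforall">View Full Text</span>
</div>
<a href="/article/122710686" target="_blank" class="contentHerf">
<div class="content">
<span>
Second story.
</span>
</div>
</a>
"""

# re.S lets "." match newlines, since each article spans several lines.
ARTICLE_RE = re.compile(
    r'<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S
)


def extract_articles(html):
    """Return the stripped contents of the first <span> in each content div."""
    return [text.strip() for text in ARTICLE_RE.findall(html)]


articles = extract_articles(sample_html)
print(len(articles))
for article in articles:
    print(article)
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first opening tag to the last closing tag and merge all the articles into one match.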

2. Browsing the site, we can see that the listing is split across many pages. To crawl all the content we also have to move from page to page, so we need to work out how the page URLs are built.
Page 2: https://www.qiushibaike.com/hot/page/2/
Page 3: https://www.qiushibaike.com/hot/page/3/
Page 4: https://www.qiushibaike.com/hot/page/4/
The pattern is clear: only the page number at the end of the URL changes.
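
A short sketch of that URL pattern, assuming it also holds for page numbers other than the three listed. `BASE_URL` and `page_urls` are hypothetical names used only for illustration.

```python
# Template for a listing page; "{}" is replaced by the page number.
BASE_URL = "https://www.qiushibaike.com/hot/page/{}/"


def page_urls(first_page, last_page):
    """Build the listing URLs for first_page..last_page, inclusive."""
    return [BASE_URL.format(n) for n in range(first_page, last_page + 1)]


for url in page_urls(2, 4):
    print(url)
```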

3. If the explanation so far is unclear, let's look at the code. If you don't understand the code, see my previous article, Introduction to the urllib module.

import random
import re
import urllib.request

# Pool of User-Agent strings to choose from
uapools = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",  # 360 browser
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36",  # Google Chrome
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
]
# Install an opener that sends a randomly chosen User-Agent
def UA():
    # Create an opener object with urllib.request.build_opener()
    opener = urllib.request.build_opener()
    # Pick a random User-Agent from uapools on each call
    thisua = random.choice(uapools)
    # Headers are stored as (name, value) tuples
    ua = ("User-Agent", thisua)
    # Replace the opener's default headers with ours
    opener.addheaders = [ua]
    # Install the opener globally so that urllib.request.urlopen() uses it
    urllib.request.install_opener(opener)
    print("Current User-Agent: " + str(thisua))

total = 0
# Loop over pages 1 through 13
for page in range(1, 14):
    url = "https://www.qiushibaike.com/hot/page/" + str(page) + "/"
    UA()
    # Now we can fetch the content of the page
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    # re.S lets "." match newlines, because each article spans several lines
    pattern = r'<div class="content">.*?<span>(.*?)</span>.*?</div>'
    res = re.compile(pattern, re.S).findall(data)
    for j, article in enumerate(res, start=1):
        print(article)
        print("-------------Page " + str(page) + ", item " + str(j) + "------------")
        total = total + 1
print("Total: " + str(total) + " articles")
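
The script above installs a new global opener on every pass through the loop. As an alternative sketch, assuming a per-request header is all that is needed, the User-Agent can be attached to a `urllib.request.Request` object instead. `UA_POOL`, `build_request` and `fetch` are hypothetical names, and the pool is shortened here for illustration.

```python
import random
import urllib.request

# Shortened pool for illustration; any of the strings listed above would do.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
]


def build_request(url):
    """Return a Request carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(UA_POOL)}
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=10):
    """Download a page and decode it as UTF-8, ignoring undecodable bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", "ignore")


# Building the request does not touch the network.
request = build_request("https://www.qiushibaike.com/hot/page/1/")
print(request.get_header("User-agent"))
```

With this approach no global state changes between requests, and the `with` block closes the connection as soon as the page has been read.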

Let's take a look at the result of running it: as of February 7, 2020, the crawl found 261 articles across the 13 pages of the Encyclopedia of Embarrassing Events.
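
Because the regular expression captures everything inside the <span>, the printed articles still carry <br/> tags and surrounding whitespace, as in the sample markup earlier. Below is a minimal clean-up sketch, assuming plain line breaks are wanted in the output and that HTML entities such as &amp; may appear in the text. The `clean_article` helper is a hypothetical name and is not part of the script above.

```python
import html
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_RE = re.compile(r"<br\s*/?>", re.I)


def clean_article(raw):
    """Turn <br> tags into newlines, unescape HTML entities, trim whitespace."""
    text = BR_RE.sub("\n", raw)
    text = html.unescape(text)
    return text.strip()


print(clean_article("\nouch,<br/>my face! &amp; more\n"))
```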

If anything is unclear, or if I have made a mistake somewhere, please point it out in the comments. Thank you!



Posted on Fri, 07 Feb 2020 01:34:42 -0800 by Pie