Scraping a dynamically loaded profile page

Preparation

Language: Python
Required libraries: requests, re, math, os

Get dynamic links

Open the profile home page (Firefox or Chrome is recommended), right-click and choose Inspect Element, switch to the Network tab, then scroll the page until new content loads, and find the request link and request headers as shown in the figure below.

Get the load link

Link address:

https://www.jianshu.com/u/d2121d5ecf94?order_by=shared_at&page=2

The last parameter, page=2, is the page number. The total number of pages will be calculated later from the total number of articles.
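The profile feed returns roughly nine articles per load, so the page count can be derived from the article count with math.ceil (the per-page size of 9 is the observed value used in the loop further down):

```python
import math

count = 22     # total number of articles on the profile
PER_PAGE = 9   # articles returned per dynamic load (observed value)

pages = math.ceil(count / PER_PAGE)
print(pages)   # 22 articles need 3 pages
```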

The request headers are stored as a dictionary as follows:

headers = {
'Accept':'text/html, */*; q=0.01', 
'Accept-Encoding':'gzip, deflate, br', 
'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2', 
'Connection':'keep-alive', 
'Cookie':'Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515486399,1515486408,1515505449,1515550122; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%228748118%22%2C%22%24device_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_utm_source%22%3A%22weixin%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%2C%22first_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%7D; remember_user_token=W1s4NzQ4MTE4XSwiJDJhJDExJGJLRERsWHFzY0N5U2lPNHFIN1B4Wk8iLCIxNTE1MzI1OTM3LjkzMzQ4NzciXQ%3D%3D--8a5ac5ef497afd64b897cb834f68d7c8721e2a58; read_mode=day; default_font=font2; locale=zh-CN; _m7e_session=d6bfc751747df6b4b7a2d40d38274bf9; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515550507', 
'Host':'www.jianshu.com', 
'Referer':'https://www.jianshu.com/u/d2121d5ecf94', 
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0',  # any normal browser UA string works here
'X-CSRF-Token':'maoGbanLGlZUl3lesp7UfB+2qZGuaLopoDeb8kRGecBdjtsdzH+NOJ2bvp1JLfaPyoDnbh4NS7vVjUHCG0D/6Q==', 
'X-INFINITESCROLL':'true', 
'X-Requested-With':'XMLHttpRequest'
}

Crawling idea

Get the dynamic load link, calculate the number of pages from the number of articles, then fetch each page in a requests loop and append everything to the file temp.txt. Finally, extract the needed information with re and store it.

At the end of this example, the results are stored in the form of a list.
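Concretely, each list entry is a 4-tuple of (release time, article URL, title, summary); the values below are purely illustrative:

```python
# A hypothetical entry in the result list -- every value is illustrative only
entry = (
    '2018-01-09 18:56:12',                     # release time (date + clock)
    'https://www.jianshu.com/p/0123456789ab',  # full article URL
    'Example article title',                   # title text
    'Example article summary ...',             # abstract text
)
print(len(entry))  # 4 fields per article
```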

The detailed code is as follows:

#coding=utf-8

import os
import requests
import re
import math

# Total articles
count=22

# Delete temporary files
if os.path.isfile('temp.txt'):  
    os.remove('temp.txt')

# Delete old files
if os.path.isfile('article.py'):
    os.remove('article.py')

# Custom headers
headers = {
'Accept':'text/html, */*; q=0.01', 
'Accept-Encoding':'gzip, deflate, br', 
'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2', 
'Connection':'keep-alive', 
'Cookie':'Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515486399,1515486408,1515505449,1515550122; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%228748118%22%2C%22%24device_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_utm_source%22%3A%22weixin%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%2C%22first_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%7D; remember_user_token=W1s4NzQ4MTE4XSwiJDJhJDExJGJLRERsWHFzY0N5U2lPNHFIN1B4Wk8iLCIxNTE1MzI1OTM3LjkzMzQ4NzciXQ%3D%3D--8a5ac5ef497afd64b897cb834f68d7c8721e2a58; read_mode=day; default_font=font2; locale=zh-CN; _m7e_session=d6bfc751747df6b4b7a2d40d38274bf9; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515550507', 
'Host':'www.jianshu.com', 
'Referer':'https://www.jianshu.com/u/d2121d5ecf94', 
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0',  # any normal browser UA string works here
'X-CSRF-Token':'maoGbanLGlZUl3lesp7UfB+2qZGuaLopoDeb8kRGecBdjtsdzH+NOJ2bvp1JLfaPyoDnbh4NS7vVjUHCG0D/6Q==', 
'X-INFINITESCROLL':'true', 
'X-Requested-With':'XMLHttpRequest'
}

# Get home page content
for i in range(1,math.ceil(count/9)+1):  # math.ceil(count/9) is the number of calculated pages
    url = 'https://www.jianshu.com/u/d2121d5ecf94?order_by=shared_at&page=%d' % i  # link address when loading
    response = requests.get(url,headers=headers)  # request
    text = response.content.decode("utf-8") 

    with open('temp.txt', 'a', encoding='utf-8') as file:
        file.write(text)  # Append the result to temp.txt; mode 'a' keeps earlier pages

with open('temp.txt', 'r', encoding='utf-8') as f:  # Read temp.txt back in
    text = f.read()

time = re.findall(r'<span class="time" data-shared-at="(.*?)T(.*?)\+08:00">', text)  # Get release time
title = re.findall(r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>', text)  # Get title
content = re.findall(r'<p class="abstract">\n      (.*?)\n    </p>', text, re.S)  # Get summary

result = []
for i in range(len(time)):  # Add the acquired content to the list
    result.append((time[i][0]+' '+time[i][1],'https://www.jianshu.com'+title[i][0],title[i][1],content[i]))

with open('result.txt', 'w+',encoding='UTF-8') as file:  # Store results in result.txt
    file.write(str(result))

Running the code above extracts the release time, article title, and article summary. If you need other information, such as the avatar, author, read count, comment count, or like count, you can extract it in the same way.
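As a sketch of that idea, the fragment below pulls one more field out with re.findall; the class name "read-count" and the surrounding markup are assumptions for illustration, not the page's real structure:

```python
import re

# Illustrative HTML -- the real class names on the page may differ
html = '''
<a class="read-count">Read 123</a>
<a class="read-count">Read 456</a>
'''

reads = re.findall(r'<a class="read-count">Read (\d+)</a>', html)
print(reads)  # ['123', '456']
```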

This article is for technical exchange only; please do not use it for any other purpose. All rights reserved; please contact the author before reprinting. Thank you!


Posted on Sat, 02 May 2020 19:48:29 -0700 by squimmy