How to use python to crawl NetEase news

Preface

The text and pictures in this article come from the Internet and are intended for study and communication only, not for any commercial use. The copyright belongs to the original author. If you have any questions, please contact us promptly.

Author: LSGOGroup


After learning basic Python syntax, I became very interested in crawlers. Without further ado, today I crawl NetEase News and put that interest into practice.

Open NetEase News and you can see that the news is divided into several sections:

This time we select the Domestic section and crawl its articles.

1. Prepare

  • Environment: Python 3

  • IDE: PyCharm

  • Install Selenium plus the WebDriver for the browser you want to drive (a minimal install check is sketched after the download list below)

Download Address

  • chromedriver: https://code.google.com/p/chromedriver/downloads/list

  • Firefox driver geckodriver: https://github.com/mozilla/geckodriver/releases/

  • IE Driver: http://www.nuget.org/packages/Selenium.WebDriver.IEDriver/
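
Selenium itself comes from PyPI (pip install selenium). As a minimal check that the library and driver are set up, something like the following should open and close Firefox; this assumes geckodriver has been unpacked and placed on your PATH:

from selenium import webdriver

browser = webdriver.Firefox()                   # requires geckodriver on the PATH
browser.get('http://news.163.com/domestic/')
print(browser.title)                            # quick sanity check that the driver works
browser.quit()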

Learn about Web Pages

A web page can be gorgeous and colorful, like a watercolor painting. To crawl its data, you first need to know how the data you want is presented, much as in learning to paint you need to know, before you start, whether the picture is drawn with a pencil or a watercolor brush. There may be many kinds of drawing tools, but on a web page data is presented in only two ways:

  • HTML

  • JSON

HTML is a markup language used to describe web pages.

JSON is a lightweight data-interchange format.

Crawling web information really just means sending a request to a web page; the server then returns the data to you.
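
To make the two presentation formats concrete, here is a small self-contained sketch; the HTML and JSON snippets are made up for illustration:

import json
from bs4 import BeautifulSoup

html = '<div class="news_title"><a href="/a1">Some headline</a></div>'   # HTML: data wrapped in tags
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.get_text(), soup.a.get('href'))

data = '{"title": "Some headline", "url": "/a1"}'                        # JSON: plain key-value data
obj = json.loads(data)
print(obj['title'], obj['url'])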

2. Get the dynamically loaded page source

Import the required modules and libraries:

from bs4 import BeautifulSoup
import time
import def_text_save as dts      # local helper module containing text_save() (defined below)
import def_get_data as dgd       # local helper module containing get_content() (defined below)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains   # mouse action class

 

Getting web information requires sending requests, and the requests library normally handles this well. But on closer inspection we find that NetEase News is loaded dynamically: requests only returns what the server sends back immediately, not the data the page loads afterwards. In this case Selenium can help us get more data. Think of Selenium as an automated testing tool: a Selenium test runs directly in the browser, just as a real user would.
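
Just to illustrate the difference (a side check, not part of the crawler), you can compare what requests returns with what the browser eventually renders:

import requests
from selenium import webdriver

url = 'http://news.163.com/domestic/'
static_html = requests.get(url).text        # only the HTML the server returns immediately
browser = webdriver.Firefox()
browser.get(url)
rendered_html = browser.page_source         # includes content filled in later by JavaScript
browser.quit()
print(len(static_html), len(rendered_html)) # the rendered source is usually much longer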

The browser I use is Firefox:

browser = webdriver.Firefox()    # swap in another WebDriver here to use a different browser
browser.maximize_window()        # maximize the window
browser.get('http://news.163.com/domestic/')

 

Now we can drive the browser to open the NetEase News page automatically.

Our goal is to crawl the whole Domestic section in one go. Watching the page, we see that new news items keep loading as we scroll down, and eventually a "Load More" button appears at the bottom:

This is where Selenium shows its strengths: automation and simulated mouse and keyboard operations:

browser.execute_script("window.scrollBy(0,5000)")   # scroll the page down; the value in parentheses is the scroll distance in pixels

 

Right-click the "Load More" button on the page and choose "Inspect Element" to view its markup:

With this class name we can locate the button, and a click event lets us click it automatically so the page keeps loading more news.

# Crawl the dynamically loaded part of the section's source code
info1 = []          # will hold the title/label/link dictionaries for each article
info_links = []     # will hold the article content links
try:
    while True:
        if browser.page_source.find("load_more_btn") != -1:
            browser.find_element_by_class_name("load_more_btn").click()
        browser.execute_script("window.scrollBy(0,5000)")
        time.sleep(1)
except:
    url = browser.page_source   # save the fully loaded page source (despite the name, this holds HTML, not a URL)
    browser.close()             # close the browser
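
A fixed time.sleep(1) works well enough here, but if the page loads more slowly Selenium also offers explicit waits. As a side note (not part of the original script), the click step could be written with WebDriverWait, for example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the button to become clickable, then click it
button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.CLASS_NAME, "load_more_btn")))
button.click()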

 

3. Get useful information

Simply put, BeautifulSoup is a Python library whose main job is to pull data out of web pages, which takes a lot of the burden off beginners. BeautifulSoup parses the page source, and with its helper functions we can easily retrieve the information we want, such as an article's title, its tags, and the hyperlink to the article text.

Again, right-click in the article title area and inspect the elements:

Looking at the structure of the page, we find that each div tag with class="news_title" contains an article's title and hyperlink. The soup.find_all() function finds every occurrence of the pattern we are after, so the contents of this level can be extracted in one pass. Finally, each piece of information is pulled out into a dictionary.

info_total = []
def get_data(url):
    soup = BeautifulSoup(url, "html.parser")
    titles = soup.find_all('div', 'news_title')
    labels = soup.find('div', 'ns_area second2016_main clearfix').find_all('div', 'keywords')
    for title, label in zip(titles, labels):
        data = {
            'Article Title': title.get_text().split(),
            'Article Label': label.get_text().split(),
            'link': title.find("a").get('href')
        }
        info_total.append(data)
    return info_total
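
The snippet above returns info_total, but the final loop further down reads the results from info1; the article does not show the connecting line, so presumably it is something like the following (my reconstruction):

info1 = get_data(url)   # url is the page source saved before closing the browser; use dgd.get_data(url) if the function lives in the helper module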

 

4. Get news

At this point the article links have been extracted and stored in a list; all that remains is to follow each link and fetch the article content. The article pages themselves are loaded statically, so requests handles them easily:

import requests

def get_content(url):
    info_text = []
    adata = requests.get(url)
    soup = BeautifulSoup(adata.text, 'html.parser')
    try:
        articles = soup.find("div", 'post_header').find('div', 'post_content_main').find('div', 'post_text').find_all('p')
    except:
        articles = soup.find("div", 'post_content post_area clearfix').find('div', 'post_body').find('div', 'post_text').find_all('p')
    for a in articles:
        a = a.get_text()
        a = ' '.join(a.split())      # collapse whitespace inside each paragraph
        info_text.append(a)
    return info_text

 

The try/except is needed because NetEase articles published in different periods use different page structures, so the two cases have to be parsed differently.
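
As a quick usage check (assuming the link list has already been filled, as it is in the next step), the function can be called on a single link:

paragraphs = get_content(info_links[0])   # list of paragraph strings for the first article
print('\n'.join(paragraphs))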

Finally, traverse the entire list to fetch all the text:

for i in info1:                  # info1 holds the dictionaries returned by get_data()
    info_links.append(i.get('link'))
x = 0                            # index into info1 for the matching title and label
info_content = {}                # stores the current article's content
for i in info_links:
    try:
        info_content['Article Content'] = dgd.get_content(i)
    except:
        x = x + 1                # keep the index in step with the links even when an article fails
        continue
    # build a file name from the article title by stripping brackets, quotes and other special characters
    s = str(info1[x]["Article Title"]).replace('[','').replace(']','').replace("'",'').replace(',','').replace('<','').replace('>','').replace('/','').replace(',',' ')
    s = ''.join(s.split())
    file = '/home/lsgo18/PycharmProjects/NetEase News'+'/'+s
    print(s)
    dts.text_save(file, info_content['Article Content'], info1[x]['Article Label'])
    x = x + 1

 

5. Store the data in a local txt file

Python provides the built-in open() function for handling files: the first parameter is the file path and the second is the mode. The 'w' mode is write-only (the file is created if it does not exist and emptied if it does).

def text_save(filename, data, label):   # filename: path of the txt file to write
    file = open(filename, 'w')
    file.write(str(label).replace('[','').replace(']','') + '\n')
    for i in range(len(data)):
        s = str(data[i]).replace('[','').replace(']','')        # strip the square brackets
        s = s.replace("'",'').replace(',','') + '\n'            # strip single quotes and commas, append a line break
        file.write(s)
    file.close()
    print("Save file successfully")

 

With that, a simple crawler has been written successfully.

That is all there is to crawling NetEase News; I hope it proves useful. See you!

Tags: Python Selenium Firefox JSON

Posted on Thu, 28 Nov 2019 23:54:10 -0800 by Arsench