Python implementation of news data mining

1. Extract Baidu news title, website, date and source

1.1 get source code of web page

We can get the source code of a web page with the following code. In this example, we fetch the source code of a Baidu News search for Alibaba.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/77.0.3865.120 Safari/537.36'}
url = 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=Alibaba'
res = requests.get(url, headers=headers)
web_text = res.text

 

Because Baidu News only responds properly to requests that look like they come from a browser, we set the headers parameter to simulate a browser request. In Chrome you can obtain your own User-Agent string by visiting about:version.

1.2 write regular expression to extract news information

1.2.1 source and date of news extraction

 

By observing the source code of the web page, we find that the source and release date of each news item sit between <p class="c-author"> and </p>. We can therefore extract them with a regular expression.

import re

pattern = '<p class="c-author">(.*?)</p>'
info = re.findall(pattern, web_text, re.S)      # re.S makes . match newlines as well
print(info)

 

The extracted strings still contain a lot of noise, such as spaces, line breaks, tabs and <img> tags, so they need a second round of cleaning, which is covered in the sections below.

1.2.2 extract the website and title of news.

To extract the news URL and title, we again look for patterns in the page source, as in the previous section. From the source code we find that each news link is preceded by <h3 class="c-title">.

 

 

Through the following two pieces of code, you can get the URL and title of the news respectively.

    pattern_href = '<h3 class="c-title">.*?<a href="(.*?)"'
    href = re.findall(pattern_href, web_text, re.S)
    print(href)

    pattern_title = '<h3 class="c-title">.*?>(.*?)</a>'
    title = re.findall(pattern_title, web_text, re.S)
    print(title)

 

The acquired data also needs secondary data cleaning.

1.2.3 data cleaning

  1. News title cleaning
    The extracted titles have two problems: each title ends with line breaks and stray spaces, and the middle contains unwanted tags such as <em> and </em>.

(1) Remove unnecessary spaces and line breaks with the strip() function.

 

for i in range(len(title)):
    title[i] = title[i].strip()

 

(2) Remove the <em> and </em> tags with the re.sub() function.

for i in range(len(title)):
    title[i] = title[i].strip()
    title[i] = re.sub('<.*?>', '', title[i])

 

  2. News source and date cleaning
    The extracted source and date strings have several problems: they contain a lot of <img> tag markup, the source and the date are joined together, and there are many line breaks, tabs and spaces.

    source = []
    date = []
    for i in range(len(info)):
        info[i] = re.sub('<.*?>', '', info[i])            # remove <img> and other tag markup
        source.append(info[i].split('&nbsp;&nbsp;')[0])   # split the source from the date
        date.append(info[i].split('&nbsp;&nbsp;')[1])
        source[i] = source[i].strip()
        date[i] = date[i].strip()

 

 

2. Obtain Baidu news of multiple companies in batches and generate data reports

This chapter shows how to obtain Baidu news for multiple companies in batches, automatically generate a data report and export it to a text file.

2.1 batch access to Baidu news of multiple companies

Here, we encapsulate the work of crawling web pages into a function.

 

def baidu_news(company):
    """
    //Get the source code of the web page, and extract Baidu news title, website, date and source
    :param company: corporate name
    :return: Source code of webpage
    """
    url = 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=' + company
    # Baidu News only responds properly to browser-like requests, so the headers parameter
    # is set to simulate a browser. In Chrome the User-Agent can be obtained from about:version.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/77.0.3865.120 Safari/537.36'}
    res = requests.get(url, headers=headers)
    web_text = res.text

    # Get news sources and dates
    pattern = '<p class="c-author">(.*?)</p>'
    info = re.findall(pattern, web_text, re.S)  # re.S makes . match newlines as well
    # print(info)

    # Get news URLs and titles
    pattern_href = '<h3 class="c-title">.*?<a href="(.*?)"'
    href = re.findall(pattern_href, web_text, re.S)
    # print(href)

    pattern_title = '<h3 class="c-title">.*?>(.*?)</a>'
    title = re.findall(pattern_title, web_text, re.S)
    # print(title)

    # Title data cleaning
    for i in range(len(title)):
        title[i] = title[i].strip()
        title[i] = re.sub('<.*?>', '', title[i])

    # print(title)

    # News source and date cleaning
    source = []
    date = []

    for i in range(len(info)):
        info[i] = re.sub('<.*?>', '', info[i])  # remove <img> and other tag markup
        source.append(info[i].split('&nbsp;&nbsp;')[0])  # split the source from the date
        date.append(info[i].split('&nbsp;&nbsp;')[1])
        source[i] = source[i].strip()
        date[i] = date[i].strip()

        print(str(i+1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')

 

Then write the calling procedure in the main function.

    companys = ['Huaneng trust', 'tencent', 'Alibaba']
    for company in companys:
        baidu_news(company)
        print(company + ' Baidu news crawled successfully')

 

The results are as follows:

1. Which is the best trust performance? Last year, the two companies had a net profit of more than 3 billion. Huachen Huarong was a bit "bleak" (23:38, January 21, 2020 - Daily Economic News)
2.which of the 57 trusts is the best? CITIC, Huaneng and Chongqing Trust are among the top three (07:47, January 20, 2020 - Sina Finance)
3. Net profit ranking in 2019: CITIC Trust is the top 3.593 billion (16:51, January 16, 2020 - financial sector)
4. Special plan for "Huaneng trust chain finance technology sincerity phase 2 supply chain financial assets support"... (23:10, January 16, 2020 - financial sector)
5. Which is the best trust company in 2019? (19:53, January 17, 2020 - China Financial News Network)
6. Last year, 185 listed companies' electronic equipment manufacturing industry was the focus of trust research (10:07, January 6, 2020 - China Foundation Network)
7. About providing "Huaneng trust - one party sincerity phase 3 special plan of supply chain financial asset support" (23:10-sina, January 6, 2020)
8. Muyuan shares and Huaneng trust plan to set up two pig raising subsidiaries with 7.2 billion yuan (22:28, December 11, 2019 - tonghuashun finance and Economics)
9. Muyuan Co., Ltd. raised pigs in many places and bloomed, borrowed money from lihuaneng trust (10:32-caixin, December 11, 2019)
10. The joint venture established by muyuan shares (002714.SZ) and Huaneng Guicheng trust has been registered and established (21:09-sina, December 12, 2019)
Huaneng trust Baidu news crawled successfully
1. Tencent assisted police in solving the first N95 mask fraud in Guangdong (1 hour ago - China News)
2. Tencent combined with micro medicine and other five platforms to provide free diagnosis services (1 hour ago - Sina)
3. Tencent combined with five platforms, online free diagnosis of new pneumonia (1 hour ago - Mobile Phoenix)
4. Tencent documents are open to free members and fully support telecommuting (4 hours ago - Sina)
5. The worst game studio! In 2013, 9 masterpieces were so poor that Tencent took a fancy to them and launched (1 hour ago - 17173 game net)
6. TME Tencent musicians actively created public songs "sound" to help Wuhan (54 minutes ago - Tencent Technology)
7. If CDB NS is not the focus of Tencent Nintendo cooperation, what is the focus? (1 hour ago - Sina)
8. Tencent cloud provided free cloud supercomputing to Professor Huang Niu's laboratory and Professor Luo Haibin's team (23 minutes ago - titanium media)
9. No suspension of classes Tencent class helped Chongqing No.11 middle school and No.3 high school off the course for the first time online (6 hours ago - Global Network)
10. Tencent document open free member function number of collaborative editors to 200 (50 minutes ago - Zhongguancun Online)
tencent Baidu news crawled successfully
1. Alibaba's direct medical supplies from 14 countries around the world will arrive in China successively through China Eastern Airlines (released in Wuhan 39 minutes ago)
2. The first batch of N95 masks purchased by Alibaba global medical materials arrived in Wuhan today (29 minutes ago - Zhejiang News)
3. Alibaba and China Eastern Airlines jointly purchase and transport medical materials in 14 countries (1 hour ago - Sina)
4. Alibaba's global purchase of medical materials arrived in Shanghai in succession (48 minutes ago - tonghuashun finance and Economics)
5. After Alibaba's "ban", businesses offer high imitation masks! Seize this point and make a move to identify (9 minutes ago - IT cracker king)
6. Alibaba and other 100 enterprises promise that the price of anti epidemic and livelihood products will not rise (6 hours ago - TechWeb)
7. [Alibaba and other enterprises jointly launched the initiative of "Three Guarantees let us go together"] (5 hours ago - Sina)
8. Alibaba online fever clinic query has covered 5734 fever clinics (11:10, January 29, 2020 - China News Network)
9. The masks for Wuhan epidemic are out of stock. Alibaba sent an urgent notice that they can't be bought! (1 hour ago - Science and technology season)
10. Alibaba and other enterprises respond to the "Three Guarantees" action of the General Administration of market supervision (23:47, January 29, 2020 - Sina Finance)
Alibaba Baidu news crawled successfully

2.2 automatic generation of public opinion data report text file

The previous section crawled the news we want and printed the public opinion results. Next, we export them to a text file.
At the end of the baidu_news() function, add the following code to write the results to a file:

    file_ = open('Data Mining Report.txt', 'a')     # Append mode, so existing data is not overwritten
    file_.write(company + ' news data:' + '\n' + '\n')
    for i in range(len(title)):
        file_.write(str(i+1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')' + '\n')
        file_.write(href[i] + '\n')

    file_.write('----------------------' + '\n' + '\n')
    file_.close()

 

3. Exception handling and 24-hour real-time data mining practice

3.1 exception handling

Here we wrap the execution of the baidu_news() function in a try/except block so that a failure for one company does not stop the whole run.

    companys = ['Huaneng trust', 'tencent', 'Alibaba']
    for company in companys:
        try:
            baidu_news(company)
            print(company + ' Baidu news crawled successfully!')
        except:
            print(company + ' Baidu news crawling failed!')

 

3.2 24-hour real-time crawling

Implementing this is straightforward: wrap the whole crawling routine in a while True loop and, at the end of each iteration, call the time.sleep() function from the time library so the program runs again after a fixed interval, as sketched below.
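A minimal sketch of this structure, assuming the baidu_news() function defined above and an arbitrary one-hour interval:

import time

while True:
    companys = ['Huaneng trust', 'tencent', 'Alibaba']
    for company in companys:
        try:
            baidu_news(company)
            print(company + ' Baidu news crawled successfully!')
        except:
            print(company + ' Baidu news crawling failed!')
    time.sleep(3600)    # wait one hour (example interval) before starting the next round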
With the code above we can already fetch news content around the clock, but we will inevitably crawl duplicate news items. Removing these duplicates (data deduplication) requires some knowledge of databases and is covered in later chapters; a simple in-memory stop-gap is sketched below for interested readers.
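This sketch is not the database approach deferred to later chapters; it simply remembers the URLs already reported during the current run. record_news() is a hypothetical helper illustrating how the print loop inside baidu_news() could be adapted:

seen_hrefs = set()               # URLs already recorded during the current run

def record_news(title, href, date, source):
    """Print only news items whose URL has not been seen in an earlier round."""
    for i in range(len(title)):
        if href[i] in seen_hrefs:
            continue             # duplicate of an item from a previous crawl, skip it
        seen_hrefs.add(href[i])
        print(str(i + 1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')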

4. Crawling in chronological order and batch crawling multi-page content

The previous chapters only crawl the first page of Baidu News search results, so the data is not comprehensive. This chapter introduces how to crawl multiple pages in one batch.

4.1 crawling Baidu news in chronological order

No code changes are needed for this by itself: Baidu News sorts results by "focus" by default, so we simply click the "sort by time" button in the upper right corner of the search results page and use the resulting URL; the difference we observe is sketched below.
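For reference, a sketch of the modified URL under the assumption (based on observation, so verify against your own address bar) that clicking the button only changes the rtt parameter:

# Assumption: rtt=1 is the default "focus" order, rtt=4 sorts by time.
# Confirm by clicking "sort by time" and copying the URL from the browser's address bar.
url = 'https://www.baidu.com/s?rtt=4&bsst=1&cl=2&tn=news&word=' + company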

4.2 batch crawling multi-page content at once

If we want to crawl multiple pages, we need to analyze how each page's URL differs.
The URL of the first page is:

https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=Alibaba

The URL of the second page is:

https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=Alibaba&pn=10

The URL of the third page is:

https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=Alibaba&pn=20

By comparison, the only difference between pages is the &pn=xx suffix, and the first page behaves as if &pn=0 were appended to its URL. The code can therefore be modified as follows.

 

def baidu_news(company, page):
    """
    //Get the source code of the web page, and extract Baidu news title, website, date and source
    :param company: corporate name
    :param page: Number of pages to crawl
    :return: Source code of webpage
    """
    num = (page - 1) * 10

    url = 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=' + company + '&pn=' + str(num)

    res = requests.get(url, headers=headers, timeout=10)    # headers as before; stop waiting after 10 seconds without a response
    web_text = res.text
    # Code for data extraction, cleaning and crawling is omitted here

def main():
    companys = ['Huaneng trust', 'tencent', 'Alibaba']
    for company in companys:
        for i in range(5):      # crawl 5 pages
            try:
                baidu_news(company, i+1)
                print(company + ' page ' + str(i+1) + ' news crawled successfully!')
            except:
                print(company + ' page ' + str(i+1) + ' news crawling failed!')

  

5. Sogou news and Sina Financial Data Mining

The method adopted here is similar to that of crawling Baidu news.

5.1 Sogou news data crawling

  1. First, get the Sogou News URL. Searching for "Alibaba" in Sogou News gives the following URL (after deleting the unnecessary parameters):

https://news.sogou.com/news?query=Alibaba


The complete code is as follows:

"""
    //By Aidan
    //Time: 30 / 01 / 2020
    //Function: crawling Search Dog News Data
"""

import requests
import re

# News sites only respond properly to browser-like requests, so the headers parameter
# is set to simulate a browser. In Chrome the User-Agent can be obtained from about:version.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/77.0.3865.120 Safari/537.36'}

def sougou_news(company, page):
    """
    //Get the source code of the web page and extract the news title, website, date and source of Sogou
    :param company: corporate name
    :param page: Number of pages to crawl
    :return: Source code of webpage
    """

    url = 'https://news.sogou.com/news?query=' + company + '&page=' + str(page)

    res = requests.get(url, headers=headers, timeout=10)    # timeout=10: give up if the site does not respond within 10 seconds
    web_text = res.text

    # Get news date
    pattern_date = '<p class="news-from">.*?&nbsp;(.*?)</p>'
    date = re.findall(pattern_date, web_text, re.S)  # re.S makes . match newlines as well
    # print(date)

    # Get news URL and title
    pattern_href = '<h3 class="vrTitle">.*?<a href="(.*?)"'
    href = re.findall(pattern_href, web_text, re.S)
    # print(href)

    pattern_title = '<h3 class="vrTitle">.*?>(.*?)</a>'
    title = re.findall(pattern_title, web_text, re.S)
    # print(title)

    # Data cleaning
    for i in range(len(title)):
        title[i] = re.sub('<.*?>', '', title[i])
        title[i] = re.sub('&.*?;', '', title[i])
        date[i] = re.sub('<.*?>', '', date[i])

    file_ = open('Sogou Data Mining Report.txt', 'a')     # Append mode, so existing data is not overwritten
    file_.write(company + ' page ' + str(page) + ' news data:' + '\n' + '\n')
    for i in range(len(title)):
        file_.write(str(i+1) + '.' + title[i] + '(' + date[i] + ')' + '\n')
        file_.write(href[i] + '\n')

    file_.write('----------------------' + '\n' + '\n')
    file_.close()

def main():
    companys = ['Huaneng trust', 'tencent', 'Alibaba']
    for company in companys:
        for i in range(5):      # crawl 5 pages
            try:
                sougou_news(company, i+1)
                print(company + ' page ' + str(i+1) + ' news crawled successfully!')
            except:
                print(company + ' page ' + str(i+1) + ' news crawling failed!')

if __name__ == '__main__':
    main()

