The previous post introduced the use of Selenium. The code is as follows:
from selenium import webdriver

def get_source_code(url):
    # Get the page source with a headless (no-interface) Chrome browser
    chrome_option = webdriver.ChromeOptions()
    chrome_option.add_argument('--headless')
    browser = webdriver.Chrome(options=chrome_option)
    browser.get(url)
    data = browser.page_source
    browser.quit()  # release the browser once the source has been captured
    return data
After obtaining the page source, we need to extract the data. Observing the source code reveals the pattern of the real-time stock information:
<div id="price" class="up">Price of stock</div>
Here we find that the id of the Kweichow Moutai stock-price element is price. An id is like an ID number: it is generally unique within a web page. The class attribute represents a category. Since Moutai's closing price rose on February 7, 2020, we can speculate that up represents the rising category and down the falling category. Verification confirms this speculation.
import re

sina_url = 'https://finance.sina.com.cn/realstock/company/sh600519/nc.shtml'
source_data = get_source_code(sina_url)
price_pattern = '<div id="price" class=.*?>(.*?)</div>'
price = re.findall(price_pattern, source_data)
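Since the class attribute encodes the direction (up or down), the same page source can also tell us whether the price rose or fell. A minimal sketch, run here on an inline HTML fragment standing in for the live page source (the fragment's structure mirrors the <div id="price"> sample above; the price value is made up for illustration):

```python
import re

# Inline sample standing in for the live page source (assumed structure)
source_data = '<div id="price" class="up">1071.56</div>'

# Capture both the class (direction) and the text (price) in one pass;
# findall with two groups returns a list of (class, text) tuples
pattern = '<div id="price" class="(.*?)">(.*?)</div>'
direction, price = re.findall(pattern, source_data)[0]

label = 'rising' if direction == 'up' else 'falling'
print(price, label)  # → 1071.56 rising
```

The same two-group technique applies to any tag where both an attribute and the element text are needed.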
Eastmoney (eastmoney.com, "Dongfang Fortune") is a professional Internet financial media outlet providing 7x24-hour financial news and global market quotations, gathering comprehensive financial news and financial market information.
Here we can obtain news about listed companies from Eastmoney. If you access it through the Requests library, you cannot get the full page source even with the headers parameter, because part of the page is rendered dynamically. Therefore, the Selenium library is used to implement the news-crawling function.
The website is again accessed with a headless browser. Here we take the A-share listed company BOE A as an example:
import re

east_url = 'http://so.eastmoney.com/news/s?keyword=BOE A'
source_data = get_source_code(east_url)
# print(source_data)

# Write regular expressions to extract the data
title_pattern = '<div class="news-item"><h3><a href=.*?>(.*?)</a>'
href_pattern = '<div class="news-item"><h3><a href=(.*?)>.*?</a>'
date_pattern = '<p class="news-desc">(.*?)</p>'
title = re.findall(title_pattern, source_data)
href = re.findall(href_pattern, source_data)
date = re.findall(date_pattern, source_data)

# Data cleaning
for i in range(len(title)):
    title[i] = re.sub('<.*?>', '', title[i])  # the title contains <em> tags
    date[i] = date[i].split(' ')[0]  # the date is the first space-separated field
    print(str(i+1) + '.' + title[i] + ' - ' + date[i])
    print(href[i])
The execution result is:
1. Net sales of Shenzhen Stock Connect for three consecutive days; BOE A accumulated net sales of 272 million yuan - 2020-02-07
http://stock.eastmoney.com/news/1406,202002071376301774.html
2. [Research Express] BOE A received 278 institutions such as Guotai Junan Venture Capital - 2020-02-07
http://stock.eastmoney.com/news/11064,202002071376191946.html
3. In February, the price increase of TV panels was higher than expected, and the impact on mobile panels was limited - 2020-02-07
http://finance.eastmoney.com/news/1355,202002071376099760.html
4. BOE A (000725) margin trading information (02-06) - 2020-02-07
http://stock.eastmoney.com/news/1697,202002071375472172.html
5. BOE A: net financing purchases of 183 million yuan, ranking 12th in the two markets (02-06) - 2020-02-06
http://stock.eastmoney.com/news/11075,202002071375376492.html
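The cleaning step matters because the search page highlights the keyword inside each title with <em> tags, and the description line carries more than just the date. A minimal sketch of the same regex cleaning, run on an inline fragment that mimics the assumed news-item markup rather than the live page (the title and URL are made up for illustration):

```python
import re

# Inline fragment mimicking the assumed eastmoney news-item markup
sample = ('<div class="news-item"><h3><a href="http://example.com/1.html">'
          'Panel prices rise, <em>BOE A</em> benefits</a>'
          '<p class="news-desc">2020-02-07 10:30:00 - summary text</p>')

# Extract the raw title, then strip the <em> highlight tags
title = re.findall('<div class="news-item"><h3><a href=.*?>(.*?)</a>', sample)[0]
title = re.sub('<.*?>', '', title)

# Extract the description and keep only the leading date field
date = re.findall('<p class="news-desc">(.*?)</p>', sample)[0]
date = date.split(' ')[0]

print(title, '-', date)  # → Panel prices rise, BOE A benefits - 2020-02-07
```

Note that `.*?` is non-greedy, so the title capture stops at the first `</a>` while still passing over the embedded `<em>…</em>` pair, which the `re.sub` call then removes.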
China Judgements Online (wenshu.court.gov.cn) is the most authoritative website for published, legally effective judgment documents, and it has high reference value for internal risk control and public-opinion monitoring in the financial industry. Its homepage is shown in the figure below; the Selenium library continues to be used here to obtain website information.
Here, as an example, we search for "real estate" related cases on the website. The function to implement is to simulate typing "real estate" into the search box and clicking the "search" button. To achieve this, you first need to obtain the XPath or CSS selector of the search box and the search button; the method is detailed in the earlier introduction to the Selenium library.
The implementation code is as follows:
from selenium import webdriver
import time

wenshu_url = 'http://wenshu.court.gov.cn/'
browser = webdriver.Chrome()
browser.get(wenshu_url)
browser.find_element_by_xpath('//*[@id="_view_1540966814000"]/div/div/div/input').clear()
browser.find_element_by_xpath('//*[@id="_view_1540966814000"]/div/div/div/input').send_keys('Real estate')
browser.find_element_by_xpath('//*[@id="_view_1540966814000"]/div/div/div').click()
time.sleep(20)
data = browser.page_source
print(data)
Note that the search box on this website contains some default text, so the clear() function is used to empty the search box before entering the keyword "real estate". The purpose of time.sleep(20) is that after clicking search there is a loading process; to ensure the source code obtained is complete, you need to wait a while before retrieving it.
cninfo (www.cninfo.com.cn) is the information disclosure website for listed companies designated by the China Securities Regulatory Commission. It was the first large-scale professional securities website in China to comprehensively disclose announcement information and market data for the more than 3,000 listed companies in Shenzhen and Shanghai.
The mining task here is to obtain the title, URL, and release date of each listed-company announcement matching a specified keyword on cninfo.
- First, get the information of "virus"-related announcements. Searching for "virus" on cninfo.com.cn shows many announcements; then get the page source.
import re

juchao_url = 'http://www.cninfo.com.cn/new/fulltextSearch?notautosubmit=&keyword=virus'
data = get_source_code(juchao_url)

# Extract the data
title_pattern = '<span title="" class="r-title">(.*?)</span>'
href_pattern = '<a target="_blank" href="(.*?)" data-id='
date_pattern = r'<span class="time">\s*(.*?)\s*</span>'
title = re.findall(title_pattern, data)
href = re.findall(href_pattern, data)
date = re.findall(date_pattern, data)

# Data cleaning
for i in range(len(title)):
    title[i] = re.sub('<.*?>', '', title[i])
    title[i] = title[i].strip()
    date[i] = date[i].strip()
    href[i] = 'http://www.cninfo.com.cn' + href[i]
    print(str(i+1) + '.' + title[i] + ' - ' + date[i])
    print(href[i])
The crawled href values lack the prefix http://www.cninfo.com.cn, which is why it is concatenated onto each link during data cleaning.
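Prefix concatenation like this can also be done with urljoin from Python's standard library, which handles leading slashes correctly and, unlike plain string concatenation, leaves hrefs that are already absolute untouched. A minimal sketch with a hypothetical relative path of the kind the href regex might return:

```python
from urllib.parse import urljoin

base = 'http://www.cninfo.com.cn'
# Hypothetical relative path, for illustration only
relative = '/new/disclosure/detail?announcementId=123'

# Join the base site with the relative link
full = urljoin(base, relative)
print(full)  # → http://www.cninfo.com.cn/new/disclosure/detail?announcementId=123

# An already-absolute link passes through unchanged
absolute = urljoin(base, 'http://static.cninfo.com.cn/report.pdf')
print(absolute)  # → http://static.cninfo.com.cn/report.pdf
```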