Topic of the 2020 Python Developer Online Technology Summit: Technical Implementation of a Crawler Framework and Experience Sharing on Module Applications


1. Preface

On February 15, CSDN joined forces with PyCon China, wuhan2020 and xinguan2020 to hold the 2020 Python Developer Day Online Technology Summit under the theme "Fighting the Epidemic, Developers in Action". The summit focused on Python's concrete applications and projects during the epidemic and demonstrated the power of open code among Python developers and enthusiasts.

When I received the invitation from the organizer, my first feeling was one of pressure and responsibility, because the backdrop of this event is the current epidemic: every sector is helping Hubei, all eyes are on Wuhan, and the situation weighs on the hearts of hundreds of millions of people. At that moment, I couldn't help writing the following text on the PPT:


This is a purely public-service event with no commercial interest involved. Participants can join free of charge or pay 19 yuan, entirely voluntarily. If there is any income, the organizer will donate all of it to the areas most urgently in need of assistance.

All the code in this article has been uploaded to my GitHub at https://github.com/xu5/2020Pyday; download it if needed.

2. Crawler Concepts We Must Know

Crawling is probably one of the first technologies a Pythoneer touches, and one of the most used. It looks simple, but it touches many technical areas such as network communication, application protocols, HTML/CSS/JS, data parsing and service frameworks. Beginners often find it hard to master, and many even equate crawlers with a single popular crawling library such as scrapy. I believe concepts are the cornerstone of theory and ideas are the forerunners of code: understand the basic concepts and principles first, then write and use crawlers, and you will get twice the result with half the effort.

2.1 Definition of a crawl

  • Definition 1: A crawler is a program that automatically grabs information of value to us from the Internet.
  • Definition 2: A crawler, also known as a spider, is a web robot that automatically browses the World Wide Web.

Read the two carefully and you will notice a subtle difference: the former leans toward crawling specific targets, while the latter leans toward crawling an entire site or even the whole web.

2.2 Legal risk of Crawlers

As a computer technology, crawlers themselves are not prohibited by law, but using crawler technology to obtain data can carry legal and even criminal risks, for example:

  • Overload crawling, resulting in site paralysis or inaccessibility
  • Illegal intrusion into computer information systems
  • Crawl personal information
  • Privacy violation
  • Unfair Competition

2.3 Understanding crawl types from crawl scenarios

  • Focused web crawler: targets a specific object or goal, and is usually short-lived
  • Incremental web crawler: crawls only the incremental portion, which implies frequent or periodic crawling
  • Deep web crawler: for data that cannot be reached through static links and is hidden behind forms such as login and search; it can only be obtained by submitting the necessary information
  • Universal web crawler: also known as a full-web crawler; collects data for portal-site search engines and large web service providers

2.4 Basic techniques and crawl framework for Crawlers

A basic crawler framework consists of at least three parts: dispatch server, data downloader and data processor, which correspond to three basic crawler technologies: dispatch service framework, data capture technology and data preprocessing technology.


The figure above shows a framework we have been using in recent years. Compared with the basic framework, it has an additional management platform for configuring download tasks, monitoring the working status of each part of the system, monitoring data arrival, balancing the load on each node, analyzing the continuity of downloaded data, patching or re-downloading data, and so on.

3. Data Grabbing Technology

Typically, we use the standard urllib module or third-party modules such as requests/pycurl to capture data, and sometimes the automated testing tool selenium. Of course, there are also many ready-made frameworks, such as pyspider/scrapy. The key technical points of data capture include (a short sketch follows the list):

  • Construct and send requests: methods, headers, parameters, cookie files
  • Receive and interpret responses: response code, response type, response content, encoding format
  • A single data capture often consists of multiple request-response cycles
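To make these points concrete, here is a minimal sketch using requests; the URL, headers, parameters and cookie values are placeholders of my own, not part of the original talk:

# A minimal sketch of a single request-response with requests.
# The endpoint, headers, parameters and cookie values below are placeholders.
import requests

url = 'https://example.com/api/data'               # placeholder endpoint
headers = {'User-Agent': 'Mozilla/5.0'}            # request headers
params = {'page': 1, 'size': 20}                   # query-string parameters
cookies = {'sessionid': 'xxxx'}                    # cookies, if the site requires them

resp = requests.get(url, headers=headers, params=params, cookies=cookies, timeout=10)

print(resp.status_code)                  # response code
print(resp.headers.get('Content-Type'))  # response type
print(resp.encoding)                     # encoding format
text = resp.text                         # response content, decoded as text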

3.1 Tencent NPC epidemic data download

With a little analysis of Tencent's real-time epidemic tracking page, we can easily find its data service URL:

https://view.inews.qq.com/g2/getOnsInfo

And three QueryString parameters:

  • name: disease_h5
  • callback: callback function name (left empty in the code below)
  • _: Time stamp accurate to milliseconds

With that, everything falls into place:

>>> import time, json, requests
>>> url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=&_=%d'%int(time.time()*1000)
>>> data = json.loads(requests.get(url=url).json()['data'])
>>> data.keys()
dict_keys(['lastUpdateTime', 'chinaTotal', 'chinaAdd', 'isShowAdd', 'chinaDayList', 'chinaDayAddList', 'dailyNewAddHistory', 'dailyDeadRateHistory', 'confirmAddRank', 'areaTree', 'articleList'])
>>> d = data['areaTree'][0]['children']
>>> [item['name'] for item in d]
['Hubei', 'Guangdong', 'Henan', 'Zhejiang', 'Hunan', 'Anhui', 'Jiangxi', 'Jiangsu', 'Chongqing', 'Shandong', ..., 'Hong Kong', 'Taiwan', 'Qinghai', 'Macao', 'Tibet']
>>> d[0]['children'][0]
{'name': 'Wuhan', 'today': {'confirm': 1104, 'suspect': 0, 'dead': 0, 'heal': 0, 'isUpdated': True}, 'total': {'confirm': 19558, 'suspect': 0, 'dead': 820, 'heal': 1377, 'showRate': True, 'showHeal': False, 'deadRate': 4.19, 'healRate': 7.04}}

For a more detailed walkthrough, please refer to my post Python in Action: Grab real-time pneumonia epidemic data and draw a 2019-nCoV epidemic map.

3.2 Modis Data Download

MODIS is an important sensor mounted on the TERRA and AQUA remote-sensing satellites. It is the only spaceborne instrument that broadcasts real-time observations directly to the world over the X-band, and its data can be received free of charge. Its spectral range is wide: 36 bands, from 0.4 µm to 14.4 µm. The steps to download MODIS data are as follows:

  1. Request https://urs.earthdata.nasa.gov/home by GET
  2. Parse the authenticity token from the response
  3. Construct the form, filling in the username, password and token
  4. Request https://urs.earthdata.nasa.gov/login by POST
  5. Record the cookies
  6. Request the file download page by GET
  7. Parse the file download URL from the response
  8. Download the file


Next, we complete this process interactively in Python IDLE using the requests module. Of course, the same functionality can be achieved with the pycurl module; I have provided both requests and pycurl implementations on GitHub.

>>> import re
>>> from requests import request
>>> from requests.cookies import RequestsCookieJar
>>> resp = request('GET', 'https://urs.earthdata.nasa.gov/home')
>>> pt = re.compile(r'.*<input type="hidden" name="authenticity_token" value="(.*)" />.*')
>>> token = pt.findall(resp.text)[0]
>>> jar = RequestsCookieJar()
>>> jar.update(resp.cookies)
>>> url = 'https://urs.earthdata.nasa.gov/login'
>>> forms = {"username": "linhl", "redirect_uri": "", "commit": "Log+in", "client_id": "", "authenticity_token": token, "password": "*********"}
>>> resp = request('POST', url, data=forms, cookies=jar)
>>> resp.cookies.items()
[('urs_user_already_logged', 'yes'), ('_urs-gui_session', '4f87b3fd825b06ad825a666133481861')]
>>> jar.update(resp.cookies)
>>> url = 'https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MOD13Q1/2019/321/MOD13Q1.A2019321.h00v08.006.2019337235356.hdf'
>>> resp = request('GET', url, cookies=jar)
>>> pu = re.compile(r'href="(https://ladsweb.modaps.eosdis.nasa.gov.*hdf)"')
>>> furl = pu.findall(resp.text)[0]
>>> furl = furl.replace('&amp;', '&')
>>> resp = request('GET', furl, cookies=jar)
>>> with open(r'C:\Users\xufive\Documents\215PyCSDN\fb\modis_demo.hdf', 'wb') as fp:
	fp.write(resp.content)

The MODIS data we downloaded is in HDF format. HDF (Hierarchical Data Format) is a hierarchical data format developed by the National Center for Supercomputing Applications (NCSA) to store and distribute scientific data efficiently and to meet research needs in a variety of fields. HDF satisfies many of the requirements of scientific data storage and distribution. HDF, together with another data format, netCDF, is used not only in the United States but also in China, especially in space science, atmospheric science and geophysics, where almost all data distribution relies on these two formats.
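If you want to peek inside the file straight away, one option (my own suggestion, assuming the third-party pyhdf package is installed; the talk itself does not cover reading HDF) is the following sketch, which lists the datasets in the file and reads one of them into a numpy array:

# A hedged sketch for inspecting the downloaded HDF4 file with pyhdf.
# 'modis_demo.hdf' refers to the file saved in the previous step.
from pyhdf.SD import SD, SDC

hdf = SD('modis_demo.hdf', SDC.READ)    # open the scientific dataset interface
print(list(hdf.datasets().keys()))      # list the dataset names in the file

name = list(hdf.datasets().keys())[0]   # take the first dataset as an example
sds = hdf.select(name)                  # select a dataset by name
arr = sds.get()                         # read it as a numpy array
print(name, arr.shape)

sds.endaccess()
hdf.end()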

And this is what the freshly downloaded HDF data file really looks like:

3.3 Quark AI Search for NPC Epidemic Data

Quark AI Search for NPC Epidemic Data is a site whose data cannot be grabbed by ordinary methods: the page source does not match the content displayed on the page at all (the page is rendered client-side). Faced with such websites, what other technical means do we have? Don't worry, I'll introduce an interesting data capture technique: as long as you can reach the data through your browser's address bar, you can grab it. You really can "grab whatever you can see".

This "visible means grabbable" approach relies on the selenium module. In fact, selenium is not a data capture tool but an automated testing tool for websites, supporting mainstream browsers such as Chrome, Firefox and Safari. Grabbing data with selenium is not a common method, because it only supports the GET method (there are, of course, extensions that add POST support, such as the selenium-requests module).

>>> from selenium import webdriver
>>> from selenium.webdriver.chrome.options import Options
>>> opt = Options()
>>> opt.add_argument('--headless')
>>> opt.add_argument('--disable-gpu')
>>> opt.add_argument('--window-size=1366,768')
>>> driver = webdriver.Chrome(options=opt)
>>> url = 'https://broccoli.uc.cn/apps/pneumonia/routes/index?uc_param_str=dsdnfrpfbivesscpgimibtbmnijblauputogpintnwktprchmt&fromsource=doodle'
>>> driver.get(url)
>>> with open(r'd:\broccoli.html', 'w') as fp:
	fp.write(driver.page_source)
	
247532
>>> driver.quit()

For more details on installing and using the selenium module, see my post introducing this interesting data capture technique: grab whatever you can see.

4. Data Preprocessing Technology

4.1 Common preprocessing techniques

Questions to consider for data preprocessing:

  • Is the data format standard?
  • Is the data complete?
  • What should be done with non-conforming or incomplete data?
  • How should the data be stored? How should it be distributed?

Based on these considerations, the following preprocessing techniques have emerged (a small sketch follows the list):

  • xml/html data parsing
  • Text Data Parsing
  • Data cleaning: duplicate checking, deduplication, gap filling, interpolation, standardization
  • Data Storage and Distribution
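As a toy illustration of the cleaning steps above (the sample records are made up; real crawl output is of course messier):

# A toy sketch of deduplication and interpolation; the records are invented.
import numpy as np

records = [('2020-02-10', 100), ('2020-02-10', 100), ('2020-02-11', None), ('2020-02-12', 140)]

# Deduplicate by date while keeping order
seen, cleaned = set(), []
for rec in records:
    if rec[0] not in seen:
        seen.add(rec[0])
        cleaned.append(rec)

# Fill the missing value by linear interpolation with numpy
x = np.arange(len(cleaned))
y = np.array([np.nan if v is None else v for _, v in cleaned], dtype=float)
mask = ~np.isnan(y)
y_filled = np.interp(x, x[mask], y[mask])
print(list(zip([d for d, _ in cleaned], y_filled)))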

4.2 Examples of analysis: geomagnetic index (dst)

The geomagnetic index is a graded index describing the intensity of magnetic disturbance over a period of time. The geomagnetic index used by observatories at low and middle latitudes is called the Dst index; it is measured hourly and mainly reflects changes in the intensity of the horizontal component of the geomagnetic field. This site provides Dst index downloads, and the page lists the Dst index for every hour of every day of the last month. The whole process is as follows:

  1. Grab the html page, using requests
  2. Parse the text data out of the html and save it as a data file, using bs4
  3. Parse the text data and save it as a two-dimensional table, using regular expressions

We also implement this process interactively in Python IDLE:

>>> import requests
>>> html = requests.get('http://wdc.kugi.kyoto-u.ac.jp/dst_realtime/lastmonth/index.html')
>>> with open(r'C:\Users\xufive\Documents\215PyCSDN\dst.html', 'w') as fp:
	fp.write(html.text) 

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html.text, "lxml")
>>> data_text = soup.pre.text
>>> with open(r'C:\Users\xufive\Documents\215PyCSDN\dst.txt', 'w') as fp:
	fp.write(data_text)

>>> import re
>>> r = re.compile(r'^\s?(\d{1,2})' + r'\s+(-?\d+)' * 24 + '$', flags=re.M)
>>> data_re = r.findall(data_text)
>>> data = dict()
>>> for day in data_re:
	data.update({int(day[0]):[int(day[hour+1]) for hour in range(24)]})
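To actually finish step 3 and save the parsed result as a two-dimensional table (one row per day, followed by its 24 hourly values), a simple continuation of the session could be as follows; the csv path is my own choice, following the pattern used above:

>>> with open(r'C:\Users\xufive\Documents\215PyCSDN\dst.csv', 'w') as fp:
	for day in sorted(data.keys()):
		fp.write('%d,%s\n'%(day, ','.join(str(v) for v in data[day])))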

5. Data Deep Processing Technology

5.1 Data Visualization

Data visualization is the visual representation of data; its purpose is to convey and communicate information clearly and effectively by graphical means. It is an evolving concept whose boundaries keep expanding, and it is itself often regarded as one of the means of data mining.

matplotlib, Python's most influential 2D plotting library, provides a complete set of command-style APIs similar to MATLAB and is well suited to interactive plotting. It can also be conveniently embedded in GUI applications as a drawing control. matplotlib can draw many kinds of charts, including ordinary line charts, histograms, pie charts, scatter plots and error-bar plots; it makes it easy to customize properties such as line type, color, thickness and font size; and it supports a subset of TeX typesetting commands, so mathematical formulas can be displayed beautifully inside figures.
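As a tiny illustration of these points (arbitrary data, with a TeX formula in the legend):

# A minimal matplotlib sketch: line style, color, grid and a TeX-style legend label.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), 'r--', linewidth=2, label=r'$y=\sin(x)$')  # TeX formula in the legend
plt.legend(loc='upper right')
plt.grid(linestyle=':')
plt.show()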

Although matplotlib focuses mainly on plotting, and mainly on two-dimensional plotting, it also has a number of extensions that let us draw on geographic maps, work together with Excel, and produce 3D charts. In the matplotlib world these extensions are called toolkits: collections of specific functions focused on one topic, such as 3D drawing. The more popular toolkits include Basemap, GTK, Excel, Natgrid, AxesGrid, mplot3d and so on.

Pyecharts is also an excellent plotting library; in particular, its Geo geographic coordinate system is powerful and easy to use. I have known it since its JavaScript incarnation, echarts. However, Pyecharts also has prominent drawbacks: first, there is no continuity between versions; second, it does not support TeX typesetting commands. The second problem in particular seriously restricts Pyecharts' room for growth.

For 3D data visualization, PyOpenGL is recommended, along with VTK / Mayavi / Vispy. I also have an open-source 3D library at https://github.com/xufive/wxgl; for details, see my post Open-sourcing my 3D library WxGL: 40 lines of code to turn an epidemic map into a three-dimensional Earth model.

For matplotlib, refer to one of my posts, The Three Musketeers of Mathematical Modeling: MSN. For Basemap, my recent post Python in Action: Grab real-time pneumonia epidemic data and draw a 2019-nCoV epidemic map contains more detailed application examples.

5.2 Data Mining

Data mining refers to the process of using algorithms to reveal hidden, previously unknown and potentially valuable information from large amounts of data. It is a decision-support process that analyzes data with a high degree of automation, drawing on statistics, databases, visualization technology, artificial intelligence, machine learning, pattern recognition and other techniques.

Next, we briefly demonstrate curve fitting, taking the number of NCP cases confirmed daily across the country as an example. Curve fitting is mostly used for trend prediction; the commonly used methods are least-squares curve fitting and fitting against an objective function.
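The full demo below uses objective-function fitting via scipy's curve_fit; as a side note, a plain least-squares polynomial fit can be sketched with numpy alone (the data here is invented, not the epidemic series):

# A hedged sketch of least-squares polynomial fitting with numpy; toy data only.
import numpy as np

x = np.arange(10)
y = 3 * x ** 2 + 2 * x + 1 + np.random.randn(10)  # noisy quadratic samples

coeffs = np.polyfit(x, y, deg=2)   # least-squares fit of a degree-2 polynomial
poly = np.poly1d(coeffs)           # callable polynomial built from the coefficients
print(coeffs, poly(10))            # fitted coefficients and an extrapolated value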

# -*- coding: utf-8 -*-

import time, json, requests
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

plt.rcParams['font.sans-serif'] = ['FangSong']  # Set the default font (FangSong, so Chinese labels display correctly)
plt.rcParams['axes.unicode_minus'] = False  # Fix the minus sign '-' rendering as a square in saved images

def get_day_list():
    """Get Daily Data"""
    
    url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=&_=%d'%int(time.time()*1000)
    data = json.loads(requests.get(url=url).json()['data'])['chinaDayList']
    return [(item['date'], item['confirm']) for item in data]

def fit_exp():
    """fitting"""
    
    def func(x, a, b):
        return np.power(a, (x+b)) # Exponential function y = a^(x+b)
        
    _date, _y = zip(*get_day_list())
    _x = np.arange(len(_y))
    x = np.arange(len(_y)+1)
    
    fita, fitb = optimize.curve_fit(func, _x, _y, (2,0))
    y = func(x, fita[0], fita[1]) # fita holds the optimal parameters (a, b); fitb is their covariance matrix
    
    plt.plot(_date, _y, label='Raw data')
    plt.plot(x, y, label='$%0.3f^{x+%0.3f}$'%(fita[0], fita[1]))
    plt.legend(loc='upper left')
    plt.gcf().autofmt_xdate() # Optimized labeling (automatic tilting)
    plt.grid(linestyle=':') # show grid
    plt.show()

if __name__ == '__main__':
    fit_exp()

The fitting results are as follows:

At present, the whole country is concentrating on fighting the virus and the epidemic has gradually stabilized, so a fit with an exponential target function will deviate more and more over time; in the early stage, however, this fitting method has some reference value for trend prediction.

5.3 Data Services

A data service, simply put, provides data for crawlers to grab. Python has many mature web frameworks, such as Django, Tornado and Flask, that make it easy to implement data services. Of course, besides the service framework, a data service also needs a database. Due to time constraints, here is a simple demonstration of the most economical data server:

PS D:\XufiveGit\2020Pyday\fb> python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

Try it: open http://localhost:8000 in a browser and the browser becomes a file browser.
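If you prefer a real service framework over http.server, a minimal Flask sketch might look like the following; the route and the file name ncp.csv are my own assumptions, not part of the original talk:

# A minimal Flask data service sketch; the route and file name are assumptions.
from flask import Flask, Response

app = Flask(__name__)

@app.route('/ncp.csv')
def ncp_csv():
    # Return the contents of a local csv file as plain text
    with open('ncp.csv', 'r') as fp:
        return Response(fp.read(), mimetype='text/csv')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)  # crawlers can now fetch http://<host>:8000/ncp.csv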

6. Scheduling Service Framework

As mentioned earlier, a basic crawler framework consists of at least three components: a dispatch server, a data downloader and a data processor. We'll use these three parts to demonstrate a minimal crawler framework.

6.1 Scheduling Service Module

APScheduler, short for Advanced Python Scheduler, is my favorite module for scheduling services. It is a lightweight yet very powerful Python task scheduling framework. APScheduler offers several kinds of triggers; the code in 6.2 uses the cron trigger, the most complex trigger in APScheduler, which supports cron-like syntax and allows very elaborate schedules to be set. A minimal example follows.
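As a minimal, standalone illustration of the cron trigger (the job just prints a line, and the schedule values are examples rather than the settings used in 6.2):

# A minimal cron-trigger sketch with APScheduler; the schedule values are examples.
from apscheduler.schedulers.blocking import BlockingScheduler

def tick():
    print('tick')

scheduler = BlockingScheduler()
# Run at minute 0 and 30 of every hour, Monday to Friday
scheduler.add_job(tick, trigger='cron', day_of_week='mon-fri', minute='0,30')
scheduler.start()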

6.2 Mini Crawler Frame

The entire code, comments included, runs to little more than fifty lines, yet it captures data from Tencent's epidemic data service at a configurable interval (every minute in the code below), parses it, and saves the result in csv format.

# -*- coding: utf-8 -*-

import os, time, json, requests
import multiprocessing as mp
from apscheduler.schedulers.blocking import BlockingScheduler

def data_obtain():
    """get data"""
    
    url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=&_=%d'%int(time.time()*1000)
    with open('fb/ncp.txt', 'w') as fp:
        fp.write(requests.get(url=url).json()['data'])
    
    print('Obtain OK')

def data_process():
    """Processing data"""
    
    while True:
        if os.path.isfile('fb/ncp.txt'):
            with open('fb/ncp.txt', 'r') as fp:
                data = json.loads(fp.read())
            
            with open('fb/ncp.csv', 'w') as fp:
                for p in data['areaTree'][0]['children']:
                    fp.write('%s,%d,%d,%d,%d\n'%(p['name'], p['total']['confirm'], p['total']['suspect'], p['total']['dead'], p['total']['heal']))
            
            os.remove('fb/ncp.txt')
            print('Process OK')
        else:
            print('No data file')
        
        time.sleep(10)

if __name__ == '__main__':
    # Create and start a data processing subprocess
    p_process = mp.Process(target=data_process) # Create Data Processing Subprocess
    p_process.daemon = True  # Set child process as daemon
    p_process.start() # Start Data Processing Subprocess
    
    # Create Scheduler
    scheduler = BlockingScheduler() 
    
    # Add Task
    scheduler.add_job(
        data_obtain,            # Tasks for getting data
        trigger = 'cron',       # Set trigger to cron     
        minute = '*/1',         # Set to execute every minute
        misfire_grace_time = 30 # Abandon execution if this job is not executed in 30 seconds
    )
    
    # Start Scheduling Service
    scheduler.start() 
