How to Crawl Real-Time WebSocket Data with Python

I. Preface

As crawler engineers, we often need to crawl real-time data at work, such as live sports scores, stock market quotes, or cryptocurrency price changes.


On the Web, polling and WebSocket are two ways to update data in real time. With polling, the client requests a server endpoint at a fixed interval (e.g. every second) to achieve a "real-time" effect. The data appears to update in real time, but each update can lag by up to one interval, so it is not truly real-time. Polling is a pull model: the client takes the initiative to fetch data from the server.

WebSocket uses a push model, which is truly real-time: the server actively pushes data to the client.
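To make the difference concrete, here is a minimal sketch of the polling side. The names poll, fetch, and interval are hypothetical and do not come from any library; fetch stands in for whatever HTTP call a real client would make.

```python
import time
from typing import Callable, List


def poll(fetch: Callable[[], str], interval: float, rounds: int,
         sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fetch() `rounds` times, waiting `interval` seconds between calls."""
    seen = []
    for i in range(rounds):
        seen.append(fetch())   # the client pulls; the server never pushes
        if i < rounds - 1:
            sleep(interval)    # changes during this wait stay invisible until the next pull
    return seen


if __name__ == '__main__':
    # Hypothetical stand-in for an HTTP request: the price changes between polls.
    prices = iter(['55.42', '55.50', '55.59', '55.61', '55.63'])

    def fetch() -> str:
        next(prices)           # an update that happened while the client was waiting
        return next(prices)    # the value that is current at poll time

    print(poll(fetch, interval=0.1, rounds=2))
```

Every second price in the demo is never seen by the client, which is exactly the gap that a push model closes.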

II. What is WebSocket

WebSocket is a protocol for full-duplex communication over a single TCP connection. It simplifies data exchange between client and server and allows the server to push data to the client actively. With the WebSocket API, the browser and the server complete a single handshake, after which they hold a persistent connection and can transmit data in both directions.

Advantages of WebSocket

  • Less control overhead: the handshake happens only once, so the full request headers are sent only once; after that, only data frames are transmitted. Compared with repeated HTTP requests, WebSocket uses far fewer resources.
  • More real-time: because the server can push messages actively, latency is very low, and WebSocket can deliver several messages in the time an HTTP polling client waits for its next interval.
  • Binary support: WebSocket supports binary frames, so binary data can be transmitted without being encoded as text.
  • ......
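The "less control overhead" point can be quantified. Per RFC 6455, a WebSocket frame header is 2 bytes, plus 2 or 8 bytes of extended length for larger payloads, plus a 4-byte masking key on client-to-server frames. A small sketch of that rule follows; the function name frame_header_size is made up for illustration.

```python
def frame_header_size(payload_len: int, masked: bool) -> int:
    """Return the WebSocket frame header size in bytes (RFC 6455, section 5.2)."""
    size = 2              # FIN/opcode byte + mask bit and 7-bit length byte
    if payload_len > 65535:
        size += 8         # 64-bit extended payload length
    elif payload_len > 125:
        size += 2         # 16-bit extended payload length
    if masked:
        size += 4         # masking key, required on client-to-server frames
    return size


if __name__ == '__main__':
    for length in (47, 300, 70000):
        print(length, frame_header_size(length, masked=False))
```

A message of a few hundred bytes pushed by the server therefore costs only a handful of framing bytes, whereas every HTTP polling request repeats its full set of headers.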

Crawlers: HTTP vs. WebSocket

There are many network request libraries in Python, and Requests is one of the most commonly used. It can simulate sending network requests, but those requests are based on the HTTP protocol. Requests is of no use against WebSocket, so a library that can connect to WebSocket must be used instead.

III. Crawling Approach

Here we use the real-time data on the Litecoin website http://www.laiteb.com as an example. The WebSocket handshake happens only once, so to observe it you must open the browser developer tools first, switch to the Network tab, and then load or refresh the page. Only then are the handshake request and the data transmission of the WebSocket recorded. Take Chrome as an example:

The developer tools provide a filter, in which the WS option shows only the network requests of WebSocket connections.

At this point, a record named realTime appears in the request list. Click it, and the developer tools split into two columns, with the details of the request shown on the right:

Unlike HTTP requests, WebSocket connection addresses start with ws or wss. The status code of a successful connection is not 200 but 101.
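The 101 status is only part of a valid handshake. Under RFC 6455, the client sends a random Sec-WebSocket-Key header, and the server must answer with a Sec-WebSocket-Accept header derived from it. The sketch below shows that derivation. The function name expected_accept is made up for illustration, and the code is not taken from any particular library's source.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11'


def expected_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value a server must send for this key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode('ascii')).digest()
    return base64.b64encode(digest).decode('ascii')


if __name__ == '__main__':
    # Sample key used in RFC 6455 itself.
    print(expected_accept('dGhlIHNhbXBsZSBub25jZQ=='))
```

A client that checks this value knows the peer really speaks WebSocket rather than merely returning a 101 response.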

The Headers tab records the Request and Response information, while the Frames tab records the data transferred between the two sides, which is also the data content we need to crawl.

In the Frames view, a green up arrow marks data sent by the client to the server, while an orange down arrow marks data pushed by the server to the client.

As you can see from the data sequence, the client sends first:

{"action":"subscribe","args":["QuoteBin5m:14"]}

Then the server pushes messages continuously:

{"group":"QuoteBin5m:14","data":[{"low":"55.42","high":"55.63","open":"55.42","close":"55.59","last_price":"55.59","avg_price":"55.5111587372932781077","volume":"40078","timestamp":1551941701,"rise_fall_rate":"0.0030674846625766871","rise_fall_value":"0.17","base_coin_volume":"400.78","quote_coin_volume":"22247.7621987324"}]}
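Each pushed message is a JSON document, so it can be decoded with the standard library. The following is a sketch under assumptions: the Quote class and the parse_push function are hypothetical names, only a few of the fields are kept, and the timestamp is assumed to be in Unix seconds, which the source does not state.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import List


@dataclass
class Quote:
    group: str
    open: Decimal
    close: Decimal
    high: Decimal
    low: Decimal
    time: datetime


def parse_push(raw) -> List[Quote]:
    """Decode one pushed message (str or bytes) into a list of Quote records."""
    message = json.loads(raw)
    return [
        Quote(
            group=message['group'],
            # Prices arrive as strings; Decimal keeps them exact.
            open=Decimal(item['open']),
            close=Decimal(item['close']),
            high=Decimal(item['high']),
            low=Decimal(item['low']),
            # Assumption: the timestamp is seconds since the Unix epoch.
            time=datetime.fromtimestamp(item['timestamp'], tz=timezone.utc),
        )
        for item in message['data']
    ]
```

Keeping the parsing in one function means the receive loop only has to pass along whatever the connection returns.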

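Therefore, the whole process from initiating the handshake to obtaining the data is as follows: the client initiates the handshake, the server confirms it, the client sends the subscription message, and the server then keeps pushing data.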

This raises several questions:

  • How is the handshake performed?
  • How is the connection kept alive?
  • How are messages sent and received?
  • Is there a library that makes all of this easy?

IV. aiowebsocket

There are many Python libraries for connecting to WebSocket, but the easy-to-use and stable ones are websocket-client (synchronous), websockets (asynchronous), and aiowebsocket (asynchronous).

You can choose any of the three according to your project's requirements. Here we introduce aiowebsocket, an asynchronous WebSocket client. Its GitHub address is https://github.com/asyncins/aiowebsocket.

According to its README, aiowebsocket is an asynchronous WebSocket client that follows the WebSocket specification and is lighter and faster than other libraries.

It is installed like any other library, with pip install aiowebsocket. After installation, we can test it with the sample code provided in the README:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        message = b'AioWebSocket - Async WebSocket Client'
        while True:
            await converse.send(message)
            print('{time}-Client send: {message}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), message=message))
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))


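if __name__ == '__main__':
    # Without this, logging.info() below prints nothing (the default level is WARNING)
    logging.basicConfig(level=logging.INFO)
    remote = 'ws://echo.websocket.org'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt:
        # Ctrl+C stops the endless send/receive loop
        logging.info('Quit.')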

Running it produces output like the following:

2019-03-07 15:43:55-Client send: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:55-Client receive: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:55-Client send: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:56-Client receive: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:56-Client send: b'AioWebSocket - Async WebSocket Client'
......

send is a message sent by the client to the server.

receive is a message pushed by the server to the client.

V. Coding for Data Acquisition

Back to the crawling task: the target is the official Litecoin website.

From the network request record above, we know that the WebSocket address of the target website is wss://api.bbxapp.vip/v1/ifcontract/realTime. The address shows that the site uses wss, the secure version of ws; the two are related in the same way as HTTPS and HTTP. aiowebsocket recognizes and handles SSL automatically, so no extra work is needed. We just pass the target address as the connection uri:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        while True:
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))


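if __name__ == '__main__':
    # Without this, logging.info() below prints nothing (the default level is WARNING)
    logging.basicConfig(level=logging.INFO)
    remote = 'wss://api.bbxapp.vip/v1/ifcontract/realTime'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt:
        # Ctrl+C stops the endless receive loop
        logging.info('Quit.')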

When you run the code, you will find that nothing happens: there is no output and no disconnection. The program keeps running, but prints nothing.

Why is that?

Did the server reject our request?

Or is there some anti-crawler restriction?

In fact, the process described earlier explains the problem:

One step in that process is that the client must send a specific message to the server; only after the server has verified it does the server start pushing data. Therefore, the code that sends this message should be added after the handshake and before the messages are read:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        # Client sends message to server
        await converse.send('{"action":"subscribe","args":["QuoteBin5m:14"]}')
        while True:
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))


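if __name__ == '__main__':
    # Without this, logging.info() below prints nothing (the default level is WARNING)
    logging.basicConfig(level=logging.INFO)
    remote = 'wss://api.bbxapp.vip/v1/ifcontract/realTime'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt:
        # Ctrl+C stops the endless receive loop
        logging.info('Quit.')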

Save and run it, and you will see the data being pushed continuously.

At this point, the crawler can obtain the data it needs.
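Writing the subscription message by hand works, but building it with json.dumps avoids quoting mistakes when the channel changes. A small sketch, in which build_subscribe is a hypothetical helper and the message shape is copied from the frame observed in the browser:

```python
import json


def build_subscribe(channel: str) -> str:
    """Build the subscription message observed in the browser's Frames tab."""
    payload = {'action': 'subscribe', 'args': [channel]}
    # Compact separators reproduce the exact text the browser client sent.
    return json.dumps(payload, separators=(',', ':'))


if __name__ == '__main__':
    print(build_subscribe('QuoteBin5m:14'))
```

The returned string can be passed to converse.send() in place of the literal. Which other channel names the server accepts is not covered by the source, so any value other than QuoteBin5m:14 would be a guess.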

What did aiowebsocket do?

The code is not long: you only need to fill in the WebSocket address of the target website and then send data according to the process. So what does aiowebsocket do in this process?

  • First, aiowebsocket sends a handshake request to the server specified by the WebSocket address and verifies the handshake result.
  • Then, once the handshake is confirmed to have succeeded, it sends the data to the server.
  • To keep the connection open, aiowebsocket automatically exchanges ping/pong frames with the server.
  • Finally, aiowebsocket reads the messages pushed by the server.
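The ping/pong step can be illustrated at the frame level. The sketch below follows the frame layout in RFC 6455 and is not aiowebsocket's actual implementation; all function names are hypothetical, and it only handles single, short frames (payloads up to 125 bytes, which is also the limit for control frames).

```python
import os
from typing import Optional, Tuple

OP_TEXT, OP_CLOSE, OP_PING, OP_PONG = 0x1, 0x8, 0x9, 0xA


def build_client_frame(opcode: int, payload: bytes,
                       mask_key: Optional[bytes] = None) -> bytes:
    """Build one final, masked client frame with a payload of at most 125 bytes."""
    if len(payload) > 125:
        raise ValueError('this sketch only supports short payloads')
    if mask_key is None:
        mask_key = os.urandom(4)                     # clients must mask every frame
    header = bytes([0x80 | opcode, 0x80 | len(payload)])  # FIN bit set, MASK bit set
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return header + mask_key + masked


def parse_server_frame(frame: bytes) -> Tuple[int, bytes]:
    """Parse one short, unmasked server frame into (opcode, payload)."""
    opcode = frame[0] & 0x0F
    length = frame[1] & 0x7F
    if frame[1] & 0x80 or length > 125:
        raise ValueError('this sketch only supports short unmasked frames')
    return opcode, frame[2:2 + length]


def reply_to(frame: bytes, mask_key: Optional[bytes] = None) -> Optional[bytes]:
    """Return the pong frame answering a ping, or None for any other frame."""
    opcode, payload = parse_server_frame(frame)
    if opcode == OP_PING:
        # A pong must carry the same payload as the ping it answers.
        return build_client_frame(OP_PONG, payload, mask_key)
    return None
```

A client library runs logic of this kind in the background, which is why the crawler code above never has to mention ping or pong.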


Author: Night Quin
 


Posted on Wed, 11 Sep 2019 20:54:17 -0700 by BillyBoB