Python 3 Web Crawler in Practice 38: Scraping Dynamically Rendered Pages with Splash

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python on top of the Twisted and Qt libraries. With it we can also scrape dynamically rendered pages.

1. Functional introduction

With Splash we can do the following:

  • Render multiple web pages asynchronously
  • Get the source code or a screenshot of the rendered page
  • Speed up rendering by turning off image loading or applying Adblock rules
  • Execute custom JavaScript in the page
  • Control the page rendering process with Lua scripts
  • Get detailed rendering information in HAR (HTTP Archive) format

Next let's look at its specific use.

2. Preparations

Before beginning this section, make sure Splash is properly installed and the service is functioning properly. If it is not installed, refer to the installation instructions in Chapter 1.

3. Instance introduction

First, we can test Splash's rendering process with the web page that Splash itself provides. For example, with the Splash service running on port 8050, open http://localhost:8050/ and you will see its web page, as shown in Figure 7-6:

Figure 7-6 Web Page
On the right is a rendering example. The input box at the top contains http://google.com by default. Here we test with Baidu instead: change the content to https://www.baidu.com and click the button to start rendering. The result is shown in Figure 7-7:


Figure 7-7 Running Results
The returned result contains a rendered screenshot, HAR load statistics, and the source code of the page.
The HAR result shows that Splash carried out the whole rendering process, including loading CSS and JavaScript, and the rendered page is the same as what we get in a browser.
So what controls this process? If we go back to the home page, we can see that there is actually a script there, which reads as follows:

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

This script is written in Lua, a programming language that is simple to use.
Even without knowing Lua's syntax, we can roughly see what the script does: it first calls the go() method to load the page, then calls wait() to wait for a moment, and finally returns the page source, a screenshot, and the HAR information.
This gives us the general idea: Splash controls the page loading process through a Lua script, the loading process fully mimics a browser, and the results can be returned in various formats, such as source code and screenshots.
So to use Splash we need to learn two things: how to write Lua scripts, and how to use the related APIs. Let's look at these two parts in turn.

4. Splash Lua script

Splash can perform a series of rendering operations through Lua scripts, so we can use it to simulate the kinds of operations we would perform with Chrome or PhantomJS.
Let's start with the basics of Splash Lua scripts: how they are entered and executed.

Entry and Return Values

Let's start with a basic example:

function main(splash, args)
  splash:go("http://www.baidu.com")
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end

Paste this code into the code editing area of the page we just opened, http://localhost:8050/, and click the button to test it.
The script returns the title of the page. We pass JavaScript code to the evaljs() method; evaluating document.title yields the page title, which we assign to the variable title and then return. The result is shown in Figure 7-8:


Figure 7-8 Running Results
Note that the function we define here must be named main(); the name is fixed, and Splash calls this function by default.
The return value can be either a dictionary (a Lua table) or a string, and it is finally converted into a Splash HTTP response, for example:

function main(splash)
    return {hello="world!"}
end

This returns content in the form of a dictionary.

function main(splash)
    return 'hello'
end

This returns a string of content, which is also possible.

Asynchronous Processing

Splash supports asynchronous processing, but we do not specify callback functions explicitly; the switching between tasks is handled inside Splash. Let's start with an example:

function main(splash, args)
  local example_urls = {"www.baidu.com", "www.taobao.com", "www.zhihu.com"}
  local urls = args.urls or example_urls
  local results = {}
  for index, url in ipairs(urls) do
    local ok, reason = splash:go("http://" .. url)
    if ok then
      splash:wait(2)
      results[url] = splash:png()
    end
  end
  return results
end

The result contains screenshots of the three sites, as shown in Figure 7-9:

Figure 7-9 Running Results
The script calls the wait() method, which is similar to sleep() in Python; its parameter is the number of seconds to wait. When Splash executes this method it switches to other tasks and comes back later to continue processing.
Note that, unlike Python, string concatenation in Lua uses the .. operator rather than +. If necessary, take a quick look at Lua syntax first, for example at: http://www.runoob.com/lua/lua....
In addition, we check for errors during loading here. The go() method returns the status of the page load. If the page returns a 4XX or 5XX status code, the ok variable is nil and no screenshot is stored for that page.
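As a rough analogy only (this is plain Python, not how Splash is implemented), the "switch to other tasks while waiting" behaviour can be sketched with asyncio. The `render` coroutine and its names are hypothetical stand-ins for two scripts that each wait for a different time.

```python
import asyncio


async def render(name, delay, log):
    """Hypothetical stand-in for one script: start, wait, finish."""
    log.append(f"{name}: start")
    # Like splash:wait(), awaiting a sleep hands control to other tasks.
    await asyncio.sleep(delay)
    log.append(f"{name}: done")


async def run_demo():
    log = []
    # Both coroutines share one event loop and interleave at the waits.
    await asyncio.gather(render("a", 0.2, log), render("b", 0.1, log))
    return log


events = asyncio.run(run_demo())
print(events)
```

Task "b" finishes first even though "a" started first, because neither wait blocks the other. A blocking time.sleep() would not behave this way.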

5. Splash object properties

In the previous examples, the first argument of main() is splash. This object is very important and is similar to the WebDriver object in Selenium:

from selenium import webdriver
browser = webdriver.Chrome()

The splash object plays the same role as the browser object in Selenium above. We can call its properties and methods to control the loading process. Let's look at its properties first.

args

The args property of the splash object holds the parameters configured at load time, such as the URL to load. For a GET request it also contains the GET request parameters, and for a POST request it contains the submitted form data. Splash also passes args directly as the second parameter of main(), for example:

function main(splash, args)
    local url = args.url
end

Here the second parameter args is equivalent to the splash.args attribute, and the code above is equivalent to:

function main(splash)
    local url = splash.args.url
end
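To see where args comes from on the client side, here is a minimal Python sketch. It assumes Splash's /execute endpoint and its lua_source parameter (taken from the Splash documentation, not from the text above) and a service on localhost:8050. The helper name build_execute_url is hypothetical. The sketch only builds and decodes the request URL and does not contact Splash.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: Splash is listening on localhost:8050, as elsewhere in this article.
SPLASH_EXECUTE = "http://localhost:8050/execute"

LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  return {url = args.url}
end
"""


def build_execute_url(lua_source, **script_args):
    """Encode the script and its arguments into one query string (hypothetical helper)."""
    params = {"lua_source": lua_source, **script_args}
    return SPLASH_EXECUTE + "?" + urlencode(params)


request_url = build_execute_url(LUA_SOURCE, url="https://www.baidu.com")
# Decode the query string again to inspect the parameters being sent.
received = parse_qs(urlsplit(request_url).query)
print(received["url"])
```

Every query parameter other than the script itself, such as url here, is what the Lua code reads through args.url or splash.args.url.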

js_enabled

This property is the switch for JavaScript execution in Splash. We can set it to true or false to control whether JavaScript code is executed; it defaults to true. For example, here we disable JavaScript execution:

function main(splash, args)
  splash:go("https://www.baidu.com")
  splash.js_enabled = false
  local title = splash:evaljs("document.title")
  return {title=title}
end

With JavaScript disabled, calling the evaljs() method to execute JavaScript code throws an exception, and the result is:

{
    "error": 400,
    "type": "ScriptError",
    "info": {
        "type": "JS_ERROR",
        "js_error_message": null,
        "source": "[string \"function main(splash, args)\r...\"]",
        "message": "[string \"function main(splash, args)\r...\"]:4: unknown JS error: None",
        "line_number": 4,
        "error": "unknown JS error: None",
        "splash_method": "evaljs"
    },
    "description": "Error happened while executing Lua script"
}

In general, however, we do not need to set this property; just leave it enabled, which is the default.

resource_timeout

This property sets the timeout for loading, in seconds. Setting it to 0 or nil (like None in Python) disables timeout detection. Let's take a look at an example:

function main(splash)
    splash.resource_timeout = 0.1
    assert(splash:go('https://www.taobao.com'))
    return splash:png()
end

Here we set the timeout to 0.1 seconds. If no response is received within 0.1 seconds, an exception is thrown. The error is as follows:

{
    "error": 400,
    "type": "ScriptError",
    "info": {
        "error": "network5",
        "type": "LUA_ERROR",
        "line_number": 3,
        "source": "[string \"function main(splash)\r...\"]",
        "message": "Lua error: [string \"function main(splash)\r...\"]:3: network5"
    },
    "description": "Error happened while executing Lua script"
}

This property is useful when a page loads slowly: if there is no response within the given time, an exception is thrown and the page can be skipped.
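When such errors are handled from Python, it helps to reduce the JSON error body to one line. The following is a minimal sketch: the payload is an abridged copy of the error shown above, and summarize_splash_error is a hypothetical helper, not part of any Splash client library.

```python
import json

# Error body copied from the resource_timeout example above (abridged).
ERROR_BODY = """
{
    "error": 400,
    "type": "ScriptError",
    "info": {
        "error": "network5",
        "type": "LUA_ERROR",
        "line_number": 3
    },
    "description": "Error happened while executing Lua script"
}
"""


def summarize_splash_error(payload):
    """Return a one-line summary of a Splash error payload (hypothetical helper)."""
    info = payload.get("info", {})
    return "{} {} at line {}: {}".format(
        payload.get("type", "UnknownError"),
        info.get("type", "?"),
        info.get("line_number", "?"),
        # Fall back to the general description when no specific error is given.
        info.get("error", payload.get("description", "")),
    )


summary = summarize_splash_error(json.loads(ERROR_BODY))
print(summary)
```

The same helper also works for the JS_ERROR payload shown for js_enabled, since both share the type, info, and description fields.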

images_enabled

This property controls whether images are loaded; they are loaded by default. Disabling images saves network traffic and speeds up page loading. Note, however, that disabling images may affect JavaScript rendering: without images, the height of the enclosing DOM nodes changes, which in turn changes the position of other DOM nodes. If JavaScript code depends on those values its execution will be affected, although in general this is not a problem.
It is also worth noting that Splash uses a cache. If you load a page with images first, then disable image loading and reload the page, the previously loaded images may still appear. Restarting Splash solves this problem.
An example of disabling image loading is as follows:

function main(splash, args)
  splash.images_enabled = false
  assert(splash:go('https://www.jd.com'))
  return {png=splash:png()}
end

This returns a screenshot of the page without any images, and the page loads much faster.

plugins_enabled

This property controls whether browser plug-ins, such as Flash, are enabled. It defaults to false (disabled) and can be switched on or off with the following code:

splash.plugins_enabled = true/false

scroll_position

This property controls the scroll offset of the page. By setting it we can scroll the page up, down, left, or right, which makes it a commonly used property. Let's look at an example:

function main(splash, args)
  assert(splash:go('https://www.taobao.com'))
  splash.scroll_position = {y=400}
  return {png=splash:png()}
end

This scrolls the page down by 400 pixels, as shown in Figure 7-10:


Figure 7-10 Running Results
To scroll horizontally, pass the x parameter as well:

splash.scroll_position = {x=100, y=200}

6. Splash object methods

go()

The go() method requests a URL. It can simulate GET and POST requests and accepts data such as request headers and form data. Its usage is as follows:

ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}

The parameters are described as follows:

  • url: the URL to request.
  • baseurl: optional, empty by default; the base URL used to resolve relative resource paths.
  • headers: optional, empty by default; the request headers.
  • http_method: optional, defaults to GET; POST is also supported.
  • body: optional, empty by default; the raw request body for a POST request, for example a JSON string sent with a Content-Type of application/json.
  • formdata: optional, empty by default; the form data for a POST request, sent with a Content-Type of application/x-www-form-urlencoded.

The method returns a pair: the result ok and the reason reason. If ok is nil, an error occurred while loading the page and reason holds the cause; otherwise the page loaded successfully. An example is as follows:

function main(splash, args)
  local ok, reason = splash:go{"http://httpbin.org/post", http_method="POST", body="name=Germey"}
  if ok then
        return splash:html()
  end
end

Here we simulate a POST request and pass in the POST form data, which, if successful, returns the page source code.
The results are as follows:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en,*", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "Origin": "null", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1"
  }, 
  "json": null, 
  "origin": "60.207.237.85", 
  "url": "http://httpbin.org/post"
}
</pre></body></html>

The result shows that we successfully implemented the POST request and sent the form data.

wait()

This method controls how long the page waits. Its usage is as follows:

ok, reason = splash:wait{time, cancel_on_redirect=false, cancel_on_error=true}

The parameters are described as follows:

  • time: the number of seconds to wait.
  • cancel_on_redirect: optional, defaults to false; if true, stop waiting when a redirect occurs and return the redirect result.
  • cancel_on_error: optional, defaults to true; if true, stop waiting when a load error occurs.

The returned result is again a pair: the result ok and the reason reason.
Let's take an example:

function main(splash)
    splash:go("https://www.taobao.com")
    splash:wait(2)
    return {html=splash:html()}
end

This visits Taobao and waits 2 seconds before returning the page source.

jsfunc()

This method wraps a JavaScript function so that it can be called directly from Lua. The JavaScript source is passed as a string, conveniently written between double square brackets ([[ ]], Lua's long string syntax). In effect it converts a JavaScript function into a Lua function. An example is as follows:

function main(splash, args)
  local get_div_count = splash:jsfunc([[
  function () {
    var body = document.body;
    var divs = body.getElementsByTagName('div');
    return divs.length;
  }
  ]])
  splash:go("https://www.baidu.com")
  return ("There are %s DIVs"):format(
    get_div_count())
end

Run result:

There are 21 DIVs

First we declared a function, then called it after the page had loaded to count the div nodes in the page.
So far, however, we have only used the web page that Splash provides; in practice we would more often complete JavaScript rendering through the HTTP API it exposes.
More details on converting JavaScript functions to Lua can be found in the official documentation: https://splash.readthedocs.io....

evaljs()

This method executes JavaScript code and returns the result of the last statement. Its usage is as follows:

result = splash:evaljs(js)

For example, we can use the following code to get the title of a page:

local title = splash:evaljs("document.title")

runjs()

Like evaljs(), this method executes JavaScript code, but it is meant for performing actions or declaring functions, whereas evaljs() is meant for obtaining a result. For example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  splash:runjs("foo = function() { return 'bar' }")
  local result = splash:evaljs("foo()")
  return result
end

Here we use runjs() to declare a JavaScript function and then call it with evaljs() to get the result.
The results are as follows:

bar

autoload()

This method sets JavaScript that is loaded automatically every time a page is visited. Its usage is as follows:

ok, reason = splash:autoload{source_or_url, source=nil, url=nil}

The parameters are described as follows:

  • source_or_url: JavaScript code or the URL of a JavaScript library.
  • source: JavaScript code.
  • url: the URL of a JavaScript library.

However, this method only loads the JavaScript code or library; it does not execute any action. To perform an action, call evaljs() or runjs(), as in the following example:

function main(splash, args)
  splash:autoload([[
    function get_document_title(){
      return document.title;
    }
  ]])
  splash:go("https://www.baidu.com")
  return splash:evaljs("get_document_title()")
end

Here we call autoload() to declare a JavaScript method, and then we call it through evaljs().
Run result:

Baidu once, you know

We can also load some method libraries, such as jQuery, as shown below:

function main(splash, args)
  assert(splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js"))
  assert(splash:go("https://www.taobao.com"))
  local version = splash:evaljs("$.fn.jquery")
  return 'JQuery version: ' .. version
end

Run result:

JQuery version: 2.1.3

call_later()

This method schedules a task to run after a given delay. It returns a timer object, and the task can be cancelled before it runs by calling the timer's cancel() method. An example is as follows:

function main(splash, args)
  local snapshots = {}
  local timer = splash:call_later(function()
    snapshots["a"] = splash:png()
    splash:wait(1.0)
    snapshots["b"] = splash:png()
  end, 0.2)
  splash:go("https://www.taobao.com")
  splash:wait(3.0)
  return snapshots
end

Here we schedule a task that starts after 0.2 seconds: it takes a screenshot, waits 1 second, and takes a second screenshot at about 1.2 seconds. Meanwhile the main script visits Taobao and waits 3 seconds, and finally returns both screenshots.
The results are shown in Figure 7-11:


Figure 7-11 Running Results
We can see that in the first screenshot the page has not loaded yet, so the screenshot is empty, while in the second screenshot the page has loaded successfully.

http_get()

This method sends an HTTP GET request. Its usage is as follows:

response = splash:http_get{url, headers=nil, follow_redirects=true}

The parameters are described as follows:

  • url: the request URL.
  • headers: optional, empty by default; the request headers.
  • follow_redirects: optional, defaults to true; whether to follow redirects automatically.

Let's take a look at an example:

function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end

Run result:

Splash Response: Object
html: String (length 355)
{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en,*", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1"
  }, 
  "origin": "60.207.237.85", 
  "url": "http://httpbin.org/get"
}
status: 200
url: "http://httpbin.org/get"

http_post()

Similar to http_get(), this method sends a POST request and takes one additional parameter, body. Its usage is as follows:

response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil}

The parameters are described as follows:

  • url: the request URL.
  • headers: optional, empty by default; the request headers.
  • follow_redirects: optional, defaults to true; whether to follow redirects automatically.
  • body: optional, empty by default; the request body, such as form data.

Let's take an example:

function main(splash, args)
  local treat = require("treat")
  local json = require("json")
  local response = splash:http_post{"http://httpbin.org/post",
    body=json.encode({name="Germey"}),
    headers={["content-type"]="application/json"}
  }
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end

Run result:

Splash Response: Object
html: String (length 533)
{
  "args": {}, 
  "data": "{\"name\": \"Germey\"}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en,*", 
    "Connection": "close", 
    "Content-Length": "18", 
    "Content-Type": "application/json", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1"
  }, 
  "json": {
    "name": "Germey"
  }, 
  "origin": "60.207.237.85", 
  "url": "http://httpbin.org/post"
}
status: 200
url: "http://httpbin.org/post"

You can see that we successfully simulated a POST request and sent the JSON data.
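The two POST examples send the same field in two encodings, which explains the different Content-Length values in their outputs (11 in the go() example, 18 here). A small Python sketch reproduces both bodies. Python's json.dumps stands in for the Lua json.encode call, on the assumption that both produce the serialization shown in the "data" field above.

```python
import json
from urllib.parse import urlencode

fields = {"name": "Germey"}

# Form encoding, as sent with application/x-www-form-urlencoded.
form_body = urlencode(fields)
# JSON encoding, as sent with application/json.
json_body = json.dumps(fields)

print(form_body, len(form_body))
print(json_body, len(json_body))
```

httpbin.org reports the form-encoded body under "form" and the JSON body under "data" and "json", which matches the two results shown in this section.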

set_content()

This method can be used to set the content of a page, for example:

function main(splash)
    assert(splash:set_content("<html><body><h1>hello</h1></body></html>"))
    return splash:png()
end

The results are shown in Figure 7-12:


Figure 7-12 Running Results

html()

This method can be used to obtain the source code of a web page. It is a very simple and common method. An example is as follows:

function main(splash, args)
  splash:go("https://httpbin.org/get")
  return splash:html()
end

Run result:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en,*", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1"
  }, 
  "origin": "60.207.237.85", 
  "url": "https://httpbin.org/get"
}
</pre></body></html>
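Because the browser wraps the JSON response in an HTML page, a Python client that receives this source still has to unwrap it. The following is a minimal sketch using only the standard library. The helper name extract_pre_json is hypothetical, and SAMPLE is an abridged copy of the output above.

```python
import html
import json
import re

# Abridged copy of the page source returned by splash:html() above.
SAMPLE = (
    '<html><head></head><body>'
    '<pre style="word-wrap: break-word; white-space: pre-wrap;">{\n'
    '  "args": {}, \n'
    '  "url": "https://httpbin.org/get"\n'
    '}\n'
    '</pre></body></html>'
)


def extract_pre_json(page_source):
    """Parse the JSON text found inside the first <pre> element (hypothetical helper)."""
    match = re.search(r"<pre[^>]*>(.*?)</pre>", page_source, re.DOTALL)
    if match is None:
        raise ValueError("no <pre> element found")
    # Undo HTML escaping such as &amp; before parsing the JSON.
    return json.loads(html.unescape(match.group(1)))


data = extract_pre_json(SAMPLE)
print(data["url"])
```

A regular expression is enough for this fixed wrapper; for arbitrary pages an HTML parser would be the safer choice.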

png()

This method can be used to obtain a screenshot of a web page in PNG format, as shown below:

function main(splash, args)
  splash:go("https://www.taobao.com")
  return splash:png()
end

jpeg()

This method can be used to obtain a screenshot of a web page in JPEG format, as shown below:

function main(splash, args)
  splash:go("https://www.taobao.com")
  return splash:jpeg()
end

har()

This method returns a description of the page loading process in HAR format, for example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  return splash:har()
end

The results are shown in Figure 7-13:


Figure 7-13 Running Results
Details of each request record during page loading are shown here.

url()

This method returns the URL of the current page, for example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  return splash:url()
end

The results are as follows:

https://www.baidu.com/

get_cookies()

This method can get Cookies for the current page, as shown in the following example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  return splash:get_cookies()
end

The results are as follows:

Splash Response: Array[2]
0: Object
domain: ".baidu.com"
expires: "2085-08-21T20:13:23Z"
httpOnly: false
name: "BAIDUID"
path: "/"
secure: false
value: "C1263A470B02DEF45593B062451C9722:FG=1"
1: Object
domain: ".baidu.com"
expires: "2085-08-21T20:13:23Z"
httpOnly: false
name: "BIDUPSID"
path: "/"
secure: false
value: "C1263A470B02DEF45593B062451C9722"
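If these cookies are to be reused by a Python HTTP client, the list of records usually has to be flattened first. The following is a minimal sketch with two hypothetical helpers. The records are copied from the output above with only some fields kept.

```python
# Two cookies copied from the get_cookies() output above (fields abridged).
cookies = [
    {"name": "BAIDUID", "value": "C1263A470B02DEF45593B062451C9722:FG=1",
     "domain": ".baidu.com", "path": "/"},
    {"name": "BIDUPSID", "value": "C1263A470B02DEF45593B062451C9722",
     "domain": ".baidu.com", "path": "/"},
]


def cookies_to_dict(cookie_list):
    """Collapse cookie records into a name -> value mapping (hypothetical helper)."""
    return {cookie["name"]: cookie["value"] for cookie in cookie_list}


def cookies_to_header(cookie_list):
    """Build the value of a Cookie request header from the same records."""
    return "; ".join(f'{c["name"]}={c["value"]}' for c in cookie_list)


cookie_dict = cookies_to_dict(cookies)
cookie_header = cookies_to_header(cookies)
print(cookie_header)
```

The dictionary form can be passed to requests through its cookies= argument. Both helpers drop the domain, path, and expiry information, so they suit quick single-site experiments only.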

add_cookie()

This method adds a Cookie to the current page. Its usage is as follows:

cookies = splash:add_cookie{name, value, path=nil, domain=nil, expires=nil, httpOnly=nil, secure=nil}

The parameters of the method represent the properties of the Cookie.
Examples are as follows:

function main(splash)
    splash:add_cookie{"sessionid", "237465ghgfsd", "/", domain="http://example.com"}
    splash:go("http://example.com/")
    return splash:html()
end

clear_cookies()

This method clears all Cookies, for example:

function main(splash)
    splash:go("https://www.baidu.com/")
    splash:clear_cookies()
    return splash:get_cookies()
end

Here we clear all Cookies, then call get_cookies() and return the result.
Run result:

Splash Response: Array[0]

You can see that the Cookies are completely emptied, with no results.

get_viewport_size()

This method returns the size (width and height) of the current browser viewport, for example:

function main(splash)
    splash:go("https://www.baidu.com/")
    return splash:get_viewport_size()
end

Run result:

Splash Response: Array[2]
0: 1024
1: 768

set_viewport_size()

This method sets the size (width and height) of the current browser viewport. Its usage is as follows:

splash:set_viewport_size(width, height)

For example, here we visit a page with a responsive layout:

function main(splash)
    splash:set_viewport_size(400, 700)
    assert(splash:go("http://cuiqingcai.com"))
    return splash:png()
end

The results are shown in Figure 7-14:


Figure 7-14 Running Results

set_viewport_full()

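This method resizes the browser viewport to fit the whole page, so that a screenshot covers the full page. It should be called after the page has loaded and a short wait has passed, for example:

function main(splash)
    assert(splash:go("http://cuiqingcai.com"))
    assert(splash:wait(0.5))
    splash:set_viewport_full()
    return splash:png()
end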

set_user_agent()

This method can set the browser's User-Agent, for example:

function main(splash)
  splash:set_user_agent('Splash')
  splash:go("http://httpbin.org/get")
  return splash:html()
end

Here we set the browser's User-Agent to Splash. The result is as follows:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en,*", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Splash"
  }, 
  "origin": "60.207.237.85", 
  "url": "http://httpbin.org/get"
}
</pre></body></html>

You can see here that User-Agent was successfully set.

set_custom_headers()

This method can set the requested Headers, as shown in the following example:

function main(splash)
  splash:set_custom_headers({
     ["User-Agent"] = "Splash",
     ["Site"] = "Splash",
  })
  splash:go("http://httpbin.org/get")
  return splash:html()
end

Here we set the User-Agent and Site fields in the headers. The result is as follows:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en,*", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "Site": "Splash", 
    "User-Agent": "Splash"
  }, 
  "origin": "60.207.237.85", 
  "url": "http://httpbin.org/get"
}
</pre></body></html>

You can see that two fields in the result Headers have been successfully set.

select()

The select() method selects the first node that matches the given CSS selector; if several nodes match, only the first is returned. An example is as follows:

function main(splash)
  splash:go("https://www.baidu.com/")
  input = splash:select("#kw")
  input:send_text('Splash')
  splash:wait(3)
  return splash:png()
end

Here we first visit Baidu, select the search box, call send_text() to fill in the text, and then return a screenshot of the page.
The result is shown in Figure 7-15:

Figure 7-15 Running Results
You can see that we successfully filled in the input box.

select_all()

This method selects all nodes that match the given CSS selector. An example is as follows:

function main(splash)
  local treat = require('treat')
  assert(splash:go("http://quotes.toscrape.com/"))
  assert(splash:wait(0.5))
  local texts = splash:select_all('.quote .text')
  local results = {}
  for index, text in ipairs(texts) do
    results[index] = text.node.innerHTML
  end
  return treat.as_array(results)
end

Here we select the text nodes of the quotes with a CSS selector, then iterate over all of them and get their text.
Run result:

Splash Response: Array[10]
0: "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
1: "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
2: "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
3: "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
4: "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
5: "“Try not to become a man of success. Rather become a man of value.”"
6: "“It is better to be hated for what you are than to be loved for what you are not.”"
7: "“I have not failed. I've just found 10,000 ways that won't work.”"
8: "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
9: "“A day without sunshine is like, you know, night.”"

You can see that we successfully retrieved the text of 10 nodes.

mouse_click()

This method simulates a mouse click. It can be called with the coordinates x and y, or directly on a selected node. An example is as follows:

function main(splash)
  splash:go("https://www.baidu.com/")
  input = splash:select("#kw")
  input:send_text('Splash')
  submit = splash:select('#su')
  submit:mouse_click()
  splash:wait(3)
  return splash:png()
end

Here we select the input box and enter the text, then select the submit button and call mouse_click() to submit the query. After waiting three seconds for the page to load, we return a screenshot, as shown in Figure 7-16:


Figure 7-16 Running Results
You can see that we obtained the page content after the query, successfully simulating a Baidu search.
Above we have introduced Splash's common API operations; some APIs are not covered here. For more detailed and authoritative descriptions, see the official documentation: https://splash.readthedocs.io ..., which describes all the API operations of the splash object. The API operations for page elements are documented at: https://splash.readthedocs.io....

7. Splash API calls

In the previous section, we explained how to use Splash Lua scripts, but those scripts were tested and run inside the Splash page. How can we use Splash to render pages from our own programs? How can a Python program use it to grab JavaScript-rendered pages?
Splash actually provides us with some HTTP API interfaces. We only need to request these interfaces and pass the appropriate parameters to get the results of page rendering. Here's a description of these interfaces:

render.html

This interface is used to get the HTML code of a page after JavaScript rendering. The interface address is the address where Splash is running plus the interface name, for example: http://localhost:8050/render.html. We can use curl to test it:

curl http://localhost:8050/render.html?url=https://www.baidu.com

We pass a url parameter to this interface to specify the page to render, and the result returned is the source code of the rendered page.
If implemented in Python, the code is as follows:

import requests
url = 'http://localhost:8050/render.html?url=https://www.baidu.com'
response = requests.get(url)
print(response.text)

In this way, we can successfully output the source code of the rendered Baidu page.
This interface also accepts other parameters. For example, wait specifies the number of seconds to wait; increase it if you want to make sure the page is fully loaded:

import requests
url = 'http://localhost:8050/render.html?url=https://www.taobao.com&wait=5'
response = requests.get(url)
print(response.text)

If this wait time is increased, the response time will be correspondingly longer. For example, here we will wait for about 5 seconds or more to get the source code of the Taobao page rendered by JavaScript.
This interface also supports proxy settings, image loading settings, header settings, and request method settings. For details, see the official documentation: https://splash.readthedocs.io....
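
The examples above paste the target URL straight into the query string, which works only while the target URL has no & or other special characters of its own. A minimal sketch of building the request URL with the standard library's urlencode follows. The helper name build_render_url and the sample target URL are illustrative assumptions, not part of Splash:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SPLASH = 'http://localhost:8050'  # address of the Splash service used in this section


def build_render_url(endpoint, **params):
    """Return a Splash request URL with every parameter percent-encoded.

    build_render_url is a hypothetical helper, not a Splash API.
    """
    return '{}/{}?{}'.format(SPLASH, endpoint, urlencode(params))


# The target URL carries its own query string; urlencode keeps it inside
# the single `url` parameter instead of leaking `b=2` to Splash.
request_url = build_render_url('render.html', url='https://example.com/?a=1&b=2', wait=5)
print(request_url)

# Decoding the query string again shows Splash would see exactly two parameters.
print(parse_qs(urlsplit(request_url).query))
```

The resulting string can be passed to requests.get() exactly as in the examples above.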

render.png

This interface can take a screenshot of a web page with several more parameters than render.html, such as width and height to control width and height, and returns PNG-formatted picture binary data.
Examples are as follows:

curl 'http://localhost:8050/render.png?url=https://www.taobao.com&wait=5&width=1000&height=700'

Here we also passed in width and height to scale the page to 1000x700 pixels.
If implemented in Python, we can save the returned binary data as pictures in PNG format as follows:

import requests

url = 'http://localhost:8050/render.png?url=https://www.jd.com&wait=5&width=1000&height=700'
response = requests.get(url)
with open('jd.png', 'wb') as f:
    f.write(response.content)

The resulting picture is shown in Figure 7-17:


Figure 7-17 Running Results
In this way, we obtained a screenshot of the rendered JD.com (Jingdong) home page. For detailed parameter settings, see the official documentation: https://splash.readthedocs.io....

render.jpeg

This interface is similar to render.png, but it returns picture binary data in JPEG format.
In addition, this interface has one more parameter than render.png: quality, which sets the image quality.

render.har

This interface is used to obtain the HAR data of the page load, as shown in the following example:

curl 'http://localhost:8050/render.har?url=https://www.jd.com&wait=5'

The response is long. It is in JSON format and contains the HAR data recorded during page loading.
The results are shown in Figure 7-18:


Figure 7-18 Running Results

render.json

This interface combines the functions of all the previous render interfaces and returns the result in JSON format. For example:

curl http://localhost:8050/render.json?url=https://httpbin.org

The results are as follows:

{"title": "httpbin(1): HTTP Client Testing Service", "url": "https://httpbin.org/", "requestedUrl": "https://httpbin.org/", "geometry": [0, 0, 1024, 768]}

You can see that the basic information about the request is returned as JSON.
We can control what is returned by passing different parameters: html=1 adds the page source to the result, png=1 adds a PNG screenshot of the page, and har=1 adds the page's HAR data. For example:

curl 'http://localhost:8050/render.json?url=https://httpbin.org&html=1&har=1'

This returns a JSON result that contains the source code of the web page and the HAR data.
More parameter settings are available in the official documentation: https://splash.readthedocs.io....
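
When png=1 is passed, the Splash documentation describes the screenshot as a base64-encoded string under the png key, so it has to be decoded before it is written to disk. A minimal sketch follows. The payload is a hand-made stand-in (its bytes are only the PNG signature, not a real screenshot), and the helper name save_png is an assumption:

```python
import base64
import json
import os
import tempfile

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first eight bytes of every PNG file


def save_png(result, path):
    """Decode the base64 `png` field of a render.json result and write it to path."""
    data = base64.b64decode(result['png'])
    with open(path, 'wb') as f:
        f.write(data)
    return data


# Stand-in for response.text: a real response holds a full screenshot.
fake_body = json.dumps({
    'url': 'https://httpbin.org/',
    'png': base64.b64encode(PNG_SIGNATURE).decode('ascii'),
})
result = json.loads(fake_body)

path = os.path.join(tempfile.gettempdir(), 'render_json_demo.png')
data = save_png(result, path)
print(data.startswith(PNG_SIGNATURE))
```

With a live service, result would instead come from requests.get(...).json() on a render.json URL that includes png=1.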

execute

This is the most powerful interface. We have already covered Splash Lua scripts at length, and this interface is how they are run from outside the Splash page.
The render.html and render.png interfaces above are sufficient for simply rendering JavaScript pages, but they cannot perform any interaction. For that we need the execute interface, which runs a Lua script against the web page.
Let's start with the simplest script, which returns data directly:

function main(splash)
    return 'hello'
end

The script is then URL-encoded and appended to the execute interface address, as shown in the following example:

curl http://localhost:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend

Run result:

hello

Here we pass the URL-encoded Lua script through the lua_source parameter and get the result of the script through the execute interface.
What we care about more is how to do this in Python, as in the following example:

import requests
from urllib.parse import quote

lua = '''
function main(splash)
    return 'hello'
end
'''

url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
print(response.text)

Run result:

hello

Here we wrap the Lua script in triple quotes in Python, URL-encode it with the quote() method of the urllib.parse module, and then build the Splash request URL, passing the script as the lua_source parameter. The output is the result of executing the Lua script.
Let's take another example:

import requests
from urllib.parse import quote

lua = '''
function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end
'''

url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
print(response.text)

Run result:

{"url": "http://httpbin.org/get", "status": 200, "html": "{\n  \"args\": {}, \n  \"headers\": {\n    \"Accept-Encoding\": \"gzip, deflate\", \n    \"Accept-Language\": \"en,*\", \n    \"Connection\": \"close\", \n    \"Host\": \"httpbin.org\", \n    \"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1\"\n  }, \n  \"origin\": \"60.207.237.85\", \n  \"url\": \"http://httpbin.org/get\"\n}\n"}

The result is in JSON form, and we have obtained the requested URL, the status code, and the page source.
In this way, all the Lua scripts described earlier can be driven from Python, so dynamic rendering, simulated clicks, form submission, page scrolling, and delayed waits can all be controlled freely, and obtaining page source code and screenshots is no problem either.
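
For longer scripts, URL-encoding the Lua source by hand gets unwieldy. Splash's endpoints also accept their arguments as a JSON POST body, so the script can be sent unescaped. A minimal sketch using only the standard library follows. build_execute_request is a hypothetical helper, the Lua script is an illustrative one, and the request is only constructed here, not sent:

```python
import json
from urllib.request import Request

lua = '''
function main(splash, args)
  splash:go(args.url)
  splash:wait(args.wait)
  return splash:html()
end
'''


def build_execute_request(lua_source, **args):
    """Build (but do not send) a JSON POST request for the execute endpoint.

    Extra keyword arguments become fields of `args` inside the Lua script.
    """
    body = dict(args, lua_source=lua_source)
    return Request(
        'http://localhost:8050/execute',
        data=json.dumps(body).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )


req = build_execute_request(lua, url='https://www.baidu.com', wait=1)
print(req.get_method())
print(sorted(json.loads(req.data.decode('utf-8'))))
```

To send it, pass req to urllib.request.urlopen(), or give the same dictionary to requests.post() through its json argument.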

8. Splash Load Balancing Configuration

If we use Splash to capture JavaScript-rendered pages and the crawl volume is very large, a single Splash service will come under too much pressure. In that case we can set up a load balancer to spread the load across several servers. This amounts to having multiple machines and multiple services share the work, which reduces the pressure on any single Splash service.

1. Configure Splash Service

To set up Splash load balancing, we first need multiple Splash services. Suppose Splash is running on port 8050 of four remote hosts, with service addresses 41.159.27.223:8050, 41.159.27.221:8050, 41.159.27.9:8050, and 41.159.117.119:8050. The four services are identical, each started from the Splash Docker image, and Splash can be used by accessing any one of them.

2. Configure Load Balancing

Next we can configure load balancing using any host with public network IP. First we need to install Nginx on this host, then modify the configuration file nginx.conf of Nginx to add the following:

http {
    upstream splash {
        least_conn;
        server 41.159.27.223:8050;
        server 41.159.27.221:8050;
        server 41.159.27.9:8050;
        server 41.159.117.119:8050;
    }
    server {
        listen 8050;
        location / {
            proxy_pass http://splash;
        }
    }
}

This defines a service cluster configuration named splash through the upstream block. Here least_conn selects least-connections load balancing: each request goes to the server with the fewest active connections. It suits requests whose processing times vary widely, because it keeps any one server from being overloaded.

Or we can leave the strategy unspecified and configure it as follows:

upstream splash {
    server 41.159.27.223:8050;
    server 41.159.27.221:8050;
    server 41.159.27.9:8050;
    server 41.159.117.119:8050;
}

This defaults to round-robin load balancing, where requests are distributed evenly across the servers. It suits servers with similar configurations and services that are stateless and quick to respond.

In addition, we can specify weights as follows:

upstream splash {
    server 41.159.27.223:8050 weight=4;
    server 41.159.27.221:8050 weight=2;
    server 41.159.27.9:8050 weight=2;
    server 41.159.117.119:8050 weight=1;
}

The weight parameter assigns a weight to each server: the higher the weight, the more requests that server is given. This configuration is useful when the servers differ greatly in capacity.
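
To see what these weights mean in practice, the following sketch simulates a smooth weighted round-robin over nine requests. It is a simplified model written for illustration, not Nginx's actual implementation, and the function name is an assumption:

```python
from collections import Counter


def weighted_round_robin(weights, n):
    """Yield n server choices using smooth weighted round-robin.

    weights maps server -> weight. Simplified model for illustration only.
    """
    total = sum(weights.values())
    current = {server: 0 for server in weights}
    for _ in range(n):
        # Every server earns credit equal to its weight on each request.
        for server, weight in weights.items():
            current[server] += weight
        # The server with the most credit handles the request and pays `total`.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


weights = {
    '41.159.27.223:8050': 4,
    '41.159.27.221:8050': 2,
    '41.159.27.9:8050': 2,
    '41.159.117.119:8050': 1,
}
picks = list(weighted_round_robin(weights, 9))
print(Counter(picks))
```

Over one full cycle of nine requests (the sum of the weights), each server is chosen as many times as its weight: four, two, two, and one.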

Finally, there is an IP hash load balancing configuration as follows:

upstream splash {
    ip_hash;
    server 41.159.27.223:8050;
    server 41.159.27.221:8050;
    server 41.159.27.9:8050;
    server 41.159.117.119:8050;
}

Nginx hashes the IP address of the requesting client so that the same client is always served by the same server. This strategy suits stateful services, such as pages accessed after a user logs in, but it is not a good fit for Splash.

We can choose a configuration to suit the situation. After the configuration is complete, reload Nginx:

sudo nginx -s reload

This enables load balancing by directly accessing port 8050 of the server where Nginx resides.

3. Configure authentication

Splash is now publicly accessible. If we don't want that, we can configure authentication, again with Nginx: add the auth_basic and auth_basic_user_file directives to the location block of the server, configured as follows:

http {
    upstream splash {
        least_conn;
        server 41.159.27.223:8050;
        server 41.159.27.221:8050;
        server 41.159.27.9:8050;
        server 41.159.117.119:8050;
    }
    server {
        listen 8050;
        location / {
            proxy_pass http://splash;
            auth_basic "Restricted";
            auth_basic_user_file /etc/nginx/conf.d/.htpasswd;
        }
    }
}

The username and password file used here is placed in the /etc/nginx/conf.d directory, and we create it with the htpasswd command. For example, to create a file with the username admin, run:

htpasswd -c .htpasswd admin

We are then prompted for the password. After entering it twice, a password file is generated. Check its contents:

cat .htpasswd 
admin:5ZBxQr0rCqwbc

When the configuration is complete, reload Nginx again with the following command:

sudo nginx -s reload

This allows access authentication to be successfully configured.
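
With auth_basic enabled, clients must send an Authorization header with every request. The sketch below shows how that header is formed under HTTP Basic authentication. The helper name is an assumption, and admin/admin are placeholder credentials:

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode('{}:{}'.format(username, password).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


# 'admin' / 'admin' are placeholders; use the credentials set with htpasswd.
header = basic_auth_header('admin', 'admin')
print(header)
```

The credentials are only base64-encoded, not encrypted, so anyone who can read the traffic can recover them unless the connection uses HTTPS.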

4. Testing

Finally, we can test the load balancing configuration with code to see whether the IP changes between requests. http://httpbin.org/get works for this test. The code is as follows:

import requests
from urllib.parse import quote
import re

lua = '''
function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return treat.as_string(response.body)
end
'''

url = 'http://splash:8050/execute?lua_source=' + quote(lua)
response = requests.get(url, auth=('admin', 'admin'))
ip = re.search(r'(\d+\.\d+\.\d+\.\d+)', response.text).group(1)
print(ip)

Replace splash in the URL with the IP of your own Nginx server. Here I modified the hosts file and added splash as an alias.

After running the code several times, you can see that the IP changes between requests:

First result:

41.159.27.223

Second result:

41.159.27.9

This indicates that load balancing has been successfully achieved.
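
To check the distribution over many requests rather than two, the origin addresses can be tallied. The sketch below parses the origin field of the httpbin response instead of using a regular expression. The response bodies are hypothetical stand-ins for repeated requests, and the helper name is an assumption:

```python
import json
from collections import Counter


def origin_ip(body):
    """Return the first IP in the `origin` field of an httpbin /get response body."""
    origin = json.loads(body)['origin']
    # The field may hold several comma-separated addresses; keep the first.
    return origin.split(',')[0].strip()


# Hypothetical response bodies standing in for repeated requests.
bodies = [
    '{"origin": "41.159.27.223", "url": "http://httpbin.org/get"}',
    '{"origin": "41.159.27.9", "url": "http://httpbin.org/get"}',
    '{"origin": "41.159.27.223", "url": "http://httpbin.org/get"}',
]
tally = Counter(origin_ip(body) for body in bodies)
print(tally)
```

In a real test, each body would be the response.text returned by one run of the request code above.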

9. Conclusion

So far, we have been able to capture JavaScript-rendered pages using Python and Splash. Like Selenium, the Splash service covered in this section provides powerful page rendering, but it does not require a browser and is very convenient to use.

This section also configured load balancing, which lets multiple Splash services work together and reduces the load on any single service.

Tags: Python Javascript JSON Nginx

Posted on Tue, 06 Aug 2019 16:18:47 -0700 by dancahill