Web Crawler Notes: The PhantomJS Headless Browser + Driving PhantomJS with the selenium Module

PhantomJS Headless Browser

PhantomJS is a headless WebKit-based browser scriptable with JavaScript, i.e. a browser without a display interface. With it you can retrieve any content a site loads via JavaScript, in other words data the browser fetches asynchronously.

After downloading PhantomJS, unzip it and move the extracted folder into the Python installation directory.

Then add the bin folder inside the PhantomJS folder to the system PATH environment variable.

Open cmd and type phantomjs; if PhantomJS starts and prints its prompt, the installation succeeded.
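You can also verify the PATH setup from Python itself with the standard library (a minimal sketch; the printed messages are my own wording, not from the original tutorial):

```python
import shutil

# shutil.which searches the directories on PATH, the same lookup the
# shell performs, so it confirms whether the PhantomJS bin folder
# was added to the environment variable correctly.
path = shutil.which("phantomjs")
if path:
    print("phantomjs found at", path)
else:
    print("phantomjs not on PATH - check the environment variable")
```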

The selenium module is the Python module used to drive the PhantomJS browser.


webdriver.PhantomJS() instantiates a PhantomJS browser object
get('url') visits the given website
find_element_by_xpath('xpath expression') finds the matching element via an XPath expression
clear() clears the contents of an input box
send_keys('content') types content into an input box
click() triggers a click event
get_screenshot_as_file('screenshot save path') saves a screenshot of the web page to that path
page_source gets the HTML source code of the web page
quit() closes the PhantomJS browser

#!/usr/bin/env python
# -*- coding:utf8 -*-
from selenium import webdriver  #Import selenium module to operate Phantom JS
import os
import time
import re

llqdx = webdriver.PhantomJS()  #Instantiating Phantom JS Browser Objects
llqdx.get("https://www.baidu.com/")  #Visit the website

# time.sleep(3)   #Wait 3 seconds
# llqdx.get_screenshot_as_file('H:/py/17/img/123.jpg')  #Save screenshots of web pages to this directory

#Analog user operation
llqdx.find_element_by_xpath('//*[@id="kw"]').clear()  #Find the input box via the XPath expression; clear() empties its contents
llqdx.find_element_by_xpath('//*[@id="kw"]').send_keys('hawking network')  #Find the input box via the XPath expression; send_keys() types the search term into it
llqdx.find_element_by_xpath('//*[@id="su"]').click()  #Find the search button via the XPath expression; click() triggers the click event

time.sleep(3)   #Wait 3 seconds
llqdx.get_screenshot_as_file('H:/py/17/img/123.jpg')  #Save screenshots of web pages to this directory

neir = llqdx.page_source   #Getting Web Content
print(neir)
llqdx.quit()    #Close the browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  #Regular matching of page titles
print(title)
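The title regex above can miss upper-case tags or titles that span line breaks; a slightly more defensive version might look like this (extract_title is a name of my own, not part of the original script):

```python
import re

def extract_title(html):
    # re.IGNORECASE also matches <TITLE>; re.DOTALL lets the captured
    # title span line breaks inside the tag.
    m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

print(extract_title("<html><head><title>\n Baidu \n</title></head></html>"))
```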

Disguising the PhantomJS Browser and Scrolling to Load Data

Some websites load data dynamically and only fetch more content as the scrollbar moves down the page.

Implementation code

DesiredCapabilities disguises the browser object (for example by setting a custom User-Agent)
execute_script() executes JavaScript code

current_url gets the URL of the current page
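The scroll loop in the script below builds one window.scrollTo(start, end) string per screen of content. That string construction can be checked in isolation (scroll_commands is a hypothetical helper of my own; 1280 is the step size the script uses):

```python
def scroll_commands(steps, page_height=1280):
    # One window.scrollTo(start, end) call per screen-height step,
    # mirroring the js3 string built inside the crawl loop.
    return ["window.scrollTo(%d,%d)" % (i * page_height, (i + 1) * page_height)
            for i in range(steps)]

for cmd in scroll_commands(3):
    print(cmd)
```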

#!/usr/bin/env python
# -*- coding:utf8 -*-
from selenium import webdriver  #Import selenium module to operate Phantom JS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities   #Import Browser Camouflage Module
import os
import time
import re

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0')
print(dcap)
llqdx = webdriver.PhantomJS(desired_capabilities=dcap)  #Instantiating Phantom JS Browser Objects

llqdx.get("https://www.jd.com/")  #Visit the website

#Analog user operation
for j in range(20):
    js3 = 'window.scrollTo('+str(j*1280)+','+str((j+1)*1280)+')'
    llqdx.execute_script(js3)  #Execute the JS scroll command
    time.sleep(1)

llqdx.get_screenshot_as_file('H:/py/17/img/123.jpg')  #Save screenshots of web pages to this directory

url = llqdx.current_url
print(url)

neir = llqdx.page_source   #Getting Web Content
print(neir)
llqdx.quit()    #Close the browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  #Regular matching of page titles
print(title)


Tags: Python Selenium network Windows

Posted on Tue, 13 Aug 2019 04:24:49 -0700 by AL-Kateb