XPath for Python 3 crawlers

1, Introduction

XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C's XSLT standard, and both XQuery and XPointer are built on top of XPath expressions.

2, Installation

pip3 install lxml

3, Usage

Selecting nodes

Common path expressions

  • nodename

    • Selects all child nodes named nodename of the current node
    • xpath('div') selects all div children of the current node
  • /

    • Selects from the root node
    • xpath('/div') selects the div node at the root
  • //

    • Selects matching nodes anywhere in the document, regardless of their location
    • xpath('//div') selects all div nodes
  • .

    • Selects the current node
    • xpath('./div') selects the div nodes under the current node
  • ..

    • Selects the parent of the current node
    • xpath('..') goes back up to the parent node
  • @

    • Selects attributes
    • xpath('//@class') selects all class attributes
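The path expressions above can be tried out with lxml. The following is a minimal sketch; the sample markup and the variable names are made up for illustration and are not part of the original article.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
doc = etree.HTML("""
<div id="outer" class="box">
  <div id="inner" class="item"><p>hello</p></div>
</div>
""")

all_divs = doc.xpath('//div')        # every div, wherever it sits
root_html = doc.xpath('/html')       # start from the root node
outer = all_divs[0]
named = outer.xpath('div')           # div children of the current node
child_divs = outer.xpath('./div')    # same result, written with "."
parent = child_divs[0].xpath('..')   # back up to the parent node
classes = doc.xpath('//@class')      # all class attribute values

print([d.get('id') for d in all_divs])  # ['outer', 'inner']
print(parent[0].get('id'))              # outer
print(classes)                          # ['box', 'item']
```

Note that etree.HTML wraps the fragment in html and body tags, which is why '/html' matches the root here.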

Predicates: written in square brackets, used to find a specific node or a node containing a specified value

  • xpath('/body/div[1]')   selects the first div node under body

  • xpath('/body/div[last()]')   selects the last div node under body

  • xpath('/body/div[last()-1]')   selects the second-to-last div node under body

  • xpath('/body/div[position() < 3]')   selects the first two div nodes under body

  • xpath('/body/div[@class]')   selects the div nodes under body that have a class attribute

  • xpath('/body/div[@class="main"]')   selects the div nodes under body whose class attribute is main

  • xpath('/body/div[price > 35.00]')   selects the div nodes under body whose price child element has a value greater than 35
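A small sketch of the predicates listed above. The document is parsed as XML with etree.fromstring so that body is the root node and the '/body/...' paths match as written; the sample data, the id attributes, and the ids helper are assumptions made for this illustration.

```python
from lxml import etree

# Hypothetical XML sample; <body> is the root so '/body/div' matches directly.
root = etree.fromstring("""<body>
  <div id="a" class="main"><price>40.00</price></div>
  <div id="b"><price>20.00</price></div>
  <div id="c" class="side"><price>36.50</price></div>
</body>""")

def ids(expr):
    """Return the id attribute of every node matched by expr."""
    return [node.get('id') for node in root.xpath(expr)]

first = ids('/body/div[1]')                  # ['a']
last = ids('/body/div[last()]')              # ['c']
second_last = ids('/body/div[last()-1]')     # ['b']
first_two = ids('/body/div[position() < 3]') # ['a', 'b']
with_class = ids('/body/div[@class]')        # ['a', 'c']
main = ids('/body/div[@class="main"]')       # ['a']
pricey = ids('/body/div[price > 35.00]')     # ['a', 'c']

print(first, last, second_last, first_two, with_class, main, pricey)
```

For a page parsed with etree.HTML the root is html, so the same predicates would be written against '/html/body/div' or '//body/div'.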

Wildcards: XPath can select unknown XML elements through wildcards

  • xpath('/div/*')   selects all child nodes under div

  • xpath('/div[@*]')   selects the div nodes that have at least one attribute
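A brief sketch of both wildcards, assuming a made-up XML fragment whose root is div. The second query uses '//div[@*]' rather than '/div[@*]' so that nested div nodes are tested as well.

```python
from lxml import etree

# Hypothetical sample: the root div has no attributes, one nested div has one.
root = etree.fromstring(
    '<div><p>one</p><span id="s">two</span><div lang="en"/><div/></div>'
)

# "*" matches any element, so this lists every child of the root div.
child_tags = [el.tag for el in root.xpath('/div/*')]

# "@*" matches any attribute, so this keeps only div nodes that have one.
divs_with_attrs = root.xpath('//div[@*]')

print(child_tags)                            # ['p', 'span', 'div', 'div']
print([d.get('lang') for d in divs_with_attrs])  # ['en']
```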

Selecting multiple paths: use the "|" operator to combine several paths

  • xpath('//div | //table')   selects all div and table nodes
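The "|" operator can be sketched as follows. The sample XML and its id attributes are hypothetical and only serve to make the result visible.

```python
from lxml import etree

# Hypothetical sample with two div nodes, one table node and one p node.
root = etree.fromstring(
    '<root><div id="d1"/><table id="t1"/><p id="p1"/><div id="d2"/></root>'
)

# One query, two paths: the result holds both the div and the table nodes.
nodes = root.xpath('//div | //table')
found = sorted(node.get('id') for node in nodes)

print(found)  # ['d1', 'd2', 't1']
```

The p node is left out because neither path matches it.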

Functions: XPath functions make fuzzy matching easier

  • starts-with

    • xpath('//div[starts-with(@id, "ma")]')   selects the div nodes whose id value starts with ma
  • contains

    • xpath('//div[contains(@id, "ma")]')   selects the div nodes whose id value contains ma
  • and

    • xpath('//div[contains(@id, "ma") and contains(@id, "in")]')   selects the div nodes whose id value contains both ma and in
  • text()

    • xpath('//div[contains(text(), "ma")]')   selects the div nodes whose text contains ma
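The four queries above, run against a made-up page. The id values and texts are assumptions chosen so that each function matches a different subset.

```python
from lxml import etree

# Hypothetical markup for illustration only.
doc = etree.HTML("""
<div id="main">main text</div>
<div id="mail">other</div>
<div id="domain">domain text</div>
<div id="side">map</div>
""")

def ids(expr):
    """Return the id attribute of every node matched by expr."""
    return [node.get('id') for node in doc.xpath(expr)]

starts = ids('//div[starts-with(@id, "ma")]')
has_ma = ids('//div[contains(@id, "ma")]')
both = ids('//div[contains(@id, "ma") and contains(@id, "in")]')
by_text = ids('//div[contains(text(), "ma")]')

print(starts)   # ['main', 'mail']
print(has_ma)   # ['main', 'mail', 'domain']
print(both)     # ['main', 'domain']
print(by_text)  # ['main', 'domain', 'side']
```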

Common functions:

Precise positioning

  • contains(str1, str2) tests whether str1 contains str2

    • Example 1: //*[contains(@class, 'c-summary c-row')] selects the nodes whose @class value contains c-summary c-row
    • Example 2: //div[contains(.//text(), 'price')] selects the div nodes whose text() contains price
  • position() returns the position of the current node within the selected node set

    • Example 1: //*[@class='result'][position()=1] selects the first node with @class='result'
    • Example 2: //*[@class='result'][position()<=2] selects the first two nodes with @class='result'
  • last() returns the position of the last node in the selected node set

    • Example 1: //*[@class='result'][last()] selects the last node with @class='result'
    • Example 2: //*[@class='result'][last()-1] selects the penultimate node with @class='result'
  • following-sibling selects all siblings after the current node

    • Example: //div[@class='result']/following-sibling::div selects all div nodes at the same level that come after the div node with @class='result'
    • Since this can match several nodes, a position can be used to pick one, for example: //div[@class='result']/following-sibling::div[position()=1]
  • preceding-sibling selects all siblings before the current node

    • Used in the same way as following-sibling
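The positioning functions and the sibling axes can be checked with a short sketch. The XML sample and the id attributes are hypothetical; the sibling queries anchor on a single node by id to keep the result easy to read.

```python
from lxml import etree

# Hypothetical sample: three "result" nodes with one unrelated node in between.
root = etree.fromstring("""<root>
  <div id="r1" class="result"/>
  <div id="r2" class="result"/>
  <div id="x" class="other"/>
  <div id="r3" class="result"/>
</root>""")

def ids(expr):
    """Return the id attribute of every node matched by expr."""
    return [node.get('id') for node in root.xpath(expr)]

first = ids("//*[@class='result'][position()=1]")
first_two = ids("//*[@class='result'][position()<=2]")
last = ids("//*[@class='result'][last()]")
penultimate = ids("//*[@class='result'][last()-1]")

# Sibling axes, starting from one specific node.
after_r1 = ids("//div[@id='r1']/following-sibling::div")
next_after_r1 = ids("//div[@id='r1']/following-sibling::div[position()=1]")
just_before_r3 = ids("//div[@id='r3']/preceding-sibling::div[1]")

print(first, first_two, last, penultimate)
print(after_r1, next_after_r1, just_before_r3)
```

On the preceding-sibling axis positions count backwards from the current node, so [1] is the nearest sibling before it.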

Filtering information

  • substring-before(str1, str2) returns the part of the string str1 before the first occurrence of str2

    • Example: substring-before(.//*[@class='c-more_link']/text(), 'bar')
    • Returns the part of .//*[@class='c-more_link']/text() before the first 'bar'. If there is no 'bar', it returns an empty string
  • substring-after(str1, str2) works like substring-before, but returns the part of str1 after the first occurrence of str2

    • Example 1: substring-after(.//*[@class='c-more_link']/text(), 'bar')
    • Returns the part of .//*[@class='c-more_link']/text() after the first 'bar'. If there is no 'bar', it returns an empty string
    • Example 2: substring-after(substring-before(.//*[@class='c-more_link']/text(), 'news'), 'article')
    • Returns the part of .//*[@class='c-more_link']/text() that lies before the first 'news' and after the first 'article'
  • normalize-space() removes the whitespace at the beginning and end of a string, and replaces each run of consecutive whitespace characters inside the string with a single space

    • Example: normalize-space(.//*[contains(@class, 'c-summary c-row')])
  • translate(string, str1, str2) replaces every character of string that appears in str1 with the character at the same position in str2. If str2 has no character at that position, the character is deleted from string

    • Example: translate('12:30', '03', '54') result: '12:45'
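The string functions above return a string instead of a list of nodes. A minimal sketch follows; the sample text inside the c-more_link element and the span with extra whitespace are invented for this illustration.

```python
from lxml import etree

# Hypothetical sample; only the class name c-more_link comes from the text above.
root = etree.fromstring(
    '<p><a class="c-more_link">view 12 news items</a>'
    '<span>  too   many   spaces </span></p>'
)

link_text = ".//*[@class='c-more_link']/text()"

before = root.xpath("substring-before(%s, ' news')" % link_text)
after = root.xpath("substring-after(%s, 'news ')" % link_text)
missing = root.xpath("substring-before(%s, 'bar')" % link_text)
# Nesting the two calls cuts out the part between 'view ' and ' news'.
between = root.xpath(
    "substring-after(substring-before(%s, ' news'), 'view ')" % link_text
)
clean = root.xpath("normalize-space(.//span)")
swapped = root.xpath("translate('12:30', '03', '54')")

print(repr(before))   # 'view 12'
print(repr(after))    # 'items'
print(repr(missing))  # ''
print(repr(between))  # '12'
print(repr(clean))    # 'too many spaces'
print(repr(swapped))  # '12:45'
```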

Splicing information

4, Use cases

(1) Read html

from lxml import etree
 
wb_data = """
        <div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a>
             </ul>
         </div>
        """
html = etree.HTML(wb_data)
print(html)
result = etree.tostring(html)
print(result.decode("utf-8"))

As the output below shows, the html we print is a Python object (an Element). etree.tostring(html) serializes it back to HTML, and in doing so lxml repairs the incomplete markup: it closes the unclosed li tag and adds the missing html and body tags.

<Element html at 0x39e58f0>
<html><body><div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a>
             </li></ul>
         </div>
        </body></html>

(2) Open and read an HTML file

from lxml import etree

# Use parse to open an HTML file
html = etree.parse('test.html', etree.HTMLParser())
html_data = html.xpath('//*')
# xpath returns a list, which needs to be traversed
print(html_data)
for i in html_data:
    print(i.text)
# Serialize the parsed tree back to a string
html_data = etree.tostring(html, pretty_print=True)
res = html_data.decode('utf-8')
print(res)
 
Print:
<div>
     <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
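To try the file-based variant without an existing test.html, the file can be created on the fly. This is only a sketch: the shortened markup and the temporary directory are assumptions, while etree.parse and etree.HTMLParser are used as in the code above.

```python
import os
import tempfile

from lxml import etree

# Hypothetical, shortened version of the sample markup.
markup = (
    '<div><ul><li class="item-0">'
    '<a href="link1.html">first item</a></li></ul></div>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(markup)
    # Parse the file from disk with the forgiving HTML parser.
    tree = etree.parse(path, etree.HTMLParser())

tags = [el.tag for el in tree.xpath('//*')]
texts = [a.text for a in tree.xpath('//a')]

print(tags)   # ['html', 'body', 'div', 'ul', 'li', 'a']
print(texts)  # ['first item']
```

The html and body entries in the tag list come from the parser, which adds them around the fragment.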

(3) Get the content of the a tags (basic use). Note that when you select the a tags themselves, the path must not end with a forward slash after a, otherwise an error is raised.

Method 1:

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a')
print(html)
for i in html_data:
    print(i.text)
 
 
<Element html at 0x12fe4b8>
first item
second item
third item
fourth item
fifth item

Method 2: append /text() to the path to get the text content directly

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a/text()')
print(html)
for i in html_data:
    print(i)
 
<Element html at 0x138e4b8>
first item
second item
third item
fourth item
fifth item
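The difference between the two methods, and the error caused by a trailing slash, can be sketched as follows. The one-item markup is a shortened, hypothetical version of the sample.

```python
from lxml import etree

# Hypothetical, shortened sample markup.
doc = etree.HTML('<ul><li><a href="link1.html">first item</a></li></ul>')

# Method 1 returns Element objects; the text is read from .text.
elements = doc.xpath('//li/a')
# Method 2 returns the text strings directly.
texts = doc.xpath('//li/a/text()')

# A path that ends with "/" is not a valid expression.
try:
    doc.xpath('//li/a/')
    trailing_slash_ok = True
except etree.XPathEvalError:
    trailing_slash_ok = False

print(elements[0].text)   # first item
print(texts)              # ['first item']
print(trailing_slash_ok)  # False
```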

(4) Print the href attribute of the a tags under the specified path (the query returns the attribute values, which can be traversed in the same way as the tag contents)

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a/@href')
for i in html_data:
    print(i)
 
//Print:
link1.html
link2.html
link3.html
link4.html
link5.html

(5) xpath always returns a list here (of Element objects or of strings), so to get at the content you need to traverse that list. The example below looks up the text of the a tag whose href is link2.html.

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a[@href="link2.html"]/text()')
print(html_data)
for i in html_data:
    print(i)
 
//Print:
['second item']
second item

(6) All the paths above are absolute (each one starts from the root). Below we use a relative path instead, for example to get the text of the a tags under all li tags.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a/text()')
print(html_data)
for i in html_data:
    print(i)
 
//Print:
['first item', 'second item', 'third item', 'fourth item', 'fifth item']
first item
second item
third item
fourth item
fifth item

(7) Above we used an absolute path (starting with /) to get the href attribute values of all the a tags. Next we use a relative path to get the href values of the a tags under the li tags. The example writes a double // after the a tag, which also matches href attributes on anything nested inside a; for this sample, //li/a/@href returns the same result.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a//@href')
print(html_data)
for i in html_data:
    print(i)
 
//Print:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
link1.html
link2.html
link3.html
link4.html
link5.html
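The effect of the double // after a only becomes visible when something inside the a tag carries an href of its own. The sketch below uses a contrived XML fragment for that purpose; the nested b element and its href are assumptions, not part of the sample page.

```python
from lxml import etree

# Contrived sample: the nested <b> has its own href attribute.
root = etree.fromstring(
    '<ul><li><a href="link1.html"><b href="nested.html">first</b></a></li></ul>'
)

# Single slash: only the href of the a tag itself.
direct = root.xpath('//li/a/@href')
# Double slash: href of the a tag and of everything nested inside it.
deep = root.xpath('//li/a//@href')

print(direct)        # ['link1.html']
print(sorted(deep))  # ['link1.html', 'nested.html']
```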

(8) Find the text of the a tag in the last li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()]/a/text()')
print(html_data)
for i in html_data:
    print(i)
 
//Print:
['fifth item']
fifth item

(9) Find the text of the a tag in the penultimate li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()-1]/a/text()')
print(html_data)
for i in html_data:
    print(i)
 
//Print:
['fourth item']
fourth item
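The same two predicates also work for attributes. As a sketch, replacing text() with @href returns the link of the last and the penultimate li; wb_data is repeated here so the snippet runs on its own.

```python
from lxml import etree

# Same sample markup as in (1), including the unclosed last li tag.
wb_data = """
        <div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a>
             </ul>
         </div>
        """
html = etree.HTML(wb_data)

last_href = html.xpath('//li[last()]/a/@href')
penultimate_href = html.xpath('//li[last()-1]/a/@href')

print(last_href)         # ['link5.html']
print(penultimate_href)  # ['link4.html']
```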

Posted on Fri, 05 Jun 2020 20:15:25 -0700 by BinaryDragon