Python crawler - lxml, xpath

Scraping the movies that are currently showing from Douban Movie.

import requests
from lxml import etree

header = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.132 Safari/537.36 '
}
url = 'https://movie.douban.com/cinema/nowplaying/foshan/'
req = requests.get(url, headers=header)
txt = req.text
html = etree.HTML(txt)
ul = html.xpath('//ul[@class="lists"]')[0]
# Or, scoped to the "nowplaying" section:
# ul = html.xpath('//div[@id="nowplaying"]//ul[@class="lists"]')[0]
# Each movie title is stored in the data-title attribute of an <li>
lis = ul.xpath('./li/@data-title')
for li in lis:
    print(li)

Result: (screenshot of the printed movie titles not preserved)

Several points need to be noted:

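First: for a node query, xpath() returns a list of matches, so you need to index into the result (for example with [0]) to get the element itself. Passing the whole list to etree.tostring() fails, because a list cannot be serialized.

Error: TypeError: Type 'list' cannot be serialized.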

import requests
from lxml import etree

header = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.132 Safari/537.36 '
}
url = 'https://movie.douban.com/cinema/nowplaying/foshan/'
req = requests.get(url, headers=header)
txt = req.text
html = etree.HTML(txt)
# Wrong: xpath() returns a list, which tostring() cannot serialize
ul = html.xpath('//ul[@class="lists"]')
# Correct: take the first matching element
# ul = html.xpath('//ul[@class="lists"]')[0]
print(etree.tostring(ul, encoding='utf-8').decode('utf-8'))
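The same behaviour can be reproduced without a network request. The following is a minimal sketch, assuming a small made-up HTML snippet (SAMPLE is a hypothetical stand-in, not the real Douban page). It shows that xpath() hands back a list, that serializing the list raises TypeError, and that serializing the first element works.

```python
from lxml import etree

# Hypothetical stand-in markup; the real Douban page is much larger.
SAMPLE = '<div><ul class="lists"><li data-title="Movie A">A</li></ul></div>'

html = etree.HTML(SAMPLE)

# A node query returns a list, even when only one element matches.
matches = html.xpath('//ul[@class="lists"]')
print(type(matches).__name__, len(matches))

# Passing the list itself to tostring() raises TypeError.
try:
    etree.tostring(matches, encoding='utf-8')
    list_error = None
except TypeError as exc:
    list_error = str(exc)
print(list_error)

# Guard against an empty result before indexing, then serialize the element.
serialized = ''
if matches:
    serialized = etree.tostring(matches[0], encoding='utf-8').decode('utf-8')
print(serialized)
```

Checking that the list is non-empty before using [0] is a design choice of this sketch: if the page layout changes and nothing matches, indexing an empty list would raise IndexError instead.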

Second: when you call xpath() again on an element to get its descendants, put a dot before // (that is, .//) so that the search starts from the current element. Without the dot, the expression ignores the current element and matches every such tag on the whole page.

import requests
from lxml import etree

header = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.132 Safari/537.36 '
}
url = 'https://movie.douban.com/cinema/nowplaying/foshan/'
req = requests.get(url, headers=header)
txt = req.text
html = etree.HTML(txt)
ul = html.xpath('//ul[@class="lists"]')[0]
lis = ul.xpath('//li/@data-title')  # matches every <li> on the page, including upcoming movies
# lis = ul.xpath('.//li/@data-title')  # matches only <li> elements under ul, excluding upcoming movies
for li in lis:
    print(li)

Result: (output screenshot not preserved)
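The difference between // and .// can also be checked offline. This is a minimal sketch, assuming a made-up page with two lists (the SAMPLE markup, the "upcoming" id, and the movie names are hypothetical and only illustrate the idea). Both queries are run on the same ul element.

```python
from lxml import etree

# Hypothetical markup: one list of current movies, one of upcoming movies.
SAMPLE = '''
<div id="nowplaying"><ul class="lists">
  <li data-title="Now A"></li><li data-title="Now B"></li>
</ul></div>
<div id="upcoming"><ul class="lists">
  <li data-title="Soon C"></li>
</ul></div>
'''

html = etree.HTML(SAMPLE)
ul = html.xpath('//div[@id="nowplaying"]//ul[@class="lists"]')[0]

# '//' starts from the document root, even though it is called on ul.
absolute = ul.xpath('//li/@data-title')
# './/' starts from ul, so only its own descendants are matched.
relative = ul.xpath('.//li/@data-title')

print(absolute)
print(relative)
```

With this sample, the first query also picks up the title from the second list, while the second query returns only the titles under the selected ul.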

Summary: Still learning, will constantly update...

Tags: Windows encoding

Posted on Sun, 06 Oct 2019 19:01:24 -0700 by lucym