Introduction to common Python crawler libraries (requests, BeautifulSoup, lxml, json)

1. requests Library

The most common HTTP method is GET:
import requests

response = requests.get('http://www.baidu.com')
print(response.status_code)  # Print status code
print(response.url)          # Print request url
print(response.headers)      # Print header information
print(response.cookies)      # Print cookie information
print(response.text)         # Print the page source as text
print(response.content)      # Print the response body as a byte stream

 

Besides GET, requests supports the other HTTP methods as well:
import requests

requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
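
All of these methods accept the same keyword arguments. For example, query parameters and form data can be passed like this (a minimal sketch against the httpbin test service used above; the key names are made up for illustration):

import requests

# GET with URL query parameters; requests encodes them into the query string
r = requests.get('http://httpbin.org/get', params={'key1': 'value1', 'key2': 'value2'})
print(r.url)     # http://httpbin.org/get?key1=value1&key2=value2

# POST with form data sent in the request body
r = requests.post('http://httpbin.org/post', data={'name': 'Zhang San'})
print(r.json())  # httpbin echoes the submitted form back as JSON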

 

2. BeautifulSoup Library

The main functions of the BeautifulSoup library are as follows:

After a document is parsed by BeautifulSoup, it can be output in a standard indented format, which prepares it for structured filtering and data extraction.
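
For instance, a minimal sketch of that indented output using the prettify() method:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><b>hello</b></p></body></html>', 'html.parser')
print(soup.prettify())  # re-prints the document with one tag per line, indented by nesting depth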

A soup document can be searched with the find() and find_all() methods, and elements can also be located with the select() selector method:
1. find_all() method

soup.find_all('div', "item")  # find <div> tags whose class is "item"

find_all(name, attrs, recursive, string, limit, **kwargs)
@PARAMS:
    name: the tag name to look for; can be a string, a list, a function, True, or a compiled re regular expression
    attrs: attributes of the target tag, such as class
    recursive: whether to search descendants recursively (bool)
    string: search by text content instead of tag name; combined with name, it finds tags whose name matches and whose text contains the given string
    limit: the maximum number of results to return
    **kwargs: other keyword arguments, matched against tag attributes
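
A minimal sketch of these parameters in action (the HTML snippet is made up for illustration):

import re
from bs4 import BeautifulSoup

doc = '<div class="item"><a id="link1">one</a><a id="link2">two</a></div>'
soup = BeautifulSoup(doc, 'html.parser')

print(soup.find_all('a'))                # name as a string: every <a> tag
print(soup.find_all('a', limit=1))       # return at most one result
print(soup.find_all(re.compile('^d')))   # name as a regex: tags starting with "d", i.e. <div>
print(soup.find_all('a', string='two'))  # <a> tags whose text is "two"
print(soup.find_all(id='link2'))         # keyword arguments match tag attributes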

 

2. find() method

The find() method is similar to find_all(), except that find_all() returns all the tags in the document that meet the conditions, as a list, while find() returns only the first matching Tag.
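
For example, continuing the sketch above:

print(soup.find('a'))       # only the first match: <a id="link1">one</a>
print(soup.find_all('a'))   # every match, returned as a list
print(soup.find('table'))   # no match: find() returns None, while find_all() returns []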

3. select() method

Working from the outer elements down to the target, select() extracts the required information with CSS selectors, which can even be copied directly from the browser's developer tools.

Introduction to the select() method

Example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

When writing CSS selectors, tag names are used as-is, class names are prefixed with a dot, and id names are prefixed with #. We can filter elements the same way with the soup.select() method, which returns a list.
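
Assuming the example document above is stored in html_doc, the soup object used below can be built like this:

from bs4 import BeautifulSoup

# parse the example document; 'lxml' also works here if it is installed
soup = BeautifulSoup(html_doc, 'html.parser')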

 

(1) Find by tag name

print(soup.select('title'))  # select every <title> tag and print it with its attributes and contents
# [<title>The Dormouse's story</title>]

print(soup.select('a'))      # select every <a> tag
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))      # select every <b> tag and print it
# [<b>The Dormouse's story</b>]

 

(2) Find by class name

print(soup.select('.sister'))   # find every tag with class="sister" and print it with its attributes and contents
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 

(3) Find by id name

print(soup.select('#link1'))  # find every tag with id="link1" and print it with its attributes and contents
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

 

(4) Combination search

Combining tag names with class and id names follows the same principle as writing a CSS file. For example, to match the element with id link1 inside a <p> tag, the two selectors are separated by a space.

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

 

A direct child tag can be looked up with the > combinator:

print soup.select("head > title")
#[<title>The Dormouse's story</title>]

 

(5) Attribute lookup

Attribute conditions can also be added to a search; they must be enclosed in square brackets. Note that the attribute and the tag belong to the same node, so no space may appear between them, otherwise there is no match.

print soup.select("head > title")
#[<title>The Dormouse's story</title>]
 
print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

 

Attribute selectors can also be combined with the search methods above: selectors for different nodes are separated by a space, while conditions on the same node are written without a space.

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

 

A complete example using the BeautifulSoup library:

from bs4 import BeautifulSoup
import requests

# url and headers are assumed to be defined for the page being crawled
url = 'https://example.com'
headers = {'User-Agent': 'Mozilla/5.0'}
n = []

f = requests.get(url, headers=headers)
soup = BeautifulSoup(f.text, 'lxml')

for k in soup.find_all('div', class_='pl2'):  # find <div> tags with class="pl2"
    b = k.find_all('a')          # under each matching div, find its <a> tags; each <a> here contains four groups of <span>
    n.append(b[0].get_text())    # take the string from the first group

 

3. lxml Library

lxml is an HTML/XML parser; its main function is to parse and extract HTML/XML data.

An example is as follows:

# use the etree module from lxml
from lxml import etree 

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a> # Note that a < / Li > closed label is missing here
     </ul>
 </div>
'''

# use etree.HTML to parse the string into an HTML document
html = etree.HTML(text) 

# serialize the HTML document back into a string
result = etree.tostring(html) 

print(result)

 

The output results are as follows:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

As you can see, lxml automatically repairs the HTML: not only is the missing </li> tag closed, but the html and body tags are added as well.
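
In practice, data is usually extracted from the parsed tree with XPath expressions; a minimal sketch continuing the example above:

# select the href attribute and the text of every <a> inside an <li>
links = html.xpath('//li/a/@href')
texts = html.xpath('//li/a/text()')

for href, text in zip(links, texts):
    print(href, text)
# link1.html first item
# link2.html second item
# ... and so on through link5.html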

4. json Library

Function      Description
json.dumps    encodes a Python object into a JSON string
json.loads    parses an encoded JSON string into a Python object

1. Use of json.dumps
#!/usr/bin/python
import json

data = [ { 'name' : 'Zhang San', 'age' : 25}, { 'name' : 'Li Si', 'age' : 26} ]

jsonStr1 = json.dumps(data)  # convert the Python object into a JSON string
jsonStr2 = json.dumps(data, sort_keys=True, indent=4, separators=(',', ':'))  # pretty-print the JSON; with sort_keys=True the keys are output in sorted order
jsonStr3 = json.dumps(data, ensure_ascii=False)  # do not escape Chinese characters to unicode sequences

print(jsonStr1)
print('---------------Dividing line------------------')
print(jsonStr2)
print('---------------Dividing line------------------')
print(jsonStr3)

 

Output results:

[{"name": "\u5f20\u4e09", "age": 25}, {"name": "\u674e\u56db", "age": 26}]
---------------Dividing line------------------
[
    {
        "age":25,
        "name":"\u5f20\u4e09"
    },
    {
        "age":26,
        "name":"\u674e\u56db"
    }
]
---------------Dividing line------------------
[{"name": "Zhang San", "age": 25}, {"name": "Li Si", "age": 26}]
2. Use of json.loads
#!/usr/bin/python
import json

data = [ { 'name' : 'Zhang San', 'age' : 25}, { 'name' : 'Li Si', 'age' : 26} ]

jsonStr = json.dumps(data)
print(jsonStr)

jsonObj = json.loads(jsonStr)
print(jsonObj)
# the parsed object is a list; iterate over it and read each name
for i in jsonObj:
    print(i['name'])

 

The output result is:

[{"name": "\u5f20\u4e09", "age": 25}, {"name": "\u674e\u56db", "age": 26}]

[{'name': 'Zhang San', 'age': 25}, {'name': 'Li Si', 'age': 26}]

Zhang San
Li Si
