Correct Use of Parsers with the Python Crawler Library bs4

The reason the bs4 library can quickly locate the elements we want is that it parses the HTML file first, and different parsers behave differently. This is described below.

bs4 Parser Selection

The ultimate purpose of a web crawler is to filter out the network information we want, and arguably the most important component in that process is the parser: it determines the speed and efficiency of the crawler. In addition to the 'html.parser' parser we used above, the bs4 library supports several third-party parsers, such as lxml and html5lib.

The bs4 documentation officially recommends the lxml parser because it is faster, so we will use it here as well.
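The parser is selected by the second argument to BeautifulSoup. A minimal sketch comparing the built-in parser with lxml (assuming lxml is installed; on this simple markup both produce the same tree):

```python
from bs4 import BeautifulSoup

markup = "<p class='title'><b>The Dormouse's story</b></p>"

# The second argument names the parser; both build the same tree here,
# but lxml is generally faster and more tolerant of broken markup.
soup_builtin = BeautifulSoup(markup, "html.parser")  # pure Python, no install needed
soup_lxml = BeautifulSoup(markup, "lxml")            # requires `pip install lxml`

print(soup_builtin.b.string)  # The Dormouse's story
print(soup_lxml.b.string)     # The Dormouse's story
```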

Installation of lxml parser:

  • Install it with the pip tool, as before:
$ pip install lxml

Note that I'm using a Unix-like system, where the pip tool works smoothly; under Windows, installation tends to run into one problem or another. Windows users are therefore advised to download the official lxml installation package for their system version and install the lxml parser from that instead.

Using the lxml parser to parse web pages

Let's again use the Alice document from last time as an example:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """

Try it:

import bs4

# First, make a "pot of soup" from the HTML file, this time with the lxml parser
soup = bs4.BeautifulSoup(open('Beautiful Soup Reptiles/demo.html'), 'lxml')

# Printing the prettified result gives a clear tree structure:
print(soup.prettify())
    
'''
OUT:
    
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''

How to use it specifically?

The bs4 library first converts the incoming string or file handle to Unicode, so we won't run into encoding problems when scraping Chinese text. Of course, some uncommon encodings such as `big5` require us to specify the encoding manually:
soup = BeautifulSoup(markup, from_encoding="big5")
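A small sketch of how from_encoding overrides the automatic detection; gb2312 is used here purely as an illustrative encoding:

```python
from bs4 import BeautifulSoup

# A byte string in a specific Chinese encoding; bs4 would normally
# try to auto-detect this, but from_encoding forces the right one.
markup = "<p>你好</p>".encode("gb2312")

soup = BeautifulSoup(markup, "html.parser", from_encoding="gb2312")
print(soup.p.string)  # 你好
```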

Types of objects:

The bs4 library converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.
Let's look at each one:

Tag: corresponds directly to a tag in the HTML document and is easy to work with

NavigableString: A string wrapped in a tag

BeautifulSoup: Represents the entire content of a document. Most of the time, you can see it as a tag object, which supports traversing the document tree and searching the document tree.

Comment: a special NavigableString object; when a comment appears in an HTML document, it is output in a special format.
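A quick way to see all four types is to inspect a tiny document that contains a comment; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b><!--a comment-->bold text</b>", "html.parser")

tag = soup.b                 # a Tag object
comment = tag.contents[0]    # the comment inside <b> is a Comment
text = tag.contents[1]       # the plain text is a NavigableString

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(comment).__name__)  # Comment
print(type(text).__name__)     # NavigableString
```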

The easiest way to search the document tree is to search for the name of the tag you want to get:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

If you want to go deeper and get a smaller tag, for example the part of the body wrapped by the <b> tag:

soup.body.b
# <b>The Dormouse's story</b>

But this method only finds the first tag that appears in document order.

Get all the tags?

For that you need the find_all() method, which returns a list of all matches:

tag = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Suppose we want the second element in the list of <a> tags:
need = tag[1]
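Each element of the list returned by find_all() is a Tag, so its attributes (href, id, class, and so on) can be read with dictionary-style indexing or .get(). A small illustration on a shortened version of the Alice document:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Read attributes from each matched Tag with [] or .get()
for a in soup.find_all("a"):
    print(a["id"], a.get("href"))
# link1 http://example.com/elsie
# link2 http://example.com/lacie
```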

The .contents property of a tag outputs the tag's child nodes as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
  • In addition, the tag's .children generator lets you loop over the tag's child nodes:
for child in title_tag.children:
    print(child)
    # The Dormouse's story
  • This only traverses direct child nodes. How do we traverse descendant nodes?

Child node vs. descendant node: for example, the child node of head is <title>The Dormouse's story</title>, and the title tag itself has a child node, the string 'The Dormouse's story'. That string is therefore also called a descendant node of head.

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

How do I find all the text content under tag?

1. If the tag has only one child node (a NavigableString), you can get it directly with tag.string.

2. If tag has many children and grandchildren, and each node has a string:

We can find them all iteratively:

for string in soup.strings:
    print(repr(string))
    # u"The Dormouse's story"
    # u'\n\n'
    # u"The Dormouse's story"
    # u'\n\n'
    # u'Once upon a time there were three little sisters; and their names were\n'
    # u'Elsie'
    # u',\n'
    # u'Lacie'
    # u' and\n'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # u'...'
    # u'\n'

Well, that covers the basic use of the bs4 library. The remaining navigation (parent nodes, sibling nodes, and moving backward and forward) all follows the same pattern as the child-node traversal shown above.
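One more convenience worth knowing: .strings yields the whitespace-only strings seen in the output above, while .stripped_strings skips them and trims the rest, which is usually what a crawler wants. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>  Once upon a time\n<a>Elsie</a> ,\n<a>Lacie</a></p>",
    "html.parser",
)

# .stripped_strings drops whitespace-only strings and strips the others
print(list(soup.stripped_strings))
# ['Once upon a time', 'Elsie', ',', 'Lacie']
```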

Summary

Choose an efficient parser (the officially recommended lxml), build a soup from the document, and then traverse or search the resulting tree: that is the core bs4 workflow covered in this article.


Tags: Python pip network Unix

Posted on Sat, 21 Mar 2020 20:09:49 -0700 by 696020