Crawler basics | 1.4 Exception handling and link resolution

1. Exception handling

The basic crawler skills have been mastered, but if exceptions occur while sending requests, such as a bad network connection or a rejected request, an error may be reported and the running program may be terminated.

The error module of urllib defines the exceptions produced by the request module. If a problem occurs, the request module raises an exception defined in the error module. Let's now use the error module to handle various exceptions.

1.1 URLError

The URLError class comes from urllib's error module. It inherits from OSError and is the base class of the error module's exceptions, so every exception raised by the request module can be caught with it.
Its important attribute reason returns the cause of the error, which helps us determine the exception type.

from urllib import request,error
try:
    request.urlopen('https://blog.csdn.net/Watson_Ashin/nothispage')
except error.URLError as e:
    print(e.reason)

==> Not Found

We visited a nonexistent address above, and the program printed "Not Found" without crashing; in other words, the exception was handled successfully and the program kept running.

1.2 HTTPError

It is a subclass of URLError, used specifically for HTTP request errors such as a failed authentication request. It has three attributes:

  • code: returns the HTTP status code, e.g. 404 means the page does not exist, 500 means an internal server error, etc.
  • reason: returns the cause of the error
  • headers: returns the response headers

from urllib import request,error
try:
    request.urlopen('https://blog.csdn.net/Watson_Ashin/nothispage')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n')

==>The output is as follows

Not Found
404
Server: openresty
Date: Mon, 10 Feb 2020 07:36:31 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 15600
Connection: close
Vary: Accept-Encoding
Set-Cookie: uuid_tt_dd=10_18823982490-1581320191409-585554; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
Set-Cookie: dc_session_id=10_1581320191409.605227; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
ETag: "5e3b798b-3cf0"

The code above prints the error reason, the error code and the response headers. Because URLError is the parent class of HTTPError, we can catch the more specific HTTPError first and then fall back to URLError:

from urllib import request,error
try:
    request.urlopen('https://blog.csdn.net/Watson_Ashin/nothispage')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('successful')

==>The output is as follows

Not Found
404
Server: openresty
Date: Mon, 10 Feb 2020 07:40:00 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 15600
Connection: close
Vary: Accept-Encoding
Set-Cookie: uuid_tt_dd=10_18823982490-1581320400576-704001; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
Set-Cookie: dc_session_id=10_1581320400576.261739; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
ETag: "5e3b798b-3cf0"

In this way HTTPError is caught first, giving access to its status code, reason, headers and so on.
If the exception is not an HTTPError, the URLError branch catches it and prints the error reason.
Finally, the else branch handles the normal logic. This is a complete exception handling pattern.

Sometimes reason is not a string but an object.

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('https://www.baidu.com',timeout=0.01)
except urllib.error.URLError as e:
    print(type(e))
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')

==> 

<class 'urllib.error.URLError'>
TIME OUT

Here we deliberately set a tiny timeout to force a timeout exception to be thrown.
As you can see, reason is an instance of the socket.timeout class, so we can use the isinstance() method to check its type and make a more fine-grained judgment about the exception.
To make a crawler more robust, all kinds of errors need to be handled!
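
Putting these pieces together, here is a minimal sketch of a reusable fetch helper that handles all three cases in one place (the function name and default timeout are illustrative assumptions, not from the original):

import socket
from urllib import request, error

def fetch(url, timeout=5):
    """Fetch a URL; return the response body, or None on a handled error."""
    try:
        response = request.urlopen(url, timeout=timeout)
    except error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...)
        print('HTTPError:', e.code, e.reason)
    except error.URLError as e:
        # Network-level failure: DNS error, refused connection, timeout, ...
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
        else:
            print('URLError:', e.reason)
    else:
        return response.read()
    return None

html = fetch('https://blog.csdn.net/Watson_Ashin/nothispage')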

2. Link resolution

The urllib library also provides the parse module, which defines the standard interface for URL handling, such as extracting, combining and transforming the various parts of a URL. It supports URLs of the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet and wais.

In essence these functions build URLs. Ordinary string splicing could achieve the same effect, but the parse module makes it easy to substitute parameters and, more importantly, solves encoding problems.

2.1 urlparse()

This method identifies a URL and splits it into its components.

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result,sep='\n')

==>The output is as follows

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

As you can see, the returned result is an object of type ParseResult containing six parts: scheme, netloc, path, params, query and fragment. Looking at the URL of this example, https://www.baidu.com/index.html;user?id=5#comment, urlparse() splits it at specific delimiters:

  • scheme: the protocol, everything before ://
  • netloc: the domain name, up to the first / after the scheme
  • path: the access path
  • params: the parameters, after the semicolon ;
  • query: the query string, after the question mark ?, typically used in GET-style URLs in key=value form
  • fragment: the anchor, after the hash #, used to jump directly to a position inside the page
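
As a side note, ParseResult is a tuple subclass, so each part can be read either by attribute name or by index; a minimal sketch:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
# Attribute access and index access return the same value
print(result.scheme)
print(result[0])

==>The output is as follows

https
https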

2.2 urlunparse()

urlunparse() is the inverse of urlparse(). It accepts an iterable whose length must be exactly 6. In essence it performs simple string splicing to construct a URL.

from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

==>  http://www.baidu.com/index.html;user?a=6#comment

2.3 urlsplit()

This method is very similar to urlparse(), except that it no longer parses the params part separately and returns only five results; params is merged into path. The values can also be obtained by index.

from urllib.parse import urlsplit
result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
print(result[0])

==>SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

==>http

2.4 urlunsplit()

Similar to urlunparse(), it is used to construct a URL. The argument must also be an iterable, but its length must be 5, i.e. params is not passed in; see the sketch below.
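
A minimal sketch, reusing the parts from the urlunparse() example above but dropping params:

from urllib.parse import urlunsplit

# Five parts: scheme, netloc, path, query, fragment (no params)
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

==>  http://www.baidu.com/index.html?a=6#comment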

2.5 urljoin()

With urlunparse() and urlunsplit() we can combine parts into a link, but only if we have an object of the right length with every part of the link clearly separated. There is another way to generate links: the urljoin() method. We provide a base link (base_url) as the first argument and a new link as the second. The method analyses the scheme, netloc and path of the base_url, supplements whatever is missing from the new link, and returns the result.

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FAQ.html'))
print(urljoin('http://www.baidu.com','https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc','https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com#comment','?category=2'))

==>The output is as follows

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

The base URL provides three items: scheme, netloc and path. If any of them is missing from the new link, it is supplemented from the base; if the new link has its own, the new link's part is used. The params, query and fragment of the base_url have no effect.

2.6 urlencode()

!! Pay attention: urlencode() is a commonly used method and is extremely handy when constructing GET request parameters !!

from urllib.parse import urlencode
params = {'name': 'watson',
          'age': 22}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

==>http://www.baidu.com?name=watson&age=22

Here the parameters are passed to urlencode() as a dictionary and serialized into a GET query string (this form only works for GET requests). This is handy when crawling multiple pages, although under the hood it is essentially simple string construction.
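
For instance, a minimal sketch of the multi-page case ('wd' and 'pn' are illustrative parameter names assumed for this example, not taken from the original):

from urllib.parse import urlencode

base_url = 'http://www.baidu.com/s?'
# Build one URL per result page by swapping in new parameter values
for page in range(3):
    params = {'wd': 'watson', 'pn': page * 10}
    print(base_url + urlencode(params))

==>The output is as follows

http://www.baidu.com/s?wd=watson&pn=0
http://www.baidu.com/s?wd=watson&pn=10
http://www.baidu.com/s?wd=watson&pn=20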

2.7 parse_qs() and parse_qsl()

Where there is serialization there must also be deserialization. If we have a string of GET request parameters, the parse_qs() and parse_qsl() methods turn it back into a dictionary or a list of tuples, respectively.

from urllib.parse import parse_qs 
query= 'name=germey&age=22' 
print(parse_qs(query)) 

==>{'name': ['germey'], 'age': ['22']}

from urllib.parse import parse_qsl 
query= 'name=germey&age=22' 
print(parse_qsl(query)) 

==>[('name', 'germey'), ('age', '22')]

2.8 quote()

This method converts content into URL-encoded form. When a URL contains Chinese parameters it can sometimes come out garbled; in that case quote() converts the Chinese characters into URL encoding.

from urllib.parse import quote
keyword = '哪吒'  # 'Nezha' in Chinese characters
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url) 

==>https://www.baidu.com/s?wd=%E5%93%AA%E5%90%92

This method is often used for search keywords in crawlers; besides Taobao, travel-category sites also frequently need this kind of character conversion.

2.9 unquote()

unquote() does the opposite: it decodes URL-encoded characters back into the original string.

from urllib.parse import unquote
print(unquote('%E5%93%AA%E5%90%92'))

==> 哪吒

 
