Python 3 Standard Library: Splitting URLs with urllib.parse

1. urllib.parse: Splitting URLs into Components

The urllib.parse module provides functions to manage URLs and their components, including breaking URLs into components and making URLs out of components.

1.1 Parsing

The return value of the urlparse() function is a ParseResult object, which is equivalent to a tuple containing six elements.

from urllib.parse import urlparse

url = 'http://netloc/path;param?query=arg#frag'
parsed = urlparse(url)
print(parsed)

The parts of the URL available through the tuple interface are the scheme, network location, path, path segment parameters (separated from the path by a semicolon), query, and fragment.
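
As a minimal sketch of the tuple interface, the result can be unpacked or indexed like any six-element tuple. The variable names below are chosen only for illustration.

```python
from urllib.parse import urlparse

parsed = urlparse('http://netloc/path;param?query=arg#frag')

# ParseResult can be unpacked like any six-element tuple.
scheme, netloc, path, params, query, fragment = parsed

print(len(parsed))
print(scheme, netloc, path, params, query, fragment)
print(parsed[0], parsed[2])  # positional access by index
```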

Although the return value acts like a tuple, it is actually based on a namedtuple, a subclass of tuple that supports accessing the parts of the URL through named attributes as well as by index. Besides being easier for the programmer to use, the attribute API also gives access to several values that are not available through the tuple API.

from urllib.parse import urlparse

url = 'http://user:pwd@NetLoc:80/path;param?query=arg#frag'
parsed = urlparse(url)
print('scheme  :', parsed.scheme)
print('netloc  :', parsed.netloc)
print('path    :', parsed.path)
print('params  :', parsed.params)
print('query   :', parsed.query)
print('fragment:', parsed.fragment)
print('username:', parsed.username)
print('password:', parsed.password)
print('hostname:', parsed.hostname)
print('port    :', parsed.port)

The username and password are available when they are present in the input URL, and are set to None when they are not. The hostname is the same value as netloc, converted to lower case and with the port stripped. The port is converted to an integer when present, and is None when it is not.
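
The following sketch contrasts a URL that carries credentials and a port with one that does not, to show where None appears. The variable names are illustrative only.

```python
from urllib.parse import urlparse

with_auth = urlparse('http://user:pwd@NetLoc:80/path')
without_auth = urlparse('http://NetLoc/path')

for label, result in [('with auth', with_auth), ('without auth', without_auth)]:
    # netloc keeps the original text; hostname is lower-cased and has no port.
    print(label, '->', result.netloc, result.hostname,
          result.username, result.password, result.port)
```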

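The urlsplit() function is an alternative to urlparse(). It behaves slightly differently because it does not split the parameters from the URL. This is useful for URLs that follow RFC 2396, which allows parameters on each segment of the path.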

from urllib.parse import urlsplit

url = 'http://user:pwd@NetLoc:80/p1;para/p2;para?query=arg#frag'
parsed = urlsplit(url)
print(parsed)
print('scheme  :', parsed.scheme)
print('netloc  :', parsed.netloc)
print('path    :', parsed.path)
print('query   :', parsed.query)
print('fragment:', parsed.fragment)
print('username:', parsed.username)
print('password:', parsed.password)
print('hostname:', parsed.hostname)
print('port    :', parsed.port)

Because the parameters are not split out, the tuple API shows five elements instead of six, and there is no params attribute.
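
To make the difference concrete, this sketch runs both functions on the same URL. Note that urlparse() only splits the parameters off the last path segment, while urlsplit() leaves the path untouched. The variable names are illustrative.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://netloc/p1;para/p2;para?query=arg#frag'

params_result = urlparse(url)
split_result = urlsplit(url)

# urlparse() removes ';para' from the last segment only.
print(len(params_result), params_result.path, params_result.params)

# urlsplit() keeps every ';para' inside the path and has no params field.
print(len(split_result), split_result.path)
print(hasattr(split_result, 'params'))
```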

To strip a fragment identifier from a URL, such as finding the base page name from a URL, you can use urldefrag().

from urllib.parse import urldefrag

original = 'http://netloc/path;param?query=arg#frag'
print('original:', original)
d = urldefrag(original)
print('url     :', d.url)
print('fragment:', d.fragment)

The return value is a DefragResult, based on namedtuple, that contains the base URL and the fragment.
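
Because DefragResult is a two-element tuple, it can also be unpacked directly. The sketch below additionally shows what comes back when the input has no fragment. The variable names are illustrative.

```python
from urllib.parse import urldefrag

# Unpack the two fields: the URL without the fragment, and the fragment.
base, fragment = urldefrag('http://netloc/path;param?query=arg#frag')
print(base, fragment)

# Without a fragment the URL is returned unchanged and the fragment is empty.
plain_base, plain_fragment = urldefrag('http://netloc/path')
print(repr(plain_base), repr(plain_fragment))
```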

1.2 Unparsing

There are several ways to reassemble the parts of a split URL into a single string. The parsed URL object has a geturl() method.

from urllib.parse import urlparse

original = 'http://netloc/path;param?query=arg#frag'
print('ORIG  :', original)
parsed = urlparse(original)
print('PARSED:', parsed.geturl())

geturl() only applies to objects returned by urlparse() or urlsplit().
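
A short sketch of that restriction: both result types provide geturl(), but a plain tuple built from the same values does not. The variable names are illustrative.

```python
from urllib.parse import urlparse, urlsplit

original = 'http://netloc/path;param?query=arg#frag'

parsed = urlparse(original)
split = urlsplit(original)
as_tuple = tuple(parsed)  # plain tuple, no longer a ParseResult

print(parsed.geturl() == original)
print(split.geturl() == original)
print(hasattr(as_tuple, 'geturl'))
```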

With urlunparse(), ordinary tuples containing strings can be reassembled into a single URL.

from urllib.parse import urlparse, urlunparse

original = 'http://netloc/path;param?query=arg#frag'
print('ORIG  :', original)
parsed = urlparse(original)
print('PARSED:', type(parsed), parsed)
t = parsed[:]
print('TUPLE :', type(t), t)
print('NEW   :', urlunparse(t))

Although the ParseResult returned by urlparse() can be used as a tuple, this example explicitly creates a new tuple to show that urlunparse() also applies to ordinary tuples.

If the input URL contains redundant parts, they may be dropped from the reconstructed URL.

from urllib.parse import urlparse, urlunparse

original = 'http://netloc/path;?#'
print('ORIG  :', original)
parsed = urlparse(original)
print('PARSED:', type(parsed), parsed)
t = parsed[:]
print('TUPLE :', type(t), t)
print('NEW   :', urlunparse(t))

Here, the parameters, query, and fragment are all empty in the original URL. The new URL does not look the same as the original, but it is equivalent according to the standard.
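
One way to check the equivalence is to parse the rebuilt URL again and compare the components. This is only a sketch, and the variable names are illustrative.

```python
from urllib.parse import urlparse, urlunparse

original = 'http://netloc/path;?#'
parsed = urlparse(original)
rebuilt = urlunparse(parsed)

print('ORIG   :', original)
print('REBUILT:', rebuilt)

# The strings differ, but both parse to the same six components.
print(rebuilt != original)
print(urlparse(rebuilt) == parsed)
```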

1.3 Joining

In addition to parsing URLs, urllib.parse includes the urljoin() function for constructing absolute URLs from relative fragments.

from urllib.parse import urljoin

print(urljoin('http://www.example.com/path/file.html',
              'anotherfile.html'))
print(urljoin('http://www.example.com/path/file.html',
              '../anotherfile.html'))

In this example, the relative portion of the path ('../') is taken into account when the second URL is computed.

Non-relative paths are handled in the same way as by os.path.join().

from urllib.parse import urljoin

print(urljoin('http://www.example.com/path/',
              '/subpath/file.html'))
print(urljoin('http://www.example.com/path/',
              'subpath/file.html'))

If the path being joined to the URL starts with a slash (/), urljoin() resets the URL's path to the top level. If it does not start with a slash, the new path value is appended to the end of the URL's existing path.
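
The trailing slash on the base URL matters here. As a sketch with illustrative variable names and example hosts, compare a base path that ends with a slash, one that does not, and a second argument that is already an absolute URL:

```python
from urllib.parse import urljoin

# Base path ends with a slash: the relative path is appended to it.
with_slash = urljoin('http://www.example.com/path/', 'subpath/file.html')

# Base path has no trailing slash: its last segment ('path') is replaced.
without_slash = urljoin('http://www.example.com/path', 'subpath/file.html')

# An absolute URL as the second argument replaces the base entirely.
absolute = urljoin('http://www.example.com/path/',
                   'http://other.example.com/file.html')

print(with_slash)
print(without_slash)
print(absolute)
```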

1.4 Encoding and Decoding Query Parameters

Arguments need to be encoded before they can be added to a URL.

from urllib.parse import urlencode

query_args = {
    'q': 'query string',
    'foo': 'bar',
}
encoded_args = urlencode(query_args)
print('Encoded:', encoded_args)

Encoding replaces special characters such as spaces to ensure they are passed to the server in a standard format.
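
The sketch below shows why this matters when a value itself contains characters that have meaning in a query string. The base URL and variable names are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

base_url = 'http://www.example.com/search'  # hypothetical endpoint
query_args = {'q': 'a&b=c d'}

# '&' and '=' inside the value are percent-encoded, and the space becomes '+',
# so the server still sees a single 'q' argument.
encoded_args = urlencode(query_args)
full_url = base_url + '?' + encoded_args

print(encoded_args)
print(full_url)
```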

To pass a sequence of values using separate occurrences of the variable in the query string, set doseq to True when calling urlencode().

from urllib.parse import urlencode

query_args = {
    'foo': ['foo1', 'foo2'],
}
print('Single  :', urlencode(query_args))
print('Sequence:', urlencode(query_args, doseq=True))

The result is a query string containing multiple values associated with a name.
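
As a sketch of the difference, without doseq the whole list is converted to its string form and quoted as one value. A sequence of two-element tuples is another way to repeat a name and needs no doseq. The variable names are illustrative.

```python
from urllib.parse import urlencode

query_args = {'foo': ['foo1', 'foo2']}

single = urlencode(query_args)                # list quoted as one string
sequence = urlencode(query_args, doseq=True)  # one pair per list item

# Repeating the name in a list of pairs gives the same result as doseq=True.
pairs = urlencode([('foo', 'foo1'), ('foo', 'foo2')])

print(single)
print(sequence)
print(pairs)
```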

To decode this query string, use parse_qs() or parse_qsl().

from urllib.parse import parse_qs, parse_qsl

encoded = 'foo=foo1&foo=foo2'

print('parse_qs :', parse_qs(encoded))
print('parse_qsl:', parse_qsl(encoded))

The return value of parse_qs() is a dictionary that maps each name to a list of values, while parse_qsl() returns a list of tuples, each containing a name and a value.
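
A minimal round-trip sketch: decode a query string with both functions, then feed the dictionary back to urlencode() with doseq=True to recover the original string. The variable names are illustrative.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

encoded = 'foo=foo1&foo=foo2&q=query+string'

as_dict = parse_qs(encoded)    # name -> list of values
as_pairs = parse_qsl(encoded)  # list of (name, value) tuples

print(as_dict)
print(as_pairs)

# Lists of values need doseq=True to be encoded as repeated names.
print(urlencode(as_dict, doseq=True) == encoded)
```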

Special characters within the query arguments that might cause parsing problems on the server side are quoted automatically when passed to urlencode(). To quote them locally and make safe versions of the strings, use the quote() or quote_plus() functions directly.

from urllib.parse import quote, quote_plus, urlencode

url = 'http://localhost:8080/~hellmann/'
print('urlencode() :', urlencode({'url': url}))
print('quote()     :', quote(url))
print('quote_plus():', quote_plus(url))

quote_plus() replaces more characters than quote(): it also encodes the slash (/) and turns spaces into plus signs.
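
The difference is easiest to see on a short string containing a space and a slash. In this sketch the sample text and variable names are illustrative; the safe argument of quote() controls which characters are left alone and defaults to '/'.

```python
from urllib.parse import quote, quote_plus

text = 'a b/c'

default_quote = quote(text)         # '/' is safe by default, space -> %20
strict_quote = quote(text, safe='') # nothing is safe, '/' is encoded too
plus_quote = quote_plus(text)       # space -> '+', '/' is encoded

print(default_quote)
print(strict_quote)
print(plus_quote)
```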

To reverse the quote operations, use unquote() or unquote_plus(), as appropriate.

from urllib.parse import unquote, unquote_plus

print(unquote('http%3A//localhost%3A8080/%7Ehellmann/'))
print(unquote_plus(
    'http%3A%2F%2Flocalhost%3A8080%2F%7Ehellmann%2F'
))

The encoded value is converted back to a normal URL string.
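
Choosing the matching function matters because only unquote_plus() treats a plus sign as a space. A short sketch, with an illustrative sample string and variable names:

```python
from urllib.parse import quote_plus, unquote, unquote_plus

encoded = 'a%20b+c'

plain = unquote(encoded)            # '+' is left as a literal plus sign
plus_aware = unquote_plus(encoded)  # '+' is decoded to a space

print(repr(plain))
print(repr(plus_aware))

# quote_plus() and unquote_plus() are inverses of each other.
round_trip = unquote_plus(quote_plus('a b/c'))
print(repr(round_trip))
```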


Posted on Tue, 07 Apr 2020 19:14:11 -0700 by jjbarnone