uritools — RFC 3986 compliant replacement for urlparse

This module defines RFC 3986 compliant replacements for the most commonly used functions of the Python 2.7 Standard Library urlparse and Python 3 urllib.parse modules.

>>> from uritools import urisplit, uriunsplit, urijoin, uridefrag
>>> parts = urisplit('foo://tkem@example.com:8042/over/there?name=ferret#nose')
>>> parts
SplitResultString(scheme='foo', authority='tkem@example.com:8042',
                  path='/over/there', query='name=ferret', fragment='nose')
>>> parts.scheme
'foo'
>>> parts.authority
'tkem@example.com:8042'
>>> parts.userinfo
'tkem'
>>> parts.host
'example.com'
>>> parts.port
'8042'
>>> uriunsplit(parts[:3] + ('name=swallow&type=African', 'beak'))
'foo://tkem@example.com:8042/over/there?name=swallow&type=African#beak'
>>> urijoin('http://www.cwi.nl/~guido/Python.html', 'FAQ.html')
'http://www.cwi.nl/~guido/FAQ.html'
>>> uridefrag('http://pythonhosted.org/uritools/index.html#constants')
DefragResult(base='http://pythonhosted.org/uritools/index.html',
             fragment='constants')
>>> urisplit('http://www.xn--lkrbis-vxa4c.at/').gethost(encoding='idna')
'www.ölkürbis.at'

For various reasons, the Python 2 urlparse module is not compliant with current Internet standards, does not include Unicode support, and is generally unusable with proprietary URI schemes. Python 3’s urllib.parse improves Unicode support, but the other issues remain. As stated in Lib/urllib/parse.py:

RFC 3986 is considered the current standard and any future changes
to urlparse module should conform with it.  The urlparse module is
currently not entirely compliant with this RFC due to defacto
scenarios for parsing, and for backward compatibility purposes,
some parsing quirks from older RFCs are retained.

This module aims to provide fully RFC 3986 compliant replacements for the most commonly used functions found in urlparse, plus additional functions for conveniently composing URIs from their individual components.

See also

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
The current Internet standard (STD66) defining URI syntax, to which any changes to uritools should conform. If deviations are observed, the module’s implementation should be changed, even if this means breaking backward compatibility.

URI Parsing

uritools.urisplit(string)

Split a well-formed URI string into a tuple with five components corresponding to a URI’s general structure:

<scheme>://<authority>/<path>?<query>#<fragment>

The return value is an instance of a subclass of collections.namedtuple with the following read-only attributes:

Attribute  Index  Value
scheme     0      URI scheme, or None if not present
authority  1      Authority component, or None if not present
path       2      Path component, always present but may be empty
query      3      Query component, or None if not present
fragment   4      Fragment identifier, or None if not present
userinfo          Userinfo subcomponent of authority, or None if not present
host              Host subcomponent of authority, or None if not present
port              Port subcomponent of authority as a (possibly empty) string, or None if not present
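
As a rough illustration of the five-way split, the following sketch applies the regular expression given in RFC 3986, Appendix B, using only the standard library. It is not the uritools implementation: the name split_sketch is hypothetical, and the sketch returns a plain tuple rather than a named tuple with the extra userinfo, host and port attributes.

```python
import re

# Regular expression from RFC 3986, Appendix B, rewritten with
# non-capturing groups so that only the five components are captured.
URI_RE = re.compile(
    r'^(?:([^:/?#]+):)?'   # scheme
    r'(?://([^/?#]*))?'    # authority
    r'([^?#]*)'            # path (always matches, may be empty)
    r'(?:\?([^#]*))?'      # query
    r'(?:#(.*))?$',        # fragment
    re.DOTALL,
)


def split_sketch(uri):
    """Return (scheme, authority, path, query, fragment) for a URI string.

    Components that are absent come back as None; the path is always a
    string, possibly empty.
    """
    return URI_RE.match(uri).groups()


print(split_sketch('foo://tkem@example.com:8042/over/there?name=ferret#nose'))
print(split_sketch('mailto:tkem@example.com'))
```

Note that an empty component is not the same as a missing one: in '//example.com?' the query is the empty string, while the fragment is None.
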
uritools.uridefrag(string)

Remove an existing fragment component from a URI string.

The return value is an instance of a subclass of collections.namedtuple with the following read-only attributes:

Attribute  Index  Value
base       0      Absolute URI or relative URI reference without the fragment identifier
fragment   1      Fragment identifier, or None if not present
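
The base/fragment split can be sketched with str.partition(). This is only an illustration under the assumption that the first '#' starts the fragment, as in RFC 3986; the helper name defrag_sketch is hypothetical and the result is a plain tuple, not a DefragResult.

```python
def defrag_sketch(uri):
    """Return (base, fragment); fragment is None if the URI has no '#'."""
    base, sep, fragment = uri.partition('#')
    # An empty fragment ("...#") is kept as '', a missing one becomes None.
    return base, (fragment if sep else None)


print(defrag_sketch('http://pythonhosted.org/uritools/index.html#constants'))
print(defrag_sketch('http://pythonhosted.org/uritools/index.html'))
```
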

URI Composition

uritools.uricompose(scheme=None, authority=None, path='', query=None, fragment=None, delim=b'&', encoding='utf-8')

Compose a URI string from its components.

If query is a mapping object or a sequence of two-element tuples, it will be converted to a string of key=value pairs separated by delim.

authority may be a Unicode string, a bytes-like object, or a tuple of three elements, specifying userinfo, host and port.
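
The two conversions described above might look roughly like the following standard-library sketch. Both helper names are hypothetical, delim is taken as a str here rather than bytes, and the choice to percent-encode everything except unreserved characters is an assumption; the exact set of characters uricompose() leaves unencoded may differ.

```python
from urllib.parse import quote


def compose_query_sketch(query, delim='&', encoding='utf-8'):
    """Join a mapping or a sequence of (key, value) pairs into a query string."""
    pairs = query.items() if hasattr(query, 'items') else query
    return delim.join(
        quote(key, safe='', encoding=encoding)
        + '='
        + quote(value, safe='', encoding=encoding)
        for key, value in pairs
    )


def compose_authority_sketch(userinfo, host, port):
    """Build userinfo@host:port, leaving out subcomponents that are None."""
    result = host
    if userinfo is not None:
        result = userinfo + '@' + result
    if port is not None:
        result = result + ':' + str(port)
    return result


print(compose_query_sketch([('name', 'swallow'), ('type', 'African bird')]))
print(compose_authority_sketch('tkem', 'example.com', 8042))
```
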

uritools.urijoin(base, ref, strict=False)

Convert a URI reference relative to a base URI to its target URI string.

If strict is False, a scheme in the reference is ignored if it is identical to the base URI’s scheme.
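
To make the effect of strict concrete, here is a standard-library sketch of the reference resolution algorithm from RFC 3986, section 5.2. It follows the pseudocode in the RFC rather than the uritools source, and the names URI_RE, remove_dot_segments and join_sketch are hypothetical.

```python
import re

# RFC 3986, Appendix B: scheme, authority, path, query, fragment.
URI_RE = re.compile(
    r'^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$',
    re.DOTALL,
)


def remove_dot_segments(path):
    """Remove '.' and '..' segments (RFC 3986, section 5.2.4)."""
    output = []
    while path:
        if path.startswith('../'):
            path = path[3:]
        elif path.startswith('./'):
            path = path[2:]
        elif path.startswith('/./'):
            path = path[2:]
        elif path == '/.':
            path = '/'
        elif path.startswith('/../') or path == '/..':
            path = '/' + path[4:]
            if output:
                output.pop()
        elif path in ('.', '..'):
            path = ''
        else:
            # Move the first segment, including any leading '/', to the output.
            end = path.find('/', 1)
            end = len(path) if end == -1 else end
            output.append(path[:end])
            path = path[end:]
    return ''.join(output)


def join_sketch(base, ref, strict=False):
    """Resolve ref against base (RFC 3986, section 5.2.2)."""
    bscheme, bauth, bpath, bquery, _ = URI_RE.match(base).groups()
    scheme, auth, path, query, fragment = URI_RE.match(ref).groups()
    if not strict and scheme == bscheme:
        scheme = None  # non-strict: ignore a scheme identical to the base scheme
    if scheme is not None:
        path = remove_dot_segments(path)
    else:
        scheme = bscheme
        if auth is not None:
            path = remove_dot_segments(path)
        else:
            auth = bauth
            if not path:
                path = bpath
                query = bquery if query is None else query
            elif path.startswith('/'):
                path = remove_dot_segments(path)
            elif bauth is not None and not bpath:
                path = remove_dot_segments('/' + path)
            else:
                # Merge: keep the base path up to and including its last '/'.
                path = remove_dot_segments(bpath[:bpath.rfind('/') + 1] + path)
    # Recompose the target (RFC 3986, section 5.3).
    result = path
    if auth is not None:
        result = '//' + auth + result
    if scheme is not None:
        result = scheme + ':' + result
    if query is not None:
        result += '?' + query
    if fragment is not None:
        result += '#' + fragment
    return result


base = 'http://a/b/c/d;p?q'
print(join_sketch(base, '../g'))
print(join_sketch(base, 'http:g', strict=True))
print(join_sketch(base, 'http:g', strict=False))
```

With the base URI from RFC 3986, section 5.4, the reference 'http:g' stays 'http:g' under strict resolution, while the non-strict variant treats it like the relative reference 'g' and yields 'http://a/b/c/g'.
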

uritools.uriunsplit(parts)

Combine the elements of a five-item iterable into a URI string.
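
Recomposition as described in RFC 3986, section 5.3, can be sketched in a few lines. The helper name unsplit_sketch is hypothetical, and the sketch assumes that absent components are passed as None, matching what urisplit() reports.

```python
def unsplit_sketch(parts):
    """Recompose (scheme, authority, path, query, fragment) into a string."""
    scheme, authority, path, query, fragment = parts
    result = ''
    if scheme is not None:
        result += scheme + ':'
    if authority is not None:
        result += '//' + authority
    result += path
    # An empty query or fragment still emits its delimiter; None does not.
    if query is not None:
        result += '?' + query
    if fragment is not None:
        result += '#' + fragment
    return result


print(unsplit_sketch(('foo', 'tkem@example.com:8042', '/over/there',
                      'name=swallow&type=African', 'beak')))
print(unsplit_sketch(('http', 'example.com', '/', '', None)))
```
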

URI Encoding

uritools.uridecode(string, encoding='utf-8')

Replace any percent-encodings in string, and return a decoded version of the string, using the codec registered for encoding.

string may be either a Unicode string or a bytes-like object.

uritools.uriencode(string, safe=b'', encoding='utf-8')

Encode string using the codec registered for encoding, replacing any characters not in UNRESERVED or safe with their corresponding percent-encodings.

string may be either a Unicode string or a bytes-like object, while safe must be a bytes object containing ASCII characters only.

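This function should not be confused with urllib.urlencode() (urllib.parse.urlencode() in Python 3), which builds a query string from a mapping or a sequence of pairs rather than percent-encoding a single string.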
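
The encoding and decoding rules can be illustrated with a small standard-library sketch. The names UNRESERVED_SKETCH, encode_sketch and decode_sketch are hypothetical, both helpers return str, and malformed percent-encodings are not handled; the return types and error handling of uriencode() and uridecode() may differ.

```python
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986, section 2.3)
UNRESERVED_SKETCH = (string.ascii_letters + string.digits + '-._~').encode('ascii')


def encode_sketch(value, safe=b'', encoding='utf-8'):
    """Percent-encode every byte that is neither unreserved nor in safe."""
    data = value.encode(encoding) if isinstance(value, str) else bytes(value)
    keep = UNRESERVED_SKETCH + safe
    return ''.join(chr(b) if b in keep else '%{:02X}'.format(b) for b in data)


def decode_sketch(value, encoding='utf-8'):
    """Replace each %XX triplet by its byte, then decode the result."""
    data = value.encode(encoding) if isinstance(value, str) else bytes(value)
    head, *rest = data.split(b'%')
    chunks = [head]
    for chunk in rest:
        # The two characters after each '%' are hexadecimal digits.
        chunks.append(bytes([int(chunk[:2], 16)]) + chunk[2:])
    return b''.join(chunks).decode(encoding)


print(encode_sketch('a b/c'))
print(encode_sketch('a b/c', safe=b'/'))
print(decode_sketch('%C3%B6lk%C3%BCrbis'))
```
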

Constants

uritools.UNRESERVED

Unreserved characters specified in RFC 3986 as a bytes object.

uritools.RESERVED

Reserved characters specified in RFC 3986 as a bytes object.

uritools.GEN_DELIMS

General delimiting characters specified in RFC 3986 as a bytes object.

uritools.SUB_DELIMS

Subcomponent delimiting characters specified in RFC 3986 as a bytes object.
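
For reference, the character sets named above are defined in RFC 3986, sections 2.2 and 2.3. The sketch below spells them out as bytes objects; the *_SKETCH names are hypothetical, and the order of characters inside the actual uritools constants is not specified here.

```python
import string

# gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
GEN_DELIMS_SKETCH = b":/?#[]@"

# sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS_SKETCH = b"!$&'()*+,;="

# reserved = gen-delims / sub-delims
RESERVED_SKETCH = GEN_DELIMS_SKETCH + SUB_DELIMS_SKETCH

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED_SKETCH = (string.ascii_letters + string.digits + "-._~").encode("ascii")

# The reserved and unreserved sets do not overlap.
print(set(RESERVED_SKETCH) & set(UNRESERVED_SKETCH))
```
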

Structured Parse Results

The result objects from the urisplit() and uridefrag() functions are instances of subclasses of collections.namedtuple. These objects contain the attributes described in the function documentation, as well as some additional convenience methods:

class uritools.SplitResult

Base class to hold urisplit() results.

Do not try to create instances of this class directly. Use the urisplit() factory function instead.

getaddrinfo(port=None, family=0, type=0, proto=0, flags=0)

Translate the host and port subcomponents of the URI authority into a sequence of 5-tuples as reported by socket.getaddrinfo().

If the URI authority does not contain a port subcomponent, or the port subcomponent is empty, the optional port argument is used. If no port argument is given, the URI scheme is interpreted as a service name, and the port number for that service is used. If no matching service is found, None is passed to socket.getaddrinfo() for the port value.

The optional family, type, proto and flags arguments are passed to socket.getaddrinfo() unchanged.
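
The port fallback order described above can be sketched as follows. The helper pick_port_sketch and its lookup parameter are hypothetical and exist only to make the logic testable; the sketch assumes the scheme lookup is done with socket.getservbyname(), which raises OSError when no matching service is found.

```python
import socket


def pick_port_sketch(uri_port, scheme, port=None, lookup=socket.getservbyname):
    """Choose the port value to hand to socket.getaddrinfo().

    uri_port is the port subcomponent as a string, '' or None.
    """
    if uri_port:
        return int(uri_port)      # a non-empty port in the URI wins
    if port is not None:
        return port               # then the optional port argument
    if scheme is not None:
        try:
            return lookup(scheme)  # then the scheme as a service name
        except OSError:
            pass
    return None                   # no port information at all


print(pick_port_sketch('8042', 'foo'))
print(pick_port_sketch('', 'foo', port=8080))
```
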

getauthority(default=None, encoding='utf-8')

Return the decoded URI authority, or default if the original URI did not contain an authority component.

getfragment(default=None, encoding='utf-8')

Return the decoded fragment identifier, or default if the original URI did not contain a fragment component.

gethost(default=None, encoding='utf-8')

Return the decoded host subcomponent of the URI authority, or default if the original URI did not contain a host.

If the host represents an internationalized domain name intended for resolution via DNS, the 'idna' encoding must be specified to return a Unicode domain name.
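
The difference the 'idna' encoding makes can be seen with the codecs that ship with Python. This sketch works on a bare host name taken from the example at the top of this page and does not call gethost() itself.

```python
punycode_host = b'www.xn--lkrbis-vxa4c.at'

# The 'idna' codec converts each "xn--" label back to Unicode.
unicode_host = punycode_host.decode('idna')
print(unicode_host)

# A plain text codec such as UTF-8 leaves the ASCII form untouched.
print(punycode_host.decode('utf-8'))
```
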

getpath(encoding='utf-8')

Return the decoded URI path.

getport(default=None)

Return the port subcomponent of the URI authority as an int, or default if the original URI did not contain a port, or if the port was empty.

getquery(default=None, encoding='utf-8')

Return the decoded query string, or default if the original URI did not contain a query component.

getquerydict(delims=b';&', encoding='utf-8')

Split the query string into individual components using the delimiter characters in delims, and return a dictionary of query parameters.

The dictionary keys are the unique decoded query parameter names, and the values are lists of decoded values for each name, with names and values separated by '='.

getquerylist(delims=b';&', encoding='utf-8')

Split the query string into individual components using the delimiter characters in delims, and return a list of (name, value) pairs, where names and values are separated by '='.
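
The splitting performed by getquerylist() and getquerydict() can be approximated with the standard library. The helper names are hypothetical and delims is a str here rather than bytes. Two details are assumptions of this sketch rather than documented behavior: empty components are skipped, and a parameter without '=' gets the value None.

```python
import re
from urllib.parse import unquote


def querylist_sketch(query, delims=';&', encoding='utf-8'):
    """Split a query string into a list of decoded (name, value) pairs."""
    pairs = []
    for item in re.split('[' + re.escape(delims) + ']', query):
        if not item:
            continue  # skip empty components such as in "a=1&&b=2"
        name, sep, value = item.partition('=')
        pairs.append((
            unquote(name, encoding=encoding),
            unquote(value, encoding=encoding) if sep else None,
        ))
    return pairs


def querydict_sketch(query, delims=';&', encoding='utf-8'):
    """Group the pairs into a dict mapping each name to a list of values."""
    result = {}
    for name, value in querylist_sketch(query, delims, encoding):
        result.setdefault(name, []).append(value)
    return result


print(querylist_sketch('name=ferret;type=small&name=stoat'))
print(querydict_sketch('name=ferret;type=small&name=stoat'))
```
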

getscheme(default=None)

Return the URI scheme in canonical (lowercase) form, or default if the original URI did not contain a scheme component.

geturi()

Return the recombined version of the original URI as a string.

getuserinfo(default=None, encoding='utf-8')

Return the decoded userinfo subcomponent of the URI authority, or default if the original URI did not contain a userinfo field.

transform(ref, strict=False)

Convert a URI reference relative to self into a SplitResult representing its target.

If strict is False, a scheme in the reference is ignored if it is identical to self.scheme.

class uritools.DefragResult

Class to hold uridefrag() results.

Do not try to create instances of this class directly. Use the uridefrag() factory function instead.

getfragment(default=None, encoding='utf-8')

Return the decoded fragment identifier, or default if the original URI did not contain a fragment component.

geturi()

Return the recombined version of the original URI as a string.
