uritools — RFC 3986 compliant replacement for urlparse

This module defines RFC 3986 compliant replacements for the most commonly used functions of the Python 2.7 Standard Library urlparse and Python 3 urllib.parse modules.

>>> from uritools import urisplit, uriunsplit, urijoin, uridefrag
>>> parts = urisplit('foo://user@example.com:8042/over/there?name=ferret#nose')
>>> parts
SplitResult(scheme='foo', authority='user@example.com:8042',
            path='/over/there', query='name=ferret', fragment='nose')
>>> parts.scheme
'foo'
>>> parts.authority
'user@example.com:8042'
>>> parts.userinfo
'user'
>>> parts.host
'example.com'
>>> parts.port
'8042'
>>> uriunsplit(parts[:3] + ('name=swallow&type=African', 'beak'))
'foo://user@example.com:8042/over/there?name=swallow&type=African#beak'
>>> urijoin('http://www.cwi.nl/~guido/Python.html', 'FAQ.html')
'http://www.cwi.nl/~guido/FAQ.html'
>>> uridefrag('http://pythonhosted.org/uritools/index.html#constants')
DefragResult(uri='http://pythonhosted.org/uritools/index.html',
             fragment='constants')

For various reasons, the Python 2 urlparse module is not compliant with current Internet standards, does not include Unicode support, and is generally unusable with proprietary URI schemes. Python 3’s urllib.parse improves on Unicode support, but the other issues still remain. As stated in Lib/urllib/parse.py:

RFC 3986 is considered the current standard and any future changes
to urlparse module should conform with it.  The urlparse module is
currently not entirely compliant with this RFC due to defacto
scenarios for parsing, and for backward compatibility purposes,
some parsing quirks from older RFCs are retained.

This module aims to provide fully RFC 3986 compliant replacements for the most commonly used functions found in urlparse and urllib.parse, plus additional functions for conveniently composing URIs from their individual components.

See also

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
The current Internet standard (STD 66) defining URI syntax, to which any changes to uritools should conform. If deviations are observed, the module’s implementation should be changed, even if this means breaking backward compatibility.

URI Decomposition

uritools.urisplit(uristring)

Split a well-formed URI string into a tuple with five components corresponding to a URI’s general structure:

<scheme>://<authority>/<path>?<query>#<fragment>

The return value is an instance of a subclass of collections.namedtuple with the following read-only attributes:

Attribute  Index  Value
scheme     0      URI scheme, or None if not present
authority  1      Authority component, or None if not present
path       2      Path component, always present but may be empty
query      3      Query component, or None if not present
fragment   4      Fragment identifier, or None if not present
userinfo          Userinfo subcomponent of authority, or None if not present
host              Host subcomponent of authority, or None if not present
port              Port subcomponent of authority as a (possibly empty) string, or None if not present
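
The five-way split can be illustrated with the regular expression given in RFC 3986, Appendix B. What follows is a minimal standard-library sketch of the idea, not the uritools implementation; the names URI_RE, Parts, split_uri and split_authority are hypothetical and introduced only for this example.

```python
import re
from collections import namedtuple

# Regular expression from RFC 3986, Appendix B (non-capturing variant).
URI_RE = re.compile(
    r'^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?'
)

# Hypothetical result type; it mirrors only the five indexed fields.
Parts = namedtuple('Parts', 'scheme authority path query fragment')


def split_uri(uristring):
    """Split a URI reference into its five general components."""
    return Parts(*URI_RE.match(uristring).groups())


def split_authority(authority):
    """Split an authority into (userinfo, host, port), None where absent."""
    if authority is None:
        return None, None, None
    userinfo, at, hostport = authority.rpartition('@')
    if not at:
        userinfo = None
    # A port can only follow the last ':' outside an IP literal's brackets.
    host, colon, port = hostport.rpartition(':')
    if not colon or ']' in port:
        host, port = hostport, None
    return userinfo, host, port
```

Under these assumptions, a missing component comes back as None while a present but empty one comes back as an empty string, which is the distinction the table above draws for query, fragment and port.
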

uritools.uridefrag(uristring)

Remove an existing fragment component from a URI string.

The return value is an instance of a subclass of collections.namedtuple with the following read-only attributes:

Attribute  Index  Value
uri        0      Absolute URI or relative URI reference without the fragment identifier
fragment   1      Fragment identifier, or None if not present
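
Because the fragment always starts at the first '#', the operation can be approximated with a single string partition. This is a rough sketch under that assumption; defrag is a hypothetical helper name, and it returns a plain tuple rather than a DefragResult.

```python
def defrag(uristring):
    """Return (uri, fragment); fragment is None when there is no '#'."""
    uri, hash_sign, fragment = uristring.partition('#')
    # An empty fragment ('...#') is reported as '', a missing one as None.
    return uri, (fragment if hash_sign else None)
```
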

URI Composition

uritools.uriunsplit(parts)

Combine the elements of a five-item iterable into a URI string.
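
Recomposition is described in RFC 3986, section 5.3: each delimiter is emitted only when its component is defined. The sketch below follows that pseudocode using plain strings; unsplit is a hypothetical name, and the code is an illustration rather than the library's own implementation.

```python
def unsplit(parts):
    """Recombine (scheme, authority, path, query, fragment) into a string."""
    scheme, authority, path, query, fragment = parts
    result = []
    if scheme is not None:
        result += [scheme, ':']
    if authority is not None:
        result += ['//', authority]
    result.append(path)  # the path is always present, possibly empty
    if query is not None:
        result += ['?', query]
    if fragment is not None:
        result += ['#', fragment]
    return ''.join(result)
```

Testing against None rather than truthiness preserves empty but present components, so a URI ending in a bare '?' or '#' survives a split and unsplit round trip.
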

uritools.urijoin(base, ref, strict=False)

Convert a URI reference relative to a base URI to its target URI string.

If strict is False, a scheme in the reference is ignored if it is identical to the base URI’s scheme.
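
Reference resolution, including the effect of strict, is specified in RFC 3986, section 5.2. The following sketch transcribes that algorithm with the standard library only; URI_RE, remove_dot_segments and join are hypothetical names, the scheme comparison is case-sensitive for brevity, and nothing here is claimed to match the internals of urijoin().

```python
import re

# Regular expression from RFC 3986, Appendix B (non-capturing variant).
URI_RE = re.compile(
    r'^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?'
)


def remove_dot_segments(path):
    """Remove '.' and '..' segments as in RFC 3986, section 5.2.4."""
    output = []
    while path:
        if path.startswith('../'):
            path = path[3:]
        elif path.startswith('./'):
            path = path[2:]
        elif path.startswith('/./'):
            path = path[2:]
        elif path == '/.':
            path = '/'
        elif path.startswith('/../') or path == '/..':
            path = '/' + path[4:]
            if output:
                output.pop()
        elif path in ('.', '..'):
            path = ''
        else:
            # Move the first segment (with its leading '/') to the output.
            end = path.find('/', 1)
            end = len(path) if end < 0 else end
            output.append(path[:end])
            path = path[end:]
    return ''.join(output)


def join(base, ref, strict=False):
    """Resolve ref against base as in RFC 3986, section 5.2.2."""
    bscheme, bauth, bpath, bquery, _ = URI_RE.match(base).groups()
    scheme, auth, path, query, fragment = URI_RE.match(ref).groups()
    if not strict and scheme == bscheme:
        scheme = None  # non-strict: ignore a scheme identical to the base's
    if scheme is not None:
        path = remove_dot_segments(path)
    elif auth is not None:
        scheme = bscheme
        path = remove_dot_segments(path)
    else:
        scheme, auth = bscheme, bauth
        if not path:
            path = bpath
            if query is None:
                query = bquery
        elif path.startswith('/'):
            path = remove_dot_segments(path)
        else:
            # Merge paths as in section 5.2.3.
            if bauth is not None and not bpath:
                merged = '/' + path
            else:
                merged = bpath[:bpath.rfind('/') + 1] + path
            path = remove_dot_segments(merged)
    result = '' if scheme is None else scheme + ':'
    if auth is not None:
        result += '//' + auth
    result += path
    if query is not None:
        result += '?' + query
    if fragment is not None:
        result += '#' + fragment
    return result
```

With the base 'http://a/b/c/d;p?q' used in the RFC's examples, the reference 'http:g' resolves to 'http:g' when strict is true and to 'http://a/b/c/g' otherwise.
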

uritools.uricompose(scheme=None, authority=None, path='', query=None, fragment=None, userinfo=None, host=None, port=None, delim=b'&', encoding='utf-8')

Compose a URI string from its individual components.

authority may be a Unicode string, bytes object, or a three-item iterable specifying userinfo, host and port subcomponents. If both authority and any of the userinfo, host or port keyword arguments are given, the keyword argument will override the corresponding authority subcomponent.

If query is a mapping object or a sequence of two-element tuples, it will be converted to a string of key=value pairs separated by delim.

The returned value is of type str.
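
The conversion of a mapping or a sequence of pairs into a query string might look roughly like the sketch below. It relies on urllib.parse.quote from the standard library and makes several assumptions that the text above does not confirm: compose_query is a hypothetical helper, delim is taken as a str rather than bytes, and every character outside the unreserved set is percent-encoded.

```python
from collections.abc import Mapping
from urllib.parse import quote


def compose_query(query, delim='&', encoding='utf-8'):
    """Join key=value pairs into a percent-encoded query string."""
    pairs = query.items() if isinstance(query, Mapping) else query

    def enc(value):
        # Assumption: escape everything except RFC 3986 unreserved characters.
        return quote(str(value), safe='', encoding=encoding)

    return delim.join(enc(key) + '=' + enc(value) for key, value in pairs)
```

A sequence of tuples keeps its order and allows repeated keys, which a mapping cannot express.
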

URI Encoding

uritools.uriencode(uristring, safe=b'', encoding='utf-8', errors='strict')

Encode a URI string or string component.

If uristring is a bytes object, replace any characters not in UNRESERVED or safe with their corresponding percent-encodings and return the result as a bytes object. Otherwise, first encode uristring using the codec registered for encoding, then percent-encode the result in the same way.

Note that uristring may be either a Unicode string or a bytes object, while safe must be a bytes object containing ASCII characters only.
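
The two-step behaviour (encode text to bytes, then percent-encode every byte outside the allowed set) can be sketched as follows. This is an illustration only: encode is a hypothetical name, the UNRESERVED constant is redefined locally, and returning bytes for text input as well is an assumption of the sketch.

```python
# Unreserved characters per RFC 3986, section 2.3 (local copy for the sketch).
UNRESERVED = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              b'abcdefghijklmnopqrstuvwxyz'
              b'0123456789-._~')


def encode(uristring, safe=b'', encoding='utf-8', errors='strict'):
    """Percent-encode all bytes that are not unreserved or listed in safe."""
    if not isinstance(uristring, bytes):
        uristring = uristring.encode(encoding, errors)
    keep = frozenset(UNRESERVED + safe)
    # Iterating over bytes yields integers, which '%02X' formats directly.
    return b''.join(
        bytes([byte]) if byte in keep else b'%%%02X' % byte
        for byte in uristring
    )
```

Passing safe=b'/' keeps path separators intact, for example when encoding a whole path rather than a single segment.
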

uritools.uridecode(uristring, encoding='utf-8', errors='strict')

Decode a URI string or string component.

If encoding is set to None, return the percent-decoded uristring as a bytes object. Otherwise, replace any percent-encodings and decode uristring using the codec registered for encoding, returning a Unicode string.
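
Decoding is the reverse: percent-decode to bytes first, then optionally decode the bytes to text. A minimal sketch using urllib.parse.unquote_to_bytes from the standard library, with decode as a hypothetical name:

```python
from urllib.parse import unquote_to_bytes


def decode(uristring, encoding='utf-8', errors='strict'):
    """Percent-decode uristring; return bytes when encoding is None."""
    raw = unquote_to_bytes(uristring)
    return raw if encoding is None else raw.decode(encoding, errors)
```

Setting encoding to None is useful when the decoded octets are not valid text in any known codec.
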

Character Constants

uritools.RESERVED

Reserved characters specified in RFC 3986 as a bytes object.

uritools.GEN_DELIMS

General delimiting characters specified in RFC 3986 as a bytes object.

uritools.SUB_DELIMS

Subcomponent delimiting characters specified in RFC 3986 as a bytes object.

uritools.UNRESERVED

Unreserved characters specified in RFC 3986 as a bytes object.
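
For reference, the four character sets as defined in RFC 3986, sections 2.2 and 2.3, can be written out as bytes objects. The values below are transcribed from the RFC as an illustration; the order of characters within the module's own constants is not specified here and may differ.

```python
import string

# gen-delims and sub-delims, RFC 3986 section 2.2
GEN_DELIMS = b":/?#[]@"
SUB_DELIMS = b"!$&'()*+,;="

# reserved = gen-delims / sub-delims
RESERVED = GEN_DELIMS + SUB_DELIMS

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~", RFC 3986 section 2.3
UNRESERVED = (string.ascii_letters + string.digits + '-._~').encode('ascii')
```
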

Structured Parse Results

The result objects from the urisplit() and uridefrag() functions are instances of subclasses of collections.namedtuple. These objects contain the attributes described in the function documentation, as well as some additional convenience methods.

class uritools.SplitResult

Base class to hold urisplit() results.

gethost(default=None)

Return the decoded host subcomponent of the URI authority, or default if the original URI did not contain a host.

gethostip(default=None)

Return the decoded host subcomponent of the URI authority as a string or an ipaddress address object, or default if the original URI did not contain a host.
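
The distinction between a registered name and an IP address can be illustrated with the standard ipaddress module. This is a sketch under simplifying assumptions: host_or_ip is a hypothetical function that takes the raw host string directly, and bracketed IPvFuture literals are not handled.

```python
import ipaddress


def host_or_ip(host, default=None):
    """Return an ipaddress object for IP hosts, else the host string."""
    if host is None:
        return default
    if host.startswith('[') and host.endswith(']'):
        # IP literal in brackets; assumed to be IPv6 (raises otherwise).
        return ipaddress.IPv6Address(host[1:-1])
    try:
        return ipaddress.IPv4Address(host)
    except ValueError:
        return host  # not dotted-decimal IPv4, so treat it as a name
```
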

getfragment(default=None, encoding='utf-8', errors='strict')

Return the decoded fragment identifier, or default if the original URI did not contain a fragment component.

getpath(encoding='utf-8', errors='strict')

Return the decoded URI path.

getport(default=None)

Return the port subcomponent of the URI authority as an int, or default if the original URI did not contain a port or if the port was empty.
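
Since the port attribute may be None or an empty string, the fallback logic amounts to a single truthiness test. A small sketch, with port_as_int as a hypothetical name operating on the raw port string:

```python
def port_as_int(port, default=None):
    """Convert a port string to int; fall back for None or ''."""
    return int(port) if port else default
```
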

getquery(default=None, encoding='utf-8', errors='strict')

Return the decoded query string, or default if the original URI did not contain a query component.

getquerydict(delims=b';&', encoding='utf-8', errors='strict')

Split the query string into individual components using the delimiter characters in delims, and return a dictionary of query parameters.

getquerylist(delims=b';&', encoding='utf-8', errors='strict')

Split the query string into individual components using the delimiter characters in delims and return a list of (name, value) pairs, with names and values separated by '='.
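
The splitting described for getquerylist() and getquerydict() can be sketched with the standard library as below. Several details are assumptions of this sketch rather than documented behaviour: query_list and query_dict are hypothetical names, delims is a str rather than bytes, empty items are skipped, a name without '=' gets the value None, and the dictionary collects values in lists so that repeated names are not lost.

```python
import re
from collections import defaultdict
from urllib.parse import unquote


def query_list(query, delims=';&', encoding='utf-8', errors='strict'):
    """Split a raw query string into a list of (name, value) pairs."""
    if not query:
        return []
    pairs = []
    # Build a character class so any single delimiter splits the string.
    for item in re.split('[' + re.escape(delims) + ']', query):
        if not item:
            continue
        name, eq, value = item.partition('=')
        pairs.append((
            unquote(name, encoding, errors),
            unquote(value, encoding, errors) if eq else None,
        ))
    return pairs


def query_dict(query, delims=';&', encoding='utf-8', errors='strict'):
    """Group the pairs from query_list() by name."""
    result = defaultdict(list)
    for name, value in query_list(query, delims, encoding, errors):
        result[name].append(value)
    return dict(result)
```

Splitting on the delimiters before percent-decoding matters: an encoded '%26' inside a value must not be mistaken for a separator.
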

getscheme(default=None)

Return the URI scheme in canonical (lowercase) form, or default if the original URI did not contain a scheme component.

geturi()

Return the recombined version of the original URI as a string.

getuserinfo(default=None, encoding='utf-8', errors='strict')

Return the decoded userinfo subcomponent of the URI authority, or default if the original URI did not contain a userinfo field.

transform(ref, strict=False)

Convert a URI reference relative to self into a SplitResult representing its target.

class uritools.DefragResult

Class to hold uridefrag() results.

getfragment(default=None, encoding='utf-8', errors='strict')

Return the decoded fragment identifier, or default if the original URI did not contain a fragment component.

geturi()

Return the recombined version of the original URI as a string.