uritools — RFC 3986 compliant replacement for urlparse

This module defines RFC 3986 compliant replacements for the most commonly used functions of the Python 2.7 Standard Library urlparse and Python 3 urllib.parse modules.

>>> from uritools import urisplit, uriunsplit, urijoin, uridefrag
>>> uri = urisplit('foo://example.com:8042/over/there?name=ferret#nose')
>>> uri
SplitResult(scheme='foo', authority='example.com:8042', path='/over/there',
            query='name=ferret', fragment='nose')
>>> uri.scheme
'foo'
>>> uri.authority
'example.com:8042'
>>> uri.host
'example.com'
>>> uri.port
8042
>>> uri.geturi()
'foo://example.com:8042/over/there?name=ferret#nose'
>>> uriunsplit(uri[:3] + ('name=swallow&type=African', 'beak'))
'foo://example.com:8042/over/there?name=swallow&type=African#beak'
>>> urijoin('http://www.cwi.nl/~guido/Python.html', 'FAQ.html')
'http://www.cwi.nl/~guido/FAQ.html'
>>> uridefrag('http://pythonhosted.org/uritools/index.html#constants')
DefragResult(base='http://pythonhosted.org/uritools/index.html',
             fragment='constants')
>>> urisplit('http://www.xn--lkrbis-vxa4c.at/').gethost(encoding='idna')
'www.ölkürbis.at'

For various reasons, the urlparse module is not compliant with current Internet standards, does not include Unicode support, and is generally unusable with proprietary URI schemes. As stated in Lib/urlparse.py:

RFC 3986 is considered the current standard and any future changes
to urlparse module should conform with it.  The urlparse module is
currently not entirely compliant with this RFC due to defacto
scenarios for parsing, and for backward compatibility purposes,
some parsing quirks from older RFCs are retained.

The uritools module aims to provide fully RFC 3986 compliant replacements for some commonly used functions found in urlparse, plus additional functions for handling Unicode, normalizing URI paths, and conveniently composing URIs from their individual components.

See also

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
The current Internet standard (STD 66) defining URI syntax, to which any changes to uritools should conform. If deviations are observed, the module’s implementation should be changed, even if this means breaking backward compatibility.

Replacement Functions for urlparse

uritools.urisplit(string)

Split a well-formed URI string into a tuple with five components corresponding to a URI’s general structure:

<scheme>://<authority>/<path>?<query>#<fragment>

The return value is an instance of a subclass of collections.namedtuple with the following read-only attributes:

Attribute   Index   Value
scheme      0       URI scheme, or None if not present
authority   1       Authority component, or None if not present
path        2       Path component, always present but may be empty
query       3       Query component, or None if not present
fragment    4       Fragment identifier, or None if not present
userinfo            Userinfo subcomponent of authority, or None if not present
host                Host subcomponent of authority, or None if not present
port                Port subcomponent of authority as an int, or None if not present
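
The userinfo, host and port attributes are derived from the authority component. As a rough sketch of that derivation, assuming the RFC 3986 layout [ userinfo "@" ] host [ ":" port ], the helper below splits an authority string with plain str methods. The name split_authority is hypothetical and not part of uritools; this is an illustration rather than the module's implementation, and it leaves the brackets of IPv6 literals in place.

```python
def split_authority(authority):
    """Return (userinfo, host, port); any element may be None."""
    if authority is None:
        return None, None, None
    # Everything before the last '@' is userinfo, the rest is host[:port].
    userinfo, at, hostport = authority.rpartition('@')
    if hostport.endswith(']') or ':' not in hostport:
        # No port present (this also covers bare IPv6 literals like '[::1]').
        host, port = hostport, None
    else:
        host, _, port = hostport.rpartition(':')
    return (
        userinfo if at else None,
        host or None,
        int(port) if port else None,  # an empty port counts as absent
    )


print(split_authority('example.com:8042'))
print(split_authority('user@[::1]:80'))
```
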

uritools.uriunsplit(parts)

Combine the elements of a five-item iterable into a URI string.
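
RFC 3986 section 5.3 describes how components are recomposed into a URI string. The sketch below follows that description, assuming that absent components are passed as None. The name recompose is hypothetical; it illustrates the idea and is not the code of uriunsplit() itself.

```python
def recompose(parts):
    """Join (scheme, authority, path, query, fragment) into a URI string."""
    scheme, authority, path, query, fragment = parts
    result = ''
    if scheme is not None:
        result += scheme + ':'
    if authority is not None:
        result += '//' + authority
    result += path  # the path is always present, but may be empty
    if query is not None:
        result += '?' + query
    if fragment is not None:
        result += '#' + fragment
    return result


print(recompose(('foo', 'example.com:8042', '/over/there',
                 'name=swallow&type=African', 'beak')))
```

Testing against None rather than truthiness keeps an empty query distinct from a missing one, so that 'http://example.com/?' survives a round trip.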

uritools.urijoin(base, ref, strict=False)

Convert a URI reference relative to a base URI to its target URI string.
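
To show what reference resolution produces, the sketch below runs a few of the "normal examples" from RFC 3986 section 5.4.1. It deliberately uses the standard library's urllib.parse.urljoin as a stand-in, on the assumption of a recent Python 3 where these simple cases agree with the RFC; uritools.urijoin is not called here.

```python
from urllib.parse import urljoin

# Base URI and expected targets taken from RFC 3986, section 5.4.1.
BASE = 'http://a/b/c/d;p?q'
EXAMPLES = {
    'g': 'http://a/b/c/g',
    './g': 'http://a/b/c/g',
    'g/': 'http://a/b/c/g/',
    '/g': 'http://a/g',
    '?y': 'http://a/b/c/d;p?y',
    '#s': 'http://a/b/c/d;p?q#s',
    '../g': 'http://a/b/g',
    '../../g': 'http://a/g',
}

# Map each reference to the target computed by the stand-in resolver.
results = {ref: urljoin(BASE, ref) for ref in EXAMPLES}
for ref, target in results.items():
    print(repr(ref), '->', target)
```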

uritools.uridefrag(string)

Remove an existing fragment component from a URI string.

The return value is an instance of a subclass of collections.namedtuple with the following read-only attributes:

Attribute   Index   Value
base        0       The absolute URI or relative URI reference without the fragment identifier
fragment    1       The fragment identifier, or None if not present
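
The split itself can be sketched with str.partition, assuming the fragment starts at the first '#' as in RFC 3986. The name defrag is hypothetical and the plain tuple stands in for DefragResult; this is not the module's implementation.

```python
def defrag(uri):
    """Return (base, fragment); fragment is None when no '#' is present."""
    base, sep, fragment = uri.partition('#')
    # An empty fragment ('...#') is kept as '', a missing one becomes None.
    return base, (fragment if sep else None)


print(defrag('http://pythonhosted.org/uritools/index.html#constants'))
print(defrag('http://example.com/'))
```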

Encoding Functions

uritools.uriencode(string, safe='', encoding='utf-8')

Encode string using the codec registered for encoding, replacing any characters not in UNRESERVED or safe with their corresponding percent-encodings.

This function can be used as a Unicode-aware replacement for urllib.quote(). Compared to urllib.quote(), this function never encodes the tilde character (~), which is an unreserved character in RFC 3986, and encodes slash characters by default.

Note that this function should not be confused with urllib.urlencode(), which does something completely different.
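
The encoding rule described above can be sketched in a few lines: encode the string to bytes, keep every byte that belongs to UNRESERVED or safe, and percent-encode the rest. The name percent_encode is hypothetical, and the sketch assumes that safe contains only ASCII characters; it is not the module's implementation.

```python
UNRESERVED = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              'abcdefghijklmnopqrstuvwxyz'
              '0123456789_.-~')


def percent_encode(string, safe='', encoding='utf-8'):
    """Percent-encode every byte that is not unreserved or listed in safe."""
    keep = frozenset((UNRESERVED + safe).encode('ascii'))
    return ''.join(
        chr(byte) if byte in keep else '%{:02X}'.format(byte)
        for byte in string.encode(encoding)
    )


# The tilde is left alone, the slash is encoded unless declared safe.
print(percent_encode('/~guido/a b'))
print(percent_encode('/~guido/a b', safe='/'))
```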

uritools.uridecode(string, encoding='utf-8')

Replace any percent-encodings in string, and decode the resulting string using the codec registered for encoding.

This function can be used as a Unicode-aware replacement for urllib.unquote().
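
A minimal sketch of the same two steps on Python 3, using urllib.parse.unquote_to_bytes from the standard library to replace the percent-encodings and then decoding the bytes explicitly. The name percent_decode is hypothetical; the sketch illustrates the behaviour described above and is not the module's implementation.

```python
from urllib.parse import unquote_to_bytes


def percent_decode(string, encoding='utf-8'):
    """Replace percent-encodings, then decode the resulting bytes."""
    return unquote_to_bytes(string).decode(encoding)


print(percent_decode('%C3%B6lk%C3%BCrbis'))
print(percent_decode('%F6l', encoding='latin-1'))
```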

Additional Functions

uritools.urinormpath(path)

Remove ‘.’ and ‘..’ path segments from a URI path.
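
RFC 3986 section 5.2.4 specifies a "remove_dot_segments" routine for this task. The sketch below is a direct transcription of that routine's steps A to E; whether urinormpath() matches it in every edge case is an assumption, not something stated here.

```python
def remove_dot_segments(path):
    """Remove '.' and '..' segments as in RFC 3986, section 5.2.4."""
    output = []
    while path:
        if path.startswith('../'):        # step A
            path = path[3:]
        elif path.startswith('./'):       # step A
            path = path[2:]
        elif path.startswith('/./'):      # step B
            path = path[2:]
        elif path == '/.':                # step B
            path = '/'
        elif path.startswith('/../'):     # step C: drop the last output segment
            path = path[3:]
            if output:
                output.pop()
        elif path == '/..':               # step C
            path = '/'
            if output:
                output.pop()
        elif path in ('.', '..'):         # step D
            path = ''
        else:                             # step E: move one segment to output
            start = 1 if path.startswith('/') else 0
            end = path.find('/', start)
            if end < 0:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return ''.join(output)


print(remove_dot_segments('/a/b/c/./../../g'))
print(remove_dot_segments('mid/content=5/../6'))
```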

uritools.uricompose(scheme=None, authority=None, path='', query=None, fragment=None, delim='&', querysep='=', encoding='utf-8')

Compose a URI string from its components.

If query is a mapping object or a sequence of two-element tuples, it will be converted to a string of key=value pairs separated by delim.
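
The query conversion can be pictured roughly as follows. The sketch assumes that keys and values are percent-encoded individually before being joined, and it uses urllib.parse.quote for that step; the name compose_query is hypothetical and the exact set of characters that uricompose() leaves unencoded may differ.

```python
from collections.abc import Mapping
from urllib.parse import quote


def compose_query(query, delim='&', querysep='='):
    """Build a query string from a mapping or a sequence of pairs."""
    items = query.items() if isinstance(query, Mapping) else query
    return delim.join(
        quote(str(key), safe='') + querysep + quote(str(value), safe='')
        for key, value in items
    )


print(compose_query({'name': 'swallow', 'type': 'African'}))
print(compose_query([('q', 'a b')], delim=';'))
```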

Constants

uritools.RE

Regular expression for splitting a well-formed URI into its components, as specified in RFC 3986 Appendix B.

The matched URI components are available by group index and via symbolic group names:

>>> uri = 'foo://example.com:8042/over/there?name=ferret#nose'
>>> uritools.RE.match(uri).groups()
('foo', 'example.com:8042', '/over/there', 'name=ferret', 'nose')
>>> uritools.RE.match(uri).groupdict()
{'fragment': 'nose', 'path': '/over/there', 'scheme': 'foo',
 'authority': 'example.com:8042', 'query': 'name=ferret'}
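
For readers without the module at hand, the expression from RFC 3986 Appendix B can be rewritten with named groups using only the standard library. The pattern below is such a rewrite under the name URI_RE, which is hypothetical; it is not a copy of uritools.RE, although it yields the same groups for the example above.

```python
import re

# RFC 3986 Appendix B, rewritten with named groups and no helper groups.
URI_RE = re.compile(r"""
    (?:(?P<scheme>[^:/?#]+):)?      # scheme, up to the first ':'
    (?://(?P<authority>[^/?#]*))?   # authority, introduced by '//'
    (?P<path>[^?#]*)                # path, always matches (possibly empty)
    (?:\?(?P<query>[^#]*))?         # query, introduced by '?'
    (?:\#(?P<fragment>.*))?         # fragment, introduced by '#'
""", re.VERBOSE | re.DOTALL)

uri = 'foo://example.com:8042/over/there?name=ferret#nose'
print(URI_RE.match(uri).groups())

# Components that are absent come back as None, empty ones as ''.
print(URI_RE.match('mailto:a@b.c').groupdict())
print(URI_RE.match('http://example.com/?').group('query') == '')
```
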

uritools.UNRESERVED = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-~'

Unreserved characters specified in RFC 3986.

uritools.RESERVED = ":/?#[]@!$&'()*+,;="

Reserved characters specified in RFC 3986.

uritools.GEN_DELIMS = ':/?#[]@'

General delimiting characters specified in RFC 3986.

uritools.SUB_DELIMS = "!$&'()*+,;="

Subcomponent delimiting characters specified in RFC 3986.
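
The four character sets are related in a simple way: the reserved set is the general delimiters followed by the subcomponent delimiters, and it shares no character with the unreserved set. The sketch below restates the values listed above as local variables and checks those relations; the classify helper is hypothetical and only for illustration.

```python
UNRESERVED = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              'abcdefghijklmnopqrstuvwxyz'
              '0123456789_.-~')
GEN_DELIMS = ':/?#[]@'
SUB_DELIMS = "!$&'()*+,;="
RESERVED = GEN_DELIMS + SUB_DELIMS


def classify(char):
    """Name the RFC 3986 character class of a single character."""
    if char in UNRESERVED:
        return 'unreserved'
    if char in GEN_DELIMS:
        return 'gen-delim'
    if char in SUB_DELIMS:
        return 'sub-delim'
    return 'other'  # must be percent-encoded inside a URI


print(RESERVED == ":/?#[]@!$&'()*+,;=")
print([classify(c) for c in '~/= '])
```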

Results of urisplit() and uridefrag()

The result objects from the urisplit() and uridefrag() functions are instances of subclasses of collections.namedtuple. These objects contain the attributes described in the function documentation, as well as some additional convenience methods:

class uritools.SplitResult

Class to hold urisplit() results.

getaddrinfo(family=0, type=0, proto=0, flags=0)

Translate the host and port subcomponents of the URI authority into a sequence of 5-tuples as reported by socket.getaddrinfo().

If the URI authority does not contain a port subcomponent, the URI scheme is interpreted as a service name. The optional family, type, proto and flags arguments are passed to socket.getaddrinfo() as-is.

getauthority(default=None, encoding='utf-8')

Return the decoded URI authority, or default if the original URI did not contain an authority component.

getfragment(default=None, encoding='utf-8')

Return the decoded fragment identifier, or default if the original URI did not contain a fragment component.

gethost(default=None, encoding='utf-8')

Return the decoded host subcomponent of the URI authority, or default if the original URI did not contain a host.

If the host represents an internationalized domain name intended for resolution via DNS, the 'idna' encoding must be specified to return a Unicode domain name.
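
The difference between the two encodings can be seen with Python's built-in codecs alone. The sketch below decodes the host from the introductory example once as UTF-8 and once with the 'idna' codec; it does not call gethost() and only illustrates why the encoding argument matters.

```python
raw = b'www.xn--lkrbis-vxa4c.at'

# Decoding as UTF-8 leaves the ASCII-compatible ("xn--") form untouched.
as_utf8 = raw.decode('utf-8')

# Decoding with the 'idna' codec yields the Unicode domain name.
as_idna = raw.decode('idna')

print(as_utf8)
print(as_idna)
```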

getpath(encoding='utf-8')

Return the decoded URI path.

getport(default=None)

Return the port subcomponent of the URI authority as an int, or default if the original URI did not contain a port.

getquery(default=None, encoding='utf-8')

Return the decoded query string, or default if the original URI did not contain a query component.

getquerydict(delims=';&', sep='=', encoding='utf-8')

Split the query string into individual components using the delimiter characters in delims, and return a dictionary of query parameters.

The dictionary keys are the unique decoded query parameter names, and the values are lists of decoded values for each name. Parameter names and values are separated by sep.

getquerylist(delims=';&', sep='=', encoding='utf-8')

Split the query string into individual components using the delimiter characters in delims.

If sep is not empty, split each component at the first occurrence of sep and return a list of decoded (name, value) pairs. If sep is not found, value becomes None.

If sep is None or empty, return the list of decoded query components.
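
The splitting rules of getquerylist() and getquerydict() can be sketched with the standard library as follows. The names query_list and query_dict are hypothetical, urllib.parse.unquote stands in for the decoding step, and dropping empty components (as in 'a=1&&b=2') is an assumption of this sketch rather than documented behaviour.

```python
import re
from urllib.parse import unquote


def query_list(query, delims=';&', sep='='):
    """Split a query string into decoded components or (name, value) pairs."""
    pattern = '[' + re.escape(delims) + ']'
    parts = [part for part in re.split(pattern, query) if part]
    if not sep:
        return [unquote(part) for part in parts]
    pairs = []
    for part in parts:
        # Split at the first occurrence of sep only.
        name, found, value = part.partition(sep)
        pairs.append((unquote(name), unquote(value) if found else None))
    return pairs


def query_dict(query, delims=';&', sep='='):
    """Group the decoded values by parameter name."""
    result = {}
    for name, value in query_list(query, delims, sep):
        result.setdefault(name, []).append(value)
    return result


print(query_list('name=swallow&type=African;flag'))
print(query_dict('a=1&a=2;b=3'))
```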

getscheme(default=None)

Return the URI scheme in canonical (lowercase) form, or default if the original URI did not contain a scheme component. Raise a ValueError if the scheme is not well-formed.
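
RFC 3986 section 3.1 defines a well-formed scheme as a letter followed by any number of letters, digits, '+', '-' or '.'. A minimal sketch of the check and the lowercasing, under the hypothetical name canonical_scheme, might look like this; the exact validation performed by getscheme() may differ.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, 3.1)
_SCHEME = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*')


def canonical_scheme(scheme, default=None):
    """Return the lowercase scheme, or default if scheme is None."""
    if scheme is None:
        return default
    if not _SCHEME.fullmatch(scheme):
        raise ValueError('Invalid scheme: %r' % scheme)
    return scheme.lower()


print(canonical_scheme('HTTP'))
print(canonical_scheme(None, default='file'))
```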

geturi()

Return the re-combined version of the original URI as a string.

getuserinfo(default=None, encoding='utf-8')

Return the decoded userinfo subcomponent of the URI authority, or default if the original URI did not contain a userinfo field.

transform(ref, strict=False)

Convert a URI reference relative to self into a SplitResult representing its target.

class uritools.DefragResult

Class to hold uridefrag() results.

getbase(encoding='utf-8')

Return the decoded absolute URI or relative URI reference without the fragment.

getfragment(default=None, encoding='utf-8')

Return the decoded fragment identifier, or default if the original URI did not contain a fragment component.

geturi()

Return the re-combined version of the original URI as a string.