python-webencodings¶

This is a Python implementation of the WHATWG Encoding standard.

Latest documentation: http://packages.python.org/webencodings/
Source code and issue tracker: https://github.com/gsnedders/python-webencodings
PyPI releases: http://pypi.python.org/pypi/webencodings
License: BSD
Python 2.6+ and 3.3+

In order to be compatible with legacy web content when interpreting something like Content-Type: text/html; charset=latin1, tools need to use a particular set of aliases for encoding labels as well as some overriding rules. For example, US-ASCII and iso-8859-1 on the web are actually aliases for windows-1252, and a UTF-8 or UTF-16 BOM takes precedence over any other encoding declaration. The Encoding standard defines all such details so that implementations do not have to reverse-engineer each other.

This module provides the encoding labels and BOM detection; the encoders and decoders themselves are Python’s.

Byte order marks¶

When decoding, for compatibility with deployed content, a byte order mark (also known as BOM) is considered more authoritative than anything else. The corresponding U+FEFF code point is not part of the decoded output.

Encoding never prepends a BOM, but the output can start with a BOM if the input starts with a U+FEFF code point. In that case encoding then decoding will not round-trip.
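
A minimal sketch of both behaviours, using the `decode` and `encode` functions documented in the API section. The sample byte strings are illustrative only.

```python
from webencodings import decode, encode

# A UTF-8 BOM overrides the fallback label and is dropped from the output.
text, encoding = decode(b'\xef\xbb\xbfabc', 'latin1')
print(repr(text), encoding.name)

# Encoding never adds a BOM on its own...
plain = encode(u'abc', 'utf-8')

# ...but a leading U+FEFF in the input encodes to the BOM bytes, which
# decode() then strips, so the round trip loses that character.
with_bom = encode(u'\ufeffabc', 'utf-8')
round_tripped, _ = decode(with_bom, 'utf-8')
print(repr(round_tripped))
```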

Error handling¶

As in the stdlib, error handling for encoding defaults to strict: raise an exception if there is an error.

For decoding, however, the default is replace, unlike the stdlib: invalid bytes are decoded as � (U+FFFD, the replacement character). The reason is that when showing legacy content to the user, it might be better to successfully decode only part of it than to blow up. This is of course not the case in all situations: sometimes you want decoding to fail loudly so you can detect errors early.
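
The two defaults can be compared side by side. This is a small sketch; the input values and the `strict_*_failed` flag names are made up for illustration.

```python
from webencodings import decode, encode

# A lone 0xE9 is not valid UTF-8; by default it becomes U+FFFD.
text, _ = decode(b'caf\xe9', 'utf-8')

# Opting in to strict decoding raises instead.
try:
    decode(b'caf\xe9', 'utf-8', errors='strict')
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# Encoding is strict by default: windows-1252 (label "latin1") has no
# mapping for this CJK character.
try:
    encode(u'\u4e2d', 'latin1')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```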

API¶

webencodings.lookup(label)[source]¶

Look for an encoding by its label. This is the spec’s get an encoding algorithm. Supported labels are listed there.

Parameters:	label – A string.
Returns:	An `Encoding` object, or `None` for an unknown label.
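
For instance, a few labels and what they resolve to. This is a sketch; `not-an-encoding` is simply a string assumed not to be a label in the standard.

```python
from webencodings import lookup

# Labels are matched ASCII case-insensitively; several legacy labels
# resolve to windows-1252.
names = {}
for label in ('utf-8', 'Utf-8', 'US-ASCII', 'iso-8859-1', 'latin1'):
    names[label] = lookup(label).name
    print(label, '->', names[label])

# An unknown label gives None rather than raising.
unknown = lookup('not-an-encoding')
```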

class webencodings.Encoding[source]¶

Represents a character encoding, such as UTF-8, that can be used for decoding or encoding.

name¶: Canonical name of the encoding

codec_info¶: The actual implementation of the encoding, a stdlib CodecInfo object. See codecs.register().

webencodings.UTF8 = <Encoding utf-8>¶: The UTF-8 encoding. Should be used for new content and formats.
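
A brief sketch of the two attributes in use. Calling `codec_info.encode` directly is shown only to illustrate that it is the ordinary stdlib object; the module-level `encode` function is usually more convenient.

```python
import codecs
from webencodings import UTF8, lookup

encoding = lookup('utf8')
print(encoding.name)  # the canonical name, not the label that was passed in

# codec_info is a stdlib CodecInfo, so its functions can be called directly.
is_codec_info = isinstance(encoding.codec_info, codecs.CodecInfo)
data, consumed = encoding.codec_info.encode(u'h\xe9')
```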

webencodings.decode(input, fallback_encoding, errors=u'replace')[source]¶

Decode a single string.

Parameters:	input – A byte string.
	fallback_encoding – An `Encoding` object or a label string. The encoding to use if `input` does not have a BOM.
	errors – Type of error handling. See `codecs.register()`.
Raises:	`LookupError` for an unknown encoding label.
Returns:	An `(output, encoding)` tuple of a Unicode string and an `Encoding`.
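
A short usage sketch, showing both forms of `fallback_encoding` and the `LookupError` case. The inputs are illustrative and contain no BOM, so the fallback is what gets used.

```python
from webencodings import decode, lookup

# The fallback can be a label string...
text, encoding = decode(b'\x80', 'latin1')  # 0x80 is the euro sign in windows-1252
# ...or an Encoding object.
text2, encoding2 = decode(b'\xc3\xa9', lookup('utf-8'))

# An unknown label fails early with LookupError.
try:
    decode(b'abc', 'not-an-encoding')
    unknown_label_raised = False
except LookupError:
    unknown_label_raised = True
```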

webencodings.encode(input, encoding=<Encoding utf-8>, errors=u'strict')[source]¶

Encode a single string.

Parameters:	input – A Unicode string.
	encoding – An `Encoding` object or a label string.
	errors – Type of error handling. See `codecs.register()`.
Raises:	`LookupError` for an unknown encoding label.
Returns:	A byte string.
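
For example, encoding the same one-character string three ways (a sketch with illustrative variable names):

```python
from webencodings import encode

utf8_bytes = encode(u'\xe9')               # the default encoding is UTF-8
latin1_bytes = encode(u'\xe9', 'latin1')   # "latin1" means windows-1252
utf16_bytes = encode(u'\xe9', 'utf-16le')  # no BOM is prepended
```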

webencodings.iter_decode(input, fallback_encoding, errors=u'replace')[source]¶

“Pull”-based decoder.

Parameters:	input – An iterable of byte strings. The input is first consumed just enough to determine the encoding based on the presence of a BOM, then consumed on demand when the return value is.
	fallback_encoding – An `Encoding` object or a label string. The encoding to use if `input` does not have a BOM.
	errors – Type of error handling. See `codecs.register()`.
Raises:	`LookupError` for an unknown encoding label.
Returns:	An `(output, encoding)` tuple. `output` is an iterable of Unicode strings, `encoding` is the `Encoding` that is being used.
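
A sketch of pulling decoded text from a chunked source. The `chunks` generator is a hypothetical stand-in for something like network reads, with the BOM deliberately split across chunks.

```python
from webencodings import iter_decode

def chunks():
    # Hypothetical chunked input; chunk boundaries can fall anywhere,
    # including inside the BOM or inside a multi-byte character.
    yield b'\xef\xbb'
    yield b'\xbfa'
    yield b'\xc3'
    yield b'\xa9'

# The encoding is known as soon as iter_decode() returns.
output, encoding = iter_decode(chunks(), 'latin1')
text = u''.join(output)
```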

webencodings.iter_encode(input, encoding=<Encoding utf-8>, errors=u'strict')[source]¶

“Pull”-based encoder.

Parameters:	input – An iterable of Unicode strings.
	encoding – An `Encoding` object or a label string.
	errors – Type of error handling. See `codecs.register()`.
Raises:	`LookupError` for an unknown encoding label.
Returns:	An iterable of byte strings.
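
A minimal sketch, joining the encoded chunks back into one byte string. The `pieces` list is illustrative.

```python
from webencodings import iter_encode

pieces = [u'caf', u'\xe9', u' ', u'\u20ac']
# Each Unicode chunk is encoded lazily as the result is iterated.
data = b''.join(iter_encode(pieces, 'latin1'))
```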

class webencodings.IncrementalDecoder(fallback_encoding, errors=u'replace')[source]¶

“Push”-based decoder.

Parameters:	fallback_encoding – An `Encoding` object or a label string. The encoding to use if `input` does not have a BOM.
	errors – Type of error handling. See `codecs.register()`.
Raises:	`LookupError` for an unknown encoding label.

decode(input, final=False)[source]¶

Decode one chunk of the input.

Parameters:	input – A byte string.
	final – Indicate that no more input is available. Must be `True` if this is the last call.
Returns:	A Unicode string.

encoding = None¶: The actual Encoding that is being used, or None if that is not determined yet (i.e. if there is not enough input yet to determine whether there is a BOM).
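
A sketch of pushing chunks into the decoder and watching `encoding` go from `None` to a concrete value. The byte values are chosen so that a UTF-16LE BOM is split across two calls.

```python
from webencodings import IncrementalDecoder

decoder = IncrementalDecoder('latin1')

first = decoder.decode(b'\xff')           # too short to rule a BOM in or out
undetermined = decoder.encoding           # still None at this point

second = decoder.decode(b'\xfe\xe9\x00')  # completes a UTF-16LE BOM
detected = decoder.encoding.name

last = decoder.decode(b'', final=True)    # flush at end of input
```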

class webencodings.IncrementalEncoder(encoding=<Encoding utf-8>, errors=u'strict')[source]¶

“Push”-based encoder.

Parameters:	encoding – An `Encoding` object or a label string.
	errors – Type of error handling. See `codecs.register()`.
Raises:	`LookupError` for an unknown encoding label.

encode(input, final=False)¶

Parameters:	input – A Unicode string.
	final – Indicate that no more input is available. Must be `True` if this is the last call.
Returns:	A byte string.
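
The push-based encoder follows the same pattern. A minimal sketch, with illustrative input:

```python
from webencodings import IncrementalEncoder

encoder = IncrementalEncoder('utf-16be')
out = b''
for piece in (u'a', u'\xe9'):
    out += encoder.encode(piece)
# Signal the end of input on the last call.
out += encoder.encode(u'', final=True)
```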

webencodings.ascii_lower(string)[source]¶

Transform (only) ASCII letters to lower case: A-Z is mapped to a-z.

Parameters:	string – A Unicode string.
Returns:	A new Unicode string.

This is used for ASCII case-insensitive matching of encoding labels. The same matching is also used, among other things, for CSS keywords.

This is different from the lower() method of Unicode strings, which also affects non-ASCII characters, sometimes mapping them into the ASCII range:

>>> from webencodings import ascii_lower
>>> keyword = u'Bac\N{KELVIN SIGN}ground'
>>> assert keyword.lower() == u'background'
>>> assert ascii_lower(keyword) != keyword.lower()
>>> assert ascii_lower(keyword) == u'bac\N{KELVIN SIGN}ground'
