python-webencodings

This is a Python implementation of the WHATWG Encoding standard.

To be compatible with legacy web content when interpreting something like Content-Type: text/html; charset=latin1, tools need to use a particular set of aliases for encoding labels as well as some overriding rules. For example, US-ASCII and iso-8859-1 on the web are actually aliases for windows-1252, and a UTF-8 or UTF-16 BOM takes precedence over any other encoding declaration. The Encoding standard defines all such details so that implementations do not have to reverse-engineer each other.

This module has encoding labels and BOM detection, but the actual implementation for encoders and decoders is Python’s.

Byte order marks

When decoding, for compatibility with deployed content, a byte order mark (also known as BOM) is considered more authoritative than anything else. The corresponding U+FEFF code point is not part of the decoded output.
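The precedence rule can be sketched with the standard library alone. The helper below, sniff_bom, is a hypothetical name used for illustration and is not this module's code; it only shows the idea of checking for a BOM before falling back to a declared encoding.

```python
import codecs

# Hypothetical table for illustration: BOM bytes and the stdlib codec they imply.
_BOMS = (
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)

def sniff_bom(data, fallback):
    """Return (codec_name, payload); a BOM wins over the fallback."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Drop the BOM so that U+FEFF does not appear in the output.
            return name, data[len(bom):]
    return fallback, data

# The declared encoding is cp1252, but the UTF-8 BOM takes precedence.
name, payload = sniff_bom(b'\xef\xbb\xbfcaf\xc3\xa9', 'cp1252')
text = payload.decode(name)
```

Here name is 'utf-8' and text is 'caf\xe9', even though the caller asked for cp1252.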

Encoding never prepends a BOM, but the output can start with a BOM if the input starts with a U+FEFF code point. In that case encoding then decoding will not round-trip.
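A small sketch of the round-trip caveat, assuming the webencodings package is installed and Python 3 string literals; the sample text is illustrative.

```python
import webencodings

plain = webencodings.encode('hello', 'utf-8')   # no BOM is prepended

text = '\ufeffhello'                            # starts with a U+FEFF code point
data = webencodings.encode(text, 'utf-8')       # U+FEFF encodes to the BOM bytes

# decode() treats those leading bytes as a BOM and drops them.
decoded, encoding = webencodings.decode(data, 'utf-8')
```

After this, decoded is 'hello' rather than the original text, so the leading U+FEFF has been lost.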

Error handling

As in the stdlib, error handling for encoding defaults to strict: raise an exception if there is an error.

For decoding, however, the default is replace, unlike the stdlib. Invalid bytes are decoded as U+FFFD, the replacement character. The reason is that when showing legacy content to the user, it might be better to succeed in decoding only part of it than to blow up. This is of course not the case in all situations: sometimes you want things to blow up so that you can detect errors early.
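The two defaults can be seen side by side in the following sketch, which assumes the webencodings package is installed; the input bytes and the variable names are illustrative.

```python
import webencodings

bad = b'caf\xff'  # 0xFF is never valid in UTF-8

# Decoding defaults to errors='replace'.
lenient, _ = webencodings.decode(bad, 'utf-8')

# Passing errors='strict' makes decoding raise instead.
try:
    webencodings.decode(bad, 'utf-8', errors='strict')
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# Encoding defaults to errors='strict'.
try:
    webencodings.encode('\u65e5', 'latin1')  # not representable in windows-1252
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

Choosing errors='strict' for decoding is the way to get the fail-early behavior mentioned above.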

API

webencodings.lookup(label)

Look for an encoding by its label. This is the spec’s get an encoding algorithm. Supported labels are listed there.

Parameters:
  • label – A string.
Returns:

An Encoding object, or None for an unknown label.
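A brief usage sketch, assuming the webencodings package is installed. The labels shown are examples; the full list is in the Encoding standard.

```python
import webencodings

# Labels are matched ASCII case-insensitively, ignoring surrounding ASCII whitespace.
utf8 = webencodings.lookup(' UTF8\n')
latin1 = webencodings.lookup('latin1')
us_ascii = webencodings.lookup('US-ASCII')

# A label that the standard does not list yields None, even if Python knows it.
unknown = webencodings.lookup('latin-1')
```

Both latin1 and us_ascii resolve to the encoding whose name is 'windows-1252', as described in the introduction.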

class webencodings.Encoding

Represents a character encoding such as UTF-8, that can be used for decoding or encoding.

name

Canonical name of the encoding

codec_info

The actual implementation of the encoding, a stdlib CodecInfo object. See codecs.register().

webencodings.UTF8 = <Encoding utf-8>

The UTF-8 encoding. Should be used for new content and formats.
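The attributes above can be inspected as in this sketch, which assumes the webencodings package is installed; the variable names are illustrative.

```python
import codecs
import webencodings

encoding = webencodings.lookup('iso-8859-1')

canonical = encoding.name        # canonical name from the standard
info = encoding.codec_info       # stdlib codec that does the real work

# CodecInfo functions return (output, length_consumed) pairs.
euro, _ = info.decode(b'\x80')   # 0x80 is the euro sign in windows-1252

utf8_name = webencodings.UTF8.name
```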

webencodings.decode(input, fallback_encoding, errors=u'replace')

Decode a single string.

Parameters:
  • input – A byte string.
  • fallback_encoding – An Encoding object or a label string. The encoding to use if input does not have a BOM.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

An (output, encoding) tuple of a Unicode string and an Encoding.
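A usage sketch covering the fallback, the BOM override, and the unknown-label case, assuming the webencodings package is installed; the byte strings are illustrative.

```python
import webencodings

# No BOM: the fallback encoding is used (latin1 is a label for windows-1252).
text, encoding = webencodings.decode(b'caf\xe9', 'latin1')

# UTF-8 BOM: the fallback is ignored and the BOM is left out of the output.
bom_text, bom_encoding = webencodings.decode(b'\xef\xbb\xbfcaf\xc3\xa9', 'latin1')

# An unknown label raises LookupError.
try:
    webencodings.decode(b'', 'no-such-label')
    raised = False
except LookupError:
    raised = True
```

The returned Encoding tells the caller which encoding was actually used, which may differ from the fallback.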

webencodings.encode(input, encoding=<Encoding utf-8>, errors=u'strict')

Encode a single string.

Parameters:
  • input – A Unicode string.
  • encoding – An Encoding object or a label string.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

A byte string.
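A short sketch of the three ways to name the target encoding, assuming the webencodings package is installed.

```python
import webencodings

default = webencodings.encode('caf\xe9')            # UTF-8 when no encoding is given
legacy = webencodings.encode('caf\xe9', 'latin1')   # a label string

# An Encoding object from lookup() works as well.
same = webencodings.encode('caf\xe9', webencodings.lookup('latin1'))
```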

webencodings.iter_decode(input, fallback_encoding, errors=u'replace')

“Pull”-based decoder.

Parameters:
  • input

    An iterable of byte strings.

    The input is first consumed just enough to determine the encoding based on the presence of a BOM, then consumed on demand when the return value is.

  • fallback_encoding – An Encoding object or a label string. The encoding to use if input does not have a BOM.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

An (output, encoding) tuple. output is an iterable of Unicode strings, encoding is the Encoding that is being used.
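The sketch below feeds the decoder from a generator, assuming the webencodings package is installed. The chunks() function is a hypothetical stand-in for network or file reads.

```python
import webencodings

def chunks():
    # Hypothetical data source; the BOM is split across the first two chunks.
    yield b'\xef\xbb'
    yield b'\xbfcaf'
    yield b'\xc3'
    yield b'\xa9'

output, encoding = webencodings.iter_decode(chunks(), 'latin1')
name = encoding.name       # available as soon as iter_decode() returns
text = ''.join(output)     # the remaining chunks are consumed here
```

A multi-byte sequence split across chunks, such as the last two above, is still decoded as one character.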

webencodings.iter_encode(input, encoding=<Encoding utf-8>, errors=u'strict')

“Pull”-based encoder.

Parameters:
  • input – An iterable of Unicode strings.
  • encoding – An Encoding object or a label string.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

An iterable of byte strings.
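A minimal usage sketch, assuming the webencodings package is installed; the input list is illustrative.

```python
import webencodings

pieces = ['caf', '\xe9']

# iter_encode() returns an iterable; joining it gives the full byte string.
encoded = b''.join(webencodings.iter_encode(pieces, 'latin1'))
```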

class webencodings.IncrementalDecoder(fallback_encoding, errors=u'replace')

“Push”-based decoder.

Parameters:
  • fallback_encoding – An Encoding object or a label string. The encoding to use if input does not have a BOM.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

decode(input, final=False)

Decode one chunk of the input.

Parameters:
  • input – A byte string.
  • final – Indicate that no more input is available. Must be True if this is the last call.
Returns:

A Unicode string.

encoding = None

The actual Encoding that is being used, or None if that is not determined yet. (That is, if there is not enough input yet to determine whether there is a BOM.)
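The following sketch pushes chunks into a decoder and watches the encoding attribute, assuming the webencodings package is installed; the chunk boundaries are chosen for illustration.

```python
import webencodings

decoder = webencodings.IncrementalDecoder('latin1')

first = decoder.decode(b'\xef\xbb')     # too short to rule a BOM in or out
undetermined = decoder.encoding         # still None at this point

second = decoder.decode(b'\xbfcaf')     # BOM complete: UTF-8 is selected
third = decoder.decode(b'\xc3\xa9', final=True)
```

Output is held back until the decoder has seen enough bytes to decide, so first is an empty string.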

class webencodings.IncrementalEncoder(encoding=<Encoding utf-8>, errors=u'strict')

“Push”-based encoder.

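Parameters:
  • encoding – An Encoding object or a label string.
  • errors – Type of error handling. See codecs.register().
Raises: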

LookupError for an unknown encoding label.

encode(input, final=False)

Encode one chunk of the input.

Parameters:
  • input – A Unicode string.
  • final – Indicate that no more input is available. Must be True if this is the last call.
Returns:

A byte string.
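A minimal push-based sketch, assuming the webencodings package is installed; the chunks are illustrative.

```python
import webencodings

encoder = webencodings.IncrementalEncoder('latin1')

head = encoder.encode('caf')
tail = encoder.encode('\xe9', final=True)   # final=True on the last call
```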

webencodings.ascii_lower(string)

Transform (only) ASCII letters to lower case: A-Z is mapped to a-z.

Parameters:
  • string – A Unicode string.
Returns:

A new Unicode string.

This is used for ASCII case-insensitive matching of encoding labels. The same matching is also used, among other things, for CSS keywords.

This is different from the lower() method of Unicode strings, which also affects non-ASCII characters, sometimes mapping them into the ASCII range:

>>> from webencodings import ascii_lower
>>> keyword = u'Bac\N{KELVIN SIGN}ground'
>>> assert keyword.lower() == u'background'
>>> assert ascii_lower(keyword) != keyword.lower()
>>> assert ascii_lower(keyword) == u'bac\N{KELVIN SIGN}ground'