python-webencodings

This is a Python implementation of the WHATWG Encoding standard.

To be compatible with legacy web content when interpreting something like Content-Type: text/html; charset=latin1, tools need to use a particular set of aliases for encoding labels as well as some overriding rules. For example, US-ASCII and iso-8859-1 on the web are actually aliases for windows-1252, and a UTF-8 or UTF-16 BOM takes precedence over any other encoding declaration. The Encoding standard defines all such details so that implementations do not have to reverse-engineer each other.

This module implements the encoding labels and BOM detection; the encoders and decoders themselves are Python’s.

Byte order marks

When decoding, for compatibility with deployed content, a byte order mark (also known as BOM) is considered more authoritative than anything else. The corresponding U+FEFF code point is not part of the decoded output.

Encoding never prepends a BOM, but the output can start with a BOM if the input starts with a U+FEFF code point. In that case, encoding then decoding will not round-trip.
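The round-trip caveat can be illustrated with the standard library alone. The sketch below uses Python’s utf-8-sig codec as a stand-in for a BOM-aware decoder; that substitution is an assumption made for illustration and is not how this module is implemented.

```python
import codecs

# Text that happens to start with U+FEFF, the BOM code point.
text = "\ufeffhello"

# Plain UTF-8 encoding prepends nothing, but the output still begins
# with the UTF-8 BOM bytes because the input began with U+FEFF.
data = text.encode("utf-8")
starts_with_bom = data.startswith(codecs.BOM_UTF8)

# A BOM-aware decoder (here the stdlib "utf-8-sig" codec, used as a
# stand-in) treats those bytes as a BOM and drops them.
round_tripped = data.decode("utf-8-sig")

print(starts_with_bom)        # True
print(round_tripped == text)  # False: the leading U+FEFF was consumed
```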

Error handling

As in the stdlib, error handling for encoding defaults to strict: raise an exception if there is an error.

For decoding however the default is replace, unlike the stdlib. Invalid bytes are decoded as (U+FFFD, the replacement character). The reason is that when showing legacy content to the user, it might be better to succeed decoding only part of it rather than blow up. This is of course not the case is all situations: sometimes you want stuff to blow up so you can detect errors early.
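The difference between the two error modes can be seen with plain stdlib calls. This is only a sketch of what strict and replace mean; it passes the mode explicitly and does not call this module.

```python
# Invalid UTF-8: the byte 0xFF never appears in a well-formed sequence.
raw = b"caf\xff"

# "replace" (the decoding default described above) substitutes U+FFFD.
lenient = raw.decode("utf-8", errors="replace")

# "strict" (the stdlib decoding default) raises instead.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Encoding with "strict": text the codec cannot represent raises.
try:
    "caf\u00e9".encode("ascii", errors="strict")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(repr(lenient))  # 'caf\ufffd'
```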

API

webencodings.lookup(label)

Look for an encoding by its label. This is the spec’s get an encoding algorithm. Supported labels are listed there.

Parameters:
  • label – A string.
Returns:

An Encoding object, or None for an unknown label.

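As a rough illustration of what label lookup involves, the sketch below normalizes a label (strip ASCII whitespace, lowercase ASCII letters) and consults a table. The name lookup_sketch and the heavily abbreviated LABELS table are hypothetical; the real table in the Encoding standard is much longer, and this is not the module’s own code.

```python
import string

# Hypothetical, abbreviated label table (label -> canonical name).
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "latin1": "windows-1252",
    "iso-8859-1": "windows-1252",
    "us-ascii": "windows-1252",
    "windows-1252": "windows-1252",
}

ASCII_WHITESPACE = "\t\n\f\r "
# Lowercase A-Z only; str.lower() would also fold non-ASCII characters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def lookup_sketch(label):
    """Return the canonical encoding name for a label, or None if unknown."""
    normalized = label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABELS.get(normalized)


print(lookup_sketch(" Latin1 "))   # windows-1252
print(lookup_sketch("US-ASCII"))   # windows-1252
print(lookup_sketch("no-such"))    # None
```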
class webencodings.Encoding
name

Canonical name of the encoding

codec_info

The actual implementation of the encoding, a stdlib CodecInfo object. See codecs.register().
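For readers unfamiliar with CodecInfo, the sketch below fetches one directly from the stdlib and uses its stateless encode and decode functions. That windows-1252 is backed by the stdlib codec looked up here is an assumption for illustration.

```python
import codecs

# The stdlib codec registered under the name "windows-1252".
info = codecs.lookup("windows-1252")

# CodecInfo exposes stateless functions; each returns a
# (result, length_consumed) pair.
decoded, consumed = info.decode(b"\x80")
encoded, _ = info.encode("\u20ac")

print(info.name)      # cp1252
print(repr(decoded))  # '€'
print(encoded)        # b'\x80'
```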

webencodings.UTF8 = <Encoding utf-8>

The UTF-8 encoding. Should be used for new content and formats.

webencodings.decode(input, fallback_encoding, errors=u'replace')

Decode a single string.

Parameters:
  • input – A byte string
  • fallback_encoding – An Encoding object or a label string. Ignored if input has a BOM.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

A Unicode string.
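The interaction between the BOM and fallback_encoding can be sketched with the stdlib as follows. The helper decode_sketch is hypothetical and takes a stdlib codec name rather than a web label; it only illustrates the precedence rule and is not the module’s implementation.

```python
import codecs

# BOMs checked when sniffing, mapped to stdlib codec names.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_sketch(data, fallback_codec, errors="replace"):
    """Decode bytes, letting a BOM override the fallback codec."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            # The BOM wins and is not part of the decoded output.
            return data[len(bom):].decode(codec, errors)
    return data.decode(fallback_codec, errors)


# The UTF-8 BOM overrides the cp1252 fallback.
print(decode_sketch(b"\xef\xbb\xbfcaf\xc3\xa9", "cp1252"))  # café
# Without a BOM, the fallback is used.
print(decode_sketch(b"caf\xe9", "cp1252"))                  # café
```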

webencodings.encode(input, encoding=<Encoding utf-8>, errors=u'strict')

Encode a single string.

Parameters:
  • input – A Unicode string.
  • encoding – An Encoding object or a label string.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

A byte string.

webencodings.iter_decode(input, fallback_encoding, errors=u'replace')

“Pull”-based decoder.

Parameters:
  • input – An iterable of byte strings.
  • fallback_encoding – An Encoding object or a label string. Ignored if input has a BOM.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

An iterable of Unicode strings.
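“Pull”-based means the consumer drives decoding by iterating. The stdlib’s codecs.iterdecode follows the same pattern and is used below as an analogy; unlike the function documented here, it does no BOM sniffing and takes a stdlib codec name.

```python
import codecs


def chunks():
    """Yield a UTF-8 byte stream whose last character is split in two."""
    yield b"caf"
    yield b"\xc3"  # first byte of U+00E9
    yield b"\xa9"  # second byte of U+00E9


# Nothing is decoded until the consumer iterates over `pieces`.
pieces = codecs.iterdecode(chunks(), "utf-8", errors="replace")
text = "".join(pieces)

print(text)  # café
```

The split character is decoded correctly because the decoder buffers the incomplete sequence between chunks instead of replacing it.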

webencodings.iter_encode(input, encoding=<Encoding utf-8>, errors=u'strict')

“Pull”-based encoder.

Parameters:
  • input – An iterable of Unicode strings.
  • encoding – An Encoding object or a label string.
  • errors – Type of error handling. See codecs.register().
Raises:

LookupError for an unknown encoding label.

Returns:

An iterable of byte strings.

webencodings.make_incremental_decoder(fallback_encoding, errors=u'replace')

“Push”-based decoder.

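Parameters:
  • fallback_encoding – An Encoding object or a label string. Ignored if input has a BOM.
  • errors – Type of error handling. See codecs.register().
Raises: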

LookupError for an unknown encoding label.

Returns:

An incremental decoder callable like this:

incremental_decoder(input, final=False)
Parameters:
  • input – A byte string.
  • final – Indicate that no more input is available. Must be True if this is the last call.
Returns:

A Unicode string.
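“Push”-based means the caller feeds chunks as they arrive. The sketch below builds a callable with the (input, final=False) shape described above on top of a stdlib incremental decoder. The name make_decoder_sketch is hypothetical, and BOM sniffing is left out to keep the example short.

```python
import codecs


def make_decoder_sketch(codec_name, errors="replace"):
    """Return a callable with the (input, final=False) shape."""
    decoder = codecs.getincrementaldecoder(codec_name)(errors)

    def incremental_decoder(data, final=False):
        return decoder.decode(data, final)

    return incremental_decoder


decode = make_decoder_sketch("utf-8")
first = decode(b"caf\xc3")           # incomplete sequence is buffered
second = decode(b"\xa9")             # completes U+00E9
last = decode(b"\xe2", final=True)   # truncated at end of input

print(repr(first), repr(second), repr(last))  # 'caf' 'é' '\ufffd'
```

Passing final=True on the last call matters: it is what lets the decoder report a truncated sequence instead of waiting for more bytes.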

webencodings.make_incremental_encoder(encoding=<Encoding utf-8>, errors=u'strict')

“Push”-based encoder.

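Parameters:
  • encoding – An Encoding object or a label string.
  • errors – Type of error handling. See codecs.register().
Raises: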

LookupError for an unknown encoding label.

Returns:

An incremental encoder callable like this:

incremental_encoder(input, final=False)
Parameters:
  • input – A Unicode string.
  • final – Indicate that no more input is available. Must be True if this is the last call.
Returns:

A byte string.
