This is a Python implementation of the WHATWG Encoding standard.
To be compatible with legacy web content when interpreting something like Content-Type: text/html; charset=latin1, tools need to use a particular set of aliases for encoding labels as well as some overriding rules. For example, US-ASCII and iso-8859-1 on the web are actually aliases for windows-1252, and a UTF-8 or UTF-16 BOM takes precedence over any other encoding declaration. The Encoding standard defines all such details so that implementations do not have to reverse-engineer each other.
This module implements encoding labels and BOM detection, but the encoders and decoders themselves are Python’s.
When decoding, for compatibility with deployed content, a byte order mark (also known as BOM) is considered more authoritative than anything else. The corresponding U+FEFF code point is not part of the decoded output.
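The BOM rule can be sketched with the standard library alone. The helper name `decode_with_bom`, its `fallback` parameter, and the `_BOMS` table below are assumptions made for illustration, not this module's API.

```python
import codecs

# Hypothetical table for illustration: BOM byte sequences and the
# stdlib codec each one selects.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_with_bom(data, fallback="windows-1252"):
    """Decode bytes, letting a BOM override the fallback encoding."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the BOM so that U+FEFF is not part of the output.
            return data[len(bom):].decode(name, "replace"), name
    # No BOM: the declared (fallback) encoding is used.
    return data.decode(fallback, "replace"), fallback


text, used = decode_with_bom(b"\xef\xbb\xbfhi", fallback="windows-1252")
print(repr(text), used)
```

Here the UTF-8 BOM wins even though the caller asked for windows-1252, which mirrors the precedence described above.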
Encoding never prepends a BOM, but the output can start with a BOM if the input starts with a U+FEFF code point. In that case, encoding then decoding will not round-trip.
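A minimal illustration of the failed round trip, using only stdlib codecs. The stdlib's `utf-8-sig` codec stands in here for any BOM-aware decoder; that substitution is an assumption of this sketch.

```python
import codecs

text = "\ufeffhello"         # input that itself starts with U+FEFF
data = text.encode("utf-8")  # the encoder adds nothing of its own

# The bytes are indistinguishable from "hello" preceded by a UTF-8 BOM.
starts_with_bom = data.startswith(codecs.BOM_UTF8)

# A BOM-aware decoder treats the leading bytes as a BOM and drops them.
round_tripped = data.decode("utf-8-sig")
print(starts_with_bom, repr(round_tripped))
```

The decoded result has lost the leading U+FEFF, so it no longer equals the original input.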
As in the stdlib, error handling for encoding defaults to strict: raise an exception if there is an error.
For decoding, however, the default is replace, unlike the stdlib. Invalid bytes are decoded as � (U+FFFD, the replacement character). The reason is that when showing legacy content to the user, it might be better to succeed in decoding only part of it rather than blow up. This is of course not the case in all situations: sometimes you want stuff to blow up so you can detect errors early.
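The difference between the two error handlers can be seen with plain stdlib calls. This sketch only shows what strict and replace do; it does not call this module.

```python
bad = b"caf\xe9"  # Latin-1 bytes, which are invalid as UTF-8

# strict: the first invalid byte raises an exception.
try:
    bad.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# replace: the invalid byte becomes U+FFFD and decoding carries on.
lenient = bad.decode("utf-8", errors="replace")
print(strict_failed, repr(lenient))
```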
Look for an encoding by its label. This is the spec’s “get an encoding” algorithm. Supported labels are listed there.
| Parameters: | label – A string. |
|---|---|
| Returns: | An Encoding object, or None for an unknown label. |
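A rough sketch of how label matching works, assuming a heavily abridged label table. `_LABELS` and `lookup_label` are hypothetical names for illustration, and the sketch returns a canonical name string rather than an Encoding object.

```python
# Hypothetical, abridged label table; the standard lists many more labels.
_LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "latin1": "windows-1252",
    "iso-8859-1": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
}


def lookup_label(label):
    """Return a canonical encoding name, or None for an unknown label."""
    # Strip ASCII whitespace only, as the standard specifies.
    stripped = label.strip("\t\n\f\r ")
    # bytes.lower() only touches ASCII letters, which gives ASCII
    # case-insensitive matching without Unicode case folding.
    key = stripped.encode("utf-8").lower().decode("utf-8")
    return _LABELS.get(key)


print(lookup_label(" Latin1\n"), lookup_label("no-such-label"))
```

Note that latin1 resolves to windows-1252 rather than to ISO-8859-1, matching the web-compatibility rule described in the introduction.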
Canonical name of the encoding
The actual implementation of the encoding, a stdlib CodecInfo object. See codecs.register().
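For readers unfamiliar with CodecInfo, this is what the stdlib object offers when obtained directly through codecs.lookup(). The choice of cp1252 is just an example.

```python
import codecs

info = codecs.lookup("cp1252")  # a codecs.CodecInfo instance

# Stateless entry points return (result, number of input items consumed).
text, consumed_bytes = info.decode(b"\x80")
data, consumed_chars = info.encode("\u20ac")

# Incremental classes are also exposed, for chunked input.
decoder = info.incrementaldecoder(errors="replace")
print(repr(text), data, repr(decoder.decode(b"\x80")))
```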
The UTF-8 encoding. Should be used for new content and formats.
Decode a single string.
| Parameters: | |
|---|---|
| Raises: | LookupError for an unknown encoding label. |
| Returns: | A Unicode string. |
Encode a single string.
| Parameters: | |
|---|---|
| Raises: | LookupError for an unknown encoding label. |
| Returns: | A byte string. |
“Pull”-based decoder.
| Parameters: | |
|---|---|
| Raises: | LookupError for an unknown encoding label. |
| Returns: | An iterable of Unicode strings. |
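The pull model can be sketched as a generator that consumes an iterable of byte chunks. `iter_decode_sketch` is a hypothetical stand-in built on the stdlib's incremental decoders. It omits BOM handling and label lookup, so it is not this module's function.

```python
import codecs


def iter_decode_sketch(chunks, encoding="utf-8", errors="replace"):
    """Yield decoded text for byte chunks pulled from an iterable."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:  # skip empty results while bytes are still buffered
            yield text
    # Flush: an incomplete trailing sequence is reported here.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is b"\xc3\xa9" in UTF-8; here it is split across two chunks.
pieces = list(iter_decode_sketch([b"caf\xc3", b"\xa9!"]))
print(pieces)
```

The consumer drives the loop, and the decoder keeps the dangling lead byte until the next chunk arrives.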
“Pull”-based encoder.
| Parameters: | |
|---|---|
| Raises: | LookupError for an unknown encoding label. |
| Returns: | An iterable of byte strings. |
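The encoding direction of the pull model looks much the same. `iter_encode_sketch` is again a hypothetical illustration on top of the stdlib, with strict error handling as the assumed default.

```python
import codecs


def iter_encode_sketch(strings, encoding="utf-8", errors="strict"):
    """Yield encoded bytes for Unicode strings pulled from an iterable."""
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for string in strings:
        data = encoder.encode(string)
        if data:  # skip empty outputs
            yield data
    # Flush any state the encoder may still hold.
    tail = encoder.encode("", final=True)
    if tail:
        yield tail


encoded = list(iter_encode_sketch(["caf", "\u00e9", ""]))
print(encoded)
```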
“Push”-based decoder.
| Parameters: | |
|---|---|
| Raises: | LookupError for an unknown encoding label. |
| Returns: | An incremental decoder callable like this: |
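The exact signature of the returned callable is not reproduced in this copy. As an assumption-laden sketch of the push model in general, a closure over a stdlib incremental decoder could look like the following. `make_push_decoder` and `feed` are hypothetical names.

```python
import codecs


def make_push_decoder(encoding="utf-8", errors="replace"):
    """Return a callable that the caller pushes byte chunks into."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)

    def feed(data, final=False):
        # Returns whatever text is decodable so far; pass final=True
        # with the last chunk so that leftover bytes are reported.
        return decoder.decode(data, final)

    return feed


feed = make_push_decoder()
first = feed(b"\xe2\x82")            # incomplete: nothing to emit yet
second = feed(b"\xac", final=True)   # completes the euro sign
print(repr(first), repr(second))
```

Unlike the pull model, the caller decides when each chunk is fed, which suits event-driven sources such as network callbacks.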
“Push”-based encoder.
| Parameters: | |
|---|---|
| Raises: | LookupError for an unknown encoding label. |
| Returns: | An incremental encoder callable like this: |
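The returned callable's signature is likewise not shown in this copy. A hypothetical push-style encoder built on the stdlib, assuming strict error handling as the default, might be sketched as:

```python
import codecs


def make_push_encoder(encoding="utf-8", errors="strict"):
    """Return a callable that the caller pushes Unicode strings into."""
    encoder = codecs.getincrementalencoder(encoding)(errors)

    def feed(text, final=False):
        # Returns the bytes for this piece of text; final=True marks
        # the last call so that the encoder can flush its state.
        return encoder.encode(text, final)

    return feed


push = make_push_encoder()
out = push("h") + push("\u00e9", final=True)

# With strict handling, an unencodable character raises immediately.
try:
    make_push_encoder("ascii")("\u00e9", final=True)
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
print(out, strict_raised)
```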