Anolis 1.2

Documentation — 28 April 2013

1 Introduction
2 Installing Anolis
3 Using Anolis
4 Processes
Acknowledgements

1 Introduction

Anolis grew out of the need for long technical documents to include niceties such as cross-references and a table of contents for easy navigation. Maintaining these by hand is a great chore, especially when sections are numbered: adding one section changes the numbering of many others. It is therefore advantageous to generate them programmatically.

Anolis does this on HTML documents, as a number of sequential processes. Currently cross-referencing, section numbering, table of contents creation, and a number of substitutions are done (mainly relating to the current date).

2 Installing Anolis

2.1 Requirements

The following are the minimum requirements: later versions should also work without issue.

Python 2.6
lxml 2.0
html5lib 0.10

2.2 Obtaining a copy

Releases are occasionally made. A link to the latest release can be found on PyPI.

Alternatively, a copy can be obtained from our Mercurial repository: this is where our ongoing development occurs, and allows any revision (and therefore any release) to be downloaded. Our repository is located at https://bitbucket.org/ms2ger/anolis/.

2.3 Installation

Normally, installation is done through distribute, with the following command:

python setup.py install

Please see distribute's documentation for information on installation options (such as installing in non-standard locations).

2.4 Running the test suite

The source distribution and the current development copy (in Mercurial) both contain a test suite. It can be run with the following command:

python runtests.py

Any test failures should be reported at our bug tracker.

3 Using Anolis

Anolis is invoked through the anolis command. The --help (or -h) option gives some basic help.

The --enable and --disable options enable/disable respectively the process given as the option value (e.g., --disable=toc disables building the table of contents and numbering sections). The default processes are sub (substitution), toc (table of contents/section numbering), and xref (cross-referencing). Any enabled process is loaded via from processes import foo and, if that fails, import foo (where foo is the process), and is then called as foo.foo(ElementTree, **kwargs).
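The loading convention described above can be sketched roughly as follows. This is an illustration only, not the code Anolis itself uses: the load_process helper, the stamp process, its label option, and the use of the standard library's ElementTree in place of an lxml tree are all assumptions made for the example.

```python
import importlib
import sys
import types
import xml.etree.ElementTree as etree


def load_process(name):
    """Return the callable for a process, trying processes.<name> first."""
    try:
        module = importlib.import_module("processes." + name)
    except ImportError:
        # Fall back to a top-level module with the same name.
        module = importlib.import_module(name)
    # By convention the function shares its name with the module.
    return getattr(module, name)


def stamp(ElementTree, **kwargs):
    """Hypothetical process: mark the root element as processed."""
    ElementTree.getroot().set("data-stamped", kwargs.get("label", "yes"))


# Register a made-up top-level module so the fallback import can find it.
demo_module = types.ModuleType("stamp")
demo_module.stamp = stamp
sys.modules["stamp"] = demo_module

tree = etree.ElementTree(etree.fromstring("<html><body/></html>"))
process = load_process("stamp")
process(tree, label="demo")  # mirrors foo.foo(ElementTree, **kwargs)
```

A process written this way only needs to live in an importable module whose main function has the same name as the module.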

Some options alter what is used to parse and serialize the document: the --parser option allows either html5lib (the default) or lxml.html (this is quicker, but does not comply with the HTML specification) to be used to parse the input file, and the --serializer option allows the same two values, but controls the serializer used for output (note that lxml.html has some rather severe issues as a serializer).

The --output-encoding option sets the character encoding used for output; this defaults to UTF-8. Treatment of characters that cannot be represented in the chosen output encoding is dependent on the serializer selected via the --serializer option.

Anolis offers a compatibility mode, which aims to be compatible with the CSS3 module postprocessor (within reason). This is mainly provided for the sake of pre-existing W3C documents. The --w3c-compat option turns on this compatibility mode, although specific options that turn on just one compatibility feature at a time are also available (and are documented below under each process). These are all implied by the --w3c-compat option, with one exception: --w3c-compat-crazy-substitutions, as it can lead to undesirable results.

The options --newline-char and --indent-char set the newline and indent strings (they do not have to be a single character) respectively. They default to U+000A LINE FEED (LF) and U+0020 SPACE respectively. These are only used when generating large trees of generated markup, such as the table of contents.

Other process specific options are documented under the process to which they belong.

Upon a fatal error, processing of the document is terminated and the output file is left unchanged.

Interactive content is as defined in HTML: the a, bb, details, and datagrid elements; the audio and video elements when they have a controls attribute; the menu element when the type attribute is case-insensitively equal to toolbar.

When an id attribute is needed, it is created as follows:

1. Let i be equal to 0.
2. If the element already has an id attribute, return its value, and terminate this algorithm.
3. If the title attribute is present and its value is not empty and does not consist of ASCII whitespace only, let generated_id be equal to its value; otherwise, let generated_id be equal to textContent.
4. The generated_id is stripped of leading/trailing ASCII whitespace and converted to lowercase (behaviour of this is dependent on the current locale setting of Python).
5. The first of the following list whose condition matches the current state of the string is done:
   1. If generated_id is an empty string, generated_id is set to generatedID.
   2. If the --force-html4-id option is used, or the DOCTYPE's public identifier is one of:
      - -//W3C//DTD HTML 4.0//EN
      - -//W3C//DTD HTML 4.0 Transitional//EN
      - -//W3C//DTD HTML 4.0 Frameset//EN
      - -//W3C//DTD HTML 4.01//EN
      - -//W3C//DTD HTML 4.01 Transitional//EN
      - -//W3C//DTD HTML 4.01 Frameset//EN
      - ISO/IEC 15445:2000//DTD HyperText Markup Language//EN
      - ISO/IEC 15445:2000//DTD HTML//EN
      - -//W3C//DTD XHTML 1.0 Strict//EN
      - -//W3C//DTD XHTML 1.0 Transitional//EN
      - -//W3C//DTD XHTML 1.0 Frameset//EN
      - -//W3C//DTD XHTML 1.1//EN
      Then:
      1. All runs of characters apart from U+002D HYPHEN-MINUS (-), U+002E FULL STOP (.), U+0030 DIGIT ZERO to U+0039 DIGIT NINE (0–9), U+003A COLON (:), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z), U+005F LOW LINE (_), and U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z (a–z) are replaced by a single U+002D HYPHEN-MINUS (-) character within generated_id.
      2. Leading and trailing U+002D HYPHEN-MINUS (-) characters are removed from generated_id.
      3. If generated_id is not empty and if the first character is not in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z) or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z (a–z), generated_id is prefixed by a single U+0078 LATIN SMALL LETTER X (x) character.
   3. Otherwise, runs of characters that do not match the ifragment production in RFC 3987 are replaced by a single U+002D HYPHEN-MINUS (-) character within generated_id, and then leading and trailing U+002D HYPHEN-MINUS (-) characters are removed from generated_id.
6. If generated_id is empty, generated_id is set to generatedID.
7. Let output_id equal generated_id.
8. If output_id matches an already-existing ID, continue to the next step; otherwise, jump to step 12.
9. Increment i by one.
10. Let output_id equal generated_id suffixed with a U+002D HYPHEN-MINUS (-) character followed by i as a big-endian base 10 number.
11. Jump back to step 8.
12. The generated ID is output_id.
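
The algorithm above can be approximated in a few lines of Python. The following is a minimal sketch that covers only the HTML 4 branch (as if --force-html4-id were given) for an element without an existing id attribute; the generate_id name and its parameters are hypothetical, and the RFC 3987 branch is deliberately left out.

```python
import re

ASCII_WHITESPACE = " \t\n\f\r"


def generate_id(text, existing_ids):
    """Sketch of ID generation under HTML 4 rules (illustrative only).

    text is the title attribute or textContent; existing_ids is a set of
    IDs already present in the document.
    """
    generated_id = text.strip(ASCII_WHITESPACE).lower()
    if generated_id == "":
        generated_id = "generatedID"
    else:
        # Replace each run of disallowed characters with one hyphen.
        generated_id = re.sub(r"[^-.0-9:A-Z_a-z]+", "-", generated_id)
        generated_id = generated_id.strip("-")
        # HTML 4 IDs must start with a letter, so prefix with "x" if needed.
        if generated_id and not re.match(r"[A-Za-z]", generated_id):
            generated_id = "x" + generated_id
    if generated_id == "":
        generated_id = "generatedID"
    # Append -1, -2, ... until the ID is unique.
    output_id = generated_id
    i = 0
    while output_id in existing_ids:
        i += 1
        output_id = "%s-%d" % (generated_id, i)
    return output_id
```

For instance, calling generate_id("  Hello, World!  ", set()) yields hello-world. Note that str.lower() in Python 3 does not consult the locale, so this sketch differs from the locale-dependent behaviour described in the steps.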

4 Processes

The elements listed in the below processes, except where otherwise stated, are the local name of the element in null namespace.

4.1 Cross-referencing

Cross-referencing has two essential parts: definitions that define terms, and instances of those terms.

Terms are taken from the data-anolis-xref attribute if present, failing that the title attribute if that is present, otherwise from the textContent property of the dfn element. By default, Anolis will throw a fatal error if a term is defined more than once: this behaviour can be turned off (causing the final definition of the term to be the one that is used) by the --allow-duplicate-dfns option.

Definitions are marked up with the dfn element.

Instances are marked up with various elements, depending on the setting of --w3c-compat-xref-elements: if it is disabled (the default), the abbr, code, i, span, and var elements are used for instances; if it is enabled, the abbr, acronym, b, bdo, big, code, del, em, i, ins, kbd, label, legend, q, samp, small, span, strong, sub, sup, tt, and var elements are used for instances. Those that are only there in compatibility mode are there either because they should not semantically be used for an instance, or because they are not present in HTML. An instance is only used if it does not have an interactive content or dfn element as either a parent or a child.

Both definitions and instances are normalized as follows:

Leading and trailing ASCII whitespace is stripped,
The text is converted to lowercase (behaviour of this is dependent on the current locale setting of Python),
All consecutive ASCII whitespace is replaced with a single U+0020 SPACE CHARACTER, and
If --w3c-compat-xref-normalization is enabled, all characters apart from U+0020 SPACE CHARACTER, U+002D HYPHEN-MINUS (-), U+0030 DIGIT ZERO to U+0039 DIGIT NINE (0–9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z), and U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z (a–z) are removed.
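
The normalization steps above might look like this in Python. This is a hedged sketch rather than Anolis's actual code: the normalize_term name and the w3c_compat parameter (standing in for --w3c-compat-xref-normalization) are assumptions.

```python
import re


def normalize_term(term, w3c_compat=False):
    """Normalize a definition or instance string for matching."""
    # Strip leading/trailing ASCII whitespace, then lowercase.
    term = term.strip(" \t\n\f\r").lower()
    # Collapse each run of ASCII whitespace into a single space.
    term = re.sub(r"[ \t\n\f\r]+", " ", term)
    if w3c_compat:
        # Keep only spaces, hyphens, digits, and ASCII letters.
        term = re.sub(r"[^ \-0-9A-Za-z]", "", term)
    return term
```

Because both sides go through the same function, an instance written as "Converted  to Lowercase" matches a definition written as "converted to lowercase".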

If the instance is contained within a span element, the span element is turned into an a element, and a href attribute is added to link it to the definition (e.g., <span>foo</span> becomes <a href=#foo>foo</a>); all other attributes are preserved. Otherwise (when the instance is not contained within a span element), the location of the a element when linking an instance is dependent on the --w3c-compat-xref-a-placement option: if it is disabled (the default), the a element is placed around the element containing the instance (e.g., <i>foo</i> becomes <a href=#foo><i>foo</i></a>); if it is enabled, the a element goes within the element containing the instance and goes around all of its content (e.g., <i>foo</i> becomes <i><a href=#foo>foo</a></i>).

4.2 Table of contents/section numbering

To create a table of contents, and to number the sections of the document, an outline is created. An outline is a list of sections, which can each contain more sections, where a section represents a part of the document and often has a heading associated with it (for more detailed definitions see HTML). This means that not only are the h1–h6 elements supported, but elements such as section are also used to create the outline. After creating the outline, every section with a depth between those provided by --min-depth and --max-depth (defaulting to two and six respectively), and which has a heading, is numbered if it does not have no-num as a class, and is added to the table of contents if it does not have no-toc as a class. Sections without a heading are treated as if they did not exist, unless they have children, in which case they will appear to exist while not existing, all at once (e.g., they increment the section numbering, though that is not output anywhere; and they get a list item in the table of contents, with only the children within it, and no link to the section itself).

The format of section numbers should comply with ISO 2145:1978, Numbering of divisions and subdivisions in written documents. This means that each section number is given by Arabic numerals, separated by a single U+002E FULL STOP character, and there is no trailing U+002E FULL STOP character.
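
As a small illustration of this format, the sketch below keeps one counter per outline level and joins the counters with full stops. The function names are hypothetical, and depth here is counted from 1 at the shallowest numbered level, which is an assumption of the example.

```python
def next_number(counters, depth):
    """Advance the counter at depth (1-based) and drop deeper levels."""
    # Truncate counters deeper than depth, padding with zeros if needed.
    counters = counters[:depth] + [0] * (depth - len(counters))
    counters[depth - 1] += 1
    return counters


def format_section_number(counters):
    """Join Arabic numerals with full stops, with no trailing full stop."""
    return ".".join(str(n) for n in counters)


# Walk an example sequence of heading depths and collect the numbers.
counters = []
numbers = []
for depth in [1, 2, 2, 3, 1]:
    counters = next_number(counters, depth)
    numbers.append(format_section_number(counters))
```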

The section number is inserted as the first child node of the section heading as a span element with the class attribute set to secno: this is copied into the table of contents.

Pre-existing span elements with a class of secno are removed from all section headings, regardless of whether their depth falls within the range given by --min-depth and --max-depth.

The table of contents is built up as an ordered list (an ol element), with each section marked up as a li element, and child sections are marked up with an ol within that li (and this continues recursively, ad infinitum). By default, the root element of the table of contents (an ol element) is given a class attribute set to toc; however, with the --w3c-compat-class-toc option this is placed on every ol within the table of contents. The entire section heading is copied to be the content of the list item, with all interactive content elements and id attributes removed.
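
The nested list structure described above could be produced along these lines. This is a simplified sketch under stated assumptions: the outline is represented as hypothetical (heading text, children) tuples, the standard library's ElementTree stands in for lxml, and links, section numbers, and the copying of heading markup are omitted.

```python
import xml.etree.ElementTree as etree


def build_toc(sections):
    """Build an ol element from a list of (heading, children) tuples."""
    ol = etree.Element("ol")
    for heading, children in sections:
        li = etree.SubElement(ol, "li")
        li.text = heading
        if children:
            # Child sections become a nested ol inside this li.
            li.append(build_toc(children))
    return ol


outline = [
    ("Introduction", []),
    ("Processes", [("Substitution", [])]),
]
toc = build_toc(outline)
toc.set("class", "toc")  # default behaviour: class only on the root ol
markup = etree.tostring(toc, encoding="unicode")
```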

A normal comment substitution is done with sub_identifier equal to toc, and the table of contents as the replacement.

4.3 Substitution

Various strings are replaced in magic ways: a normal string substitution takes the form of [xxx] where xxx is case-sensitively the replacement, which may be followed by any characters apart from U+005D RIGHT SQUARE BRACKET (]) before the final U+005D RIGHT SQUARE BRACKET character — these extra characters are effectively a comment, and carry absolutely no meaning, and vanish into some as-of-yet unknown abyss when the string replacement is done. The entire string must be contained within a single text node.

A normal comment substitution is one where there is a string, sub_identifier, that identifies the comment for the substitution, and the replacement. All nodes between a comment with a value equal to (with leading and trailing ASCII whitespace removed) begin- followed by sub_identifier and one with a value equal to (with leading and trailing ASCII whitespace removed) end- followed by sub_identifier are removed, and replaced with the replacement. Additionally, any comment (with leading and trailing ASCII whitespace removed) with a value equal to sub_identifier is replaced with a comment with a value of begin- followed by sub_identifier, the replacement, and then a comment with a value of end- followed by sub_identifier.

The W3C status is given by the --w3c-status=status argument.

If that argument isn't given, it is found, when needed by one of the substitutions, by iterating all text nodes in document order (i.e., attribute values and comments have no effect), and for each node, the following is done (in this order):

If the node contains, case-insensitively, "latest", followed by one or more ASCII whitespace characters, followed by "version", searching stops, and the default is used (ED).
Otherwise, if the node, case-sensitively, contains "http://www.w3.org/TR/" followed by one of "MO", "WD", "CR", "PR", "REC", "PER", or "NOTE", which in turn is followed by U+002D HYPHEN-MINUS (-), then searching stops, and the status is whatever matched the previous list of options by the first match in the text node.

A side effect of doing it in this order is that if a node contains both of these possible strings, the latter is ignored, meaning that the default (ED) is used.
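
The status search can be sketched as below. The find_w3c_status name and the representation of the document as a list of text node strings are assumptions for illustration; returning ED when no node matches at all is also an assumption, since the text above only spells out the default for the "latest version" case.

```python
import re

LATEST_VERSION = re.compile(r"latest[ \t\n\f\r]+version", re.IGNORECASE)
TR_URL = re.compile(r"http://www\.w3\.org/TR/(MO|WD|CR|PR|REC|PER|NOTE)-")


def find_w3c_status(text_nodes, default="ED"):
    """Return the W3C status found in text nodes given in document order."""
    for text in text_nodes:
        # The "latest version" check comes first, so it wins within a node.
        if LATEST_VERSION.search(text):
            return default
        match = TR_URL.search(text)
        if match:
            return match.group(1)
    return default
```

A node such as "Latest version: http://www.w3.org/TR/REC-foo" therefore yields ED rather than REC, which is the side effect described above.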

There is also a long W3C status, which correlates to the W3C status under the following mapping:

W3C Status	Long W3C Status
MO	W3C Member-only Draft
ED	Editor's Draft
WD	W3C Working Draft
CR	W3C Candidate Recommendation
PR	W3C Proposed Recommendation
REC	W3C Recommendation
PER	W3C Proposed Edited Recommendation
NOTE	W3C Working Group Note

By default, the normal string substitutions are:

[DATE]: This is replaced with the current date for UTC±0 in the form of, e.g., 31 July 2008. The word used for the month is dependent on the current locale of Python. The number of the day of the month has no leading zeros.
[CDATE]: This is replaced with the current date for UTC±0 in the form YYYYMMDD, e.g., 20080731. This is a conforming ISO 8601:2004 date.
[YEAR]: This is replaced with the current year for UTC±0, in the form YYYY, e.g., 2008. This is a conforming ISO 8601:2004 year.
[TITLE]: This is replaced with the textContent of the first title element which is within the first head of the document, or an empty string if such a title element does not exist.
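
To make the string substitutions above concrete, here is a minimal sketch that applies them to the content of a single text node. The substitute function and its parameters are hypothetical; the optional now argument exists only so the example can use a fixed date.

```python
import datetime
import re


def substitute(text, title, now=None):
    """Apply the default string substitutions to one text node's content."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    replacements = {
        # Day without a leading zero; month name follows the locale.
        "DATE": "%d %s" % (now.day, now.strftime("%B %Y")),
        "CDATE": now.strftime("%Y%m%d"),
        "YEAR": now.strftime("%Y"),
        "TITLE": title,
    }
    for name, value in replacements.items():
        # [NAME] may carry trailing comment characters other than "]".
        pattern = r"\[" + name + r"[^\]]*\]"
        text = re.sub(pattern, lambda match, v=value: v, text)
    return text


fixed = datetime.datetime(2008, 7, 31, tzinfo=datetime.timezone.utc)
result = substitute("[TITLE], [DATE] ([CDATE]), (c) [YEAR: keep current]",
                    "Anolis", now=fixed)
```

The replacement is passed as a function so that backslashes in a title are inserted literally rather than being interpreted by re.sub.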

There is one comment substitution by default. Any nodes between a comment with a value equal to (with leading and trailing ASCII whitespace removed) begin-link and one with a value equal to end-link, with interactive content elements removed (though children of those elements preserved), are effectively wrapped in an a element which has a href attribute equal to the textContent of all the nodes between the two comments concatenated in document order. The two comments must have the same parent, otherwise a fatal error occurs.

If --w3c-compat-substitutions is enabled, the following normal string substitutions are done in addition to those above:

[STATUS]: This is replaced with the W3C status.
[LONGSTATUS]: This is replaced with the long W3C status.

Additionally, the following normal comment substitutions are done:

sub_identifier equal to logo: Replacement is equal to: <a href="http://www.w3.org/"><img alt="W3C" src="http://www.w3.org/Icons/w3c_home"/></a> (parsed as an XML fragment, and serialized into the output document in the needed format).
sub_identifier equal to copyright: Replacement is equal to: <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © [YEAR] <a href="http://www.w3.org/"><abbr title="World Wide Web Consortium">W3C</abbr></a>® (<a href="http://www.csail.mit.edu/"><abbr title="Massachusetts Institute of Technology">MIT</abbr></a>, <a href="http://www.ercim.org/"><abbr title="European Research Consortium for Informatics and Mathematics">ERCIM</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>, <a href="http://ev.buaa.edu.cn/">Beihang</a>), All Rights Reserved. W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>, <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a> and <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a> rules apply. (parsed as an XML fragment, and serialized into the output document in the needed format).

There is one further string substitution, and this is only done when --w3c-compat-crazy-substitutions is enabled (note that this is not included in --w3c-compat). A string of http://www.w3.org/StyleSheets/TR/W3C- followed by one or more characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z) is replaced with whatever http://www.w3.org/StyleSheets/TR/W3C-[STATUS] would evaluate to be. Like the normal string substitutions, this string must be contained in a single text node.

4.4 References

A "references" section can be generated automagically by enabling the refs process. There are two kinds of elements relevant to this process: reference instances and reference sections. Reference data is also required.

Reference data is to be given as a references.json file, in the data folder. The expected data structure is a set of name-value pairs, with a pair for each possible reference. The name is referred to as the reference abbreviation, and the value is to be a set with the following name-value pairs:

title: the title of the referenced document;
href: the URL of the referenced document;
authors: an array of authors or editors of the referenced document;
publisher: the organisation publishing the referenced document (optional);
isbn: for dead tree publications, their ISBN (optional).
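
As an illustration of this structure, the following sketch builds reference data in Python and serializes it as JSON. Every abbreviation, title, URL, and name in it is a made-up placeholder, and the missing_keys helper is hypothetical; treating title, href, and authors as required is inferred from the other two keys being marked optional.

```python
import json

REQUIRED_KEYS = ("title", "href", "authors")

# Placeholder entries only; none of these describe real documents.
references = {
    "EXAMPLE": {
        "title": "An Example Specification",
        "href": "http://example.com/spec/",
        "authors": ["A. N. Author", "E. D. Itor"],
        "publisher": "Example Organisation",  # optional
    },
    "MINIMAL": {
        "title": "A Minimal Entry",
        "href": "http://example.com/minimal/",
        "authors": ["A. N. Other"],
    },
}


def missing_keys(data):
    """Map each reference abbreviation to the required keys it lacks."""
    return {
        abbreviation: [key for key in REQUIRED_KEYS if key not in entry]
        for abbreviation, entry in data.items()
        if any(key not in entry for key in REQUIRED_KEYS)
    }


# The text that would be saved as references.json in the data folder.
serialized = json.dumps(references, indent=2, sort_keys=True)
```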

Reference instances are defined by span elements with a data-anolis-ref attribute. Their textContent must be the reference abbreviation of a reference defined in the reference data. A reference instance is either informative or normative: it is informative if the element has an informative class and normative otherwise.

Reference sections are div elements, wherein the reference lists will be constructed. In compatibility mode, two distinct reference sections are used: one with anolis-references-normative as its id, for those references with at least one normative instance, and one with anolis-references-informative as its id, for the others. When Anolis is not in compatibility mode, just one reference section is used, with an anolis-references id. In this case, the references for which all instances are informative are prefixed with "(Non-normative) ".

4.5 Cross-specification cross-referencing

The xspecxref process makes it possible to reference terms defined in other documents, with or without the cooperation of their authors. This process works in a way very similar to the xref process.

The essential parts of this process are instances of terms, which are defined in documents. These documents are referenced by their respective shortnames. The elements used to mark up the instances must have a data-anolis-spec attribute whose value is a shortname.

In order to generate links to the other documents, a JSON file with the list of defined terms for each document must be supplied in the xrefs subfolder of the data folder, and each of these files must be listed in the specs.json file. The specs.json file must contain a set of name-value mappings where the name is a shortname and the value is the URL of its document's list of defined terms, relative to the xrefs folder.

For example, the specs.json file could look as follows:

{
  "webidl": "webidl.json",
  "domcore": "dom/webdomcore.json",
  "html": "dom/html5.json",
  "csscascade": "css/css3cascade.json",
  "cssbackgrounds": "css/css4backgrounds.json"
}

Each list of defined terms must contain a set of name-value pairs, with the following names:

url: the URL of the document, with a trailing "#";
definitions: a set of name-value mappings where the names are the terms in the document, converted to lowercase, and the values are the IDs of the defining element in the document.

For example, the webdomcore.json file could look as follows:

{
  "url": "http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#",
  "definitions": {
    "converted to lowercase": "converted-to-lowercase",
    "concept-id": "concept-id"
  }
}
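
Given such a list of defined terms, a link target can presumably be formed by appending the ID to the url value, which is why the trailing "#" is needed. The sketch below illustrates that reading; the resolve_term helper is hypothetical and not part of Anolis.

```python
import json

# The example list of defined terms shown above, loaded from a string.
term_list = json.loads("""
{
  "url": "http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#",
  "definitions": {
    "converted to lowercase": "converted-to-lowercase",
    "concept-id": "concept-id"
  }
}
""")


def resolve_term(terms, term):
    """Return the link target for term, or None if it is not defined."""
    # Terms are stored in lowercase, so lowercase the lookup key too.
    fragment = terms["definitions"].get(term.lower())
    if fragment is None:
        return None
    return terms["url"] + fragment


link = resolve_term(term_list, "Converted to Lowercase")
```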

In order to create such a file, the --dump-xrefs option can be used. Calling Anolis with the --dump-xrefs=path argument will update the list of defined terms at path with the terms currently defined in the input document.

Before the first use, a JSON file must be provided at path, with the url property already present.

Acknowledgements

Thanks to Andrew Sidwell, Anne van Kesteren, Henri Sivonen, Ian Hickson, James Graham, Lachlan Hunt, Magnus Kristiansen, Michael(tm) Smith, and Philip Taylor for their ever needed help.

Special thanks to Geoffrey Sneddon, who created this tool.

Special thanks to Bert Bos for creating the CSS3 Module Postprocessor, on which this is partially based and with which it claims (with --w3c-compat) to be partially compatible. Further special thanks to Bert Bos for creating a number of things (especially the algorithm for finding the W3C status) that took the author of Anolis many hours to reverse engineer.
