Anolis grew out of the need for long technical documents to include niceties such as cross-references and a table of contents for easy navigation. Maintaining these by hand is a chore, especially when sections are numbered: adding one section changes the numbering of many others. It is therefore advantageous to generate them programmatically.
Anolis does this on HTML documents, as a number of sequential processes. Currently cross-referencing, section numbering, table of contents creation, and a number of substitutions are done (mainly relating to the current date).
The following are the minimum requirements: later versions should also work without issue.
Releases are occasionally made. A link to the latest release can be found on PyPI.
Alternatively, a copy can be obtained from our Mercurial repository: this is where our ongoing development occurs, and allows any revision (and therefore any release) to be downloaded. Our repository is located at https://bitbucket.org/ms2ger/anolis/.
Normally, installation is done through distribute, with the following command:
python setup.py install
Please see distribute's documentation for information on installation options (such as installing in non-standard locations).
The source distribution and the current development copy (in Mercurial) both contain a test suite. It can be run with the following command:
python runtests.py
Any test failures should be reported at our bug tracker.
Anolis is invoked through the anolis command. The --help (or -h) option gives some basic help.
The --enable and --disable options respectively enable and disable the process given as the option value (e.g., --disable=toc disables building the table of contents and numbering sections). The default processes are sub (substitution), toc (table of contents/section numbering), and xref (cross-referencing). Any enabled process is loaded via from processes import foo, and if that fails, import foo (where foo is the process), and is then called as foo.foo(ElementTree, **kwargs).
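The loading rule above can be sketched in Python as follows. This is a minimal illustration, not Anolis's actual code: the helper names load_process and run_processes are hypothetical, and importlib.import_module("processes.foo") is used as an approximation of from processes import foo.

```python
import importlib


def load_process(name):
    """Return the module for a process (hypothetical helper).

    Tries the bundled "processes" package first, then a top-level module.
    """
    try:
        return importlib.import_module("processes." + name)
    except ImportError:
        return importlib.import_module(name)


def run_processes(tree, names, **kwargs):
    """Call foo.foo(tree, **kwargs) for each enabled process, in order."""
    for name in names:
        module = load_process(name)
        getattr(module, name)(tree, **kwargs)
    return tree
```

Under this sketch, a third-party process only needs to be an importable module that exposes a callable with the same name as the module.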
Some options alter what is used to parse and serialize the document. The --parser option allows either html5lib (the default) or lxml.html (this is quicker, but does not comply with the HTML specification) to be used to parse the input file. The --serializer option allows the same two values, but controls the serializer used for output (note that lxml.html has some rather severe issues as a serializer).
The --output-encoding option sets the character encoding used for output; this defaults to UTF-8. Treatment of characters that cannot be represented in the set output encoding is dependent on the serializer selected via the --serializer option.
Anolis offers a compatibility mode, which aims to be compatible with the CSS3 module postprocessor (within reason). This is mainly provided for the sake of pre-existing W3C documents. The --w3c-compat option turns on this compatibility mode, although specific options that turn on just one compatibility feature at a time are also available (and are documented below under each process). These are all implied by the --w3c-compat option, with one exception: --w3c-compat-crazy-substitutions, as it can lead to undesirable results.
The options --newline-char and --indent-char set the newline and indent strings (they do not have to be a single character) respectively. They default to U+000A LINE FEED (LF) and U+0020 SPACE respectively. These are only used when generating large trees of generated markup, such as the table of contents.
Other process specific options are documented under the process to which they belong.
Upon a fatal error, processing of the document is terminated and the output file is left unchanged.
Interactive content is as defined in HTML: the a, bb, details, and datagrid elements; the audio and video elements when they have a controls attribute; the menu element when the type attribute is case-insensitively equal to toolbar.
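The definition above translates directly into a small predicate. The following is a sketch only; the function name is_interactive and its (tag, attributes) signature are assumptions made for illustration.

```python
def is_interactive(tag, attributes):
    """Return True if an element counts as interactive content.

    `tag` is the element's local name; `attributes` is a dict of its
    attributes (hypothetical signature).
    """
    if tag in {"a", "bb", "details", "datagrid"}:
        return True
    if tag in {"audio", "video"}:
        # Only interactive when the controls attribute is present.
        return "controls" in attributes
    if tag == "menu":
        # The type attribute is compared case-insensitively.
        return attributes.get("type", "").lower() == "toolbar"
    return False
```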
When an id attribute is needed, it is created as follows:
If the element already has an id attribute, return its value, and terminate this algorithm.
If the title attribute is present and its value is not empty and does not consist of ASCII whitespace only, let generated_id be equal to its value; otherwise, let generated_id be equal to textContent.
generatedID
.
If the --force-html4-id option is used, or the DOCTYPE's public identifier is one of:
-//W3C//DTD HTML 4.0//EN
-//W3C//DTD HTML 4.0 Transitional//EN
-//W3C//DTD HTML 4.0 Frameset//EN
-//W3C//DTD HTML 4.01//EN
-//W3C//DTD HTML 4.01 Transitional//EN
-//W3C//DTD HTML 4.01 Frameset//EN
ISO/IEC 15445:2000//DTD HyperText Markup Language//EN
ISO/IEC 15445:2000//DTD HTML//EN
-//W3C//DTD XHTML 1.0 Strict//EN
-//W3C//DTD XHTML 1.0 Transitional//EN
-//W3C//DTD XHTML 1.0 Frameset//EN
-//W3C//DTD XHTML 1.1//EN
generatedID
.
Except where otherwise stated, the element names listed in the processes below are the local names of elements in the null namespace.
Cross-referencing has three essential parts: terms, definitions that define those terms, and instances of those terms.
Terms are taken from the data-anolis-xref attribute if present, failing that the title attribute if that is present, otherwise from the textContent property of the dfn element. By default, Anolis will throw a fatal error if a term is defined more than once: this behaviour can be turned off (causing the final definition of the term to be the one that is used) by the --allow-duplicate-dfns option.
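The lookup order for terms and the duplicate check can be sketched with the standard library's ElementTree. This is an illustration under assumptions: the function names are hypothetical, normalization of terms is omitted, and a plain ValueError stands in for Anolis's fatal error.

```python
import xml.etree.ElementTree as ET


def term_for_definition(dfn):
    """Pick the term for a dfn element: data-anolis-xref, then title,
    then the element's text content."""
    for attribute in ("data-anolis-xref", "title"):
        if attribute in dfn.attrib:
            return dfn.get(attribute)
    return "".join(dfn.itertext())


def collect_definitions(root, allow_duplicates=False):
    """Map each term to its defining element.

    With allow_duplicates=True the final definition wins, mirroring
    --allow-duplicate-dfns; otherwise a duplicate raises an error.
    """
    definitions = {}
    for dfn in root.iter("dfn"):
        term = term_for_definition(dfn)
        if term in definitions and not allow_duplicates:
            raise ValueError("term defined more than once: %r" % term)
        definitions[term] = dfn
    return definitions
```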
Definitions are marked up with the dfn element.
Instances are marked up with various elements, depending on the setting of --w3c-compat-xref-elements: if it is disabled (the default), the abbr, code, i, span, and var elements are used for instances; if it is enabled, the abbr, acronym, b, bdo, big, code, del, em, i, ins, kbd, label, legend, q, samp, small, span, strong, sub, sup, tt, and var elements are used for instances. Those that are only there in compatibility mode are there because either they should not semantically be used for an instance, or because they are not present in HTML. An instance is only used if it does not have an interactive content or dfn element as either a parent or a child.
Both definitions and instances are normalized as follows:
If --w3c-compat-xref-normalization is enabled, all characters apart from U+0020 SPACE CHARACTER, U+002D HYPHEN-MINUS (-), U+0030 DIGIT ZERO to U+0039 DIGIT NINE (0–9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z), and U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z (a–z) are removed.
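The compatibility-mode character filter can be expressed as a single regular expression. The sketch below covers only that one step (any other normalization steps are not shown), and the function name is hypothetical.

```python
import re

# Everything except space, hyphen-minus, ASCII digits and ASCII letters.
_REMOVED = re.compile(r"[^ \-0-9A-Za-z]")


def w3c_compat_normalize(term):
    """Strip the characters removed under --w3c-compat-xref-normalization."""
    return _REMOVED.sub("", term)
```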
If the instance is contained within a span element, the span element is turned into an a element, and a href attribute is added to link it to the definition (e.g., <span>foo</span> becomes <a href=#foo>foo</a>); all other attributes are preserved. Otherwise (when the instance is not contained within a span element), the location of the a element when linking an instance is dependent on the --w3c-compat-xref-a-placement option: if it is disabled (the default), the a element is placed around the element containing the instance (e.g., <i>foo</i> becomes <a href=#foo><i>foo</i></a>); if it is enabled, the a element goes within the element containing the instance and goes around all of its content (e.g., <i>foo</i> becomes <i><a href=#foo>foo</a></i>).
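The three placements can be illustrated with the standard library's ElementTree. This is a sketch, not Anolis's implementation: link_instance and its a_inside parameter (standing in for --w3c-compat-xref-a-placement) are hypothetical names, and the caller is assumed to know the instance's parent.

```python
import xml.etree.ElementTree as ET


def link_instance(parent, element, target_id, a_inside=False):
    """Link an instance element to the definition with id `target_id`."""
    href = "#" + target_id
    if element.tag == "span":
        # span becomes a; existing attributes are kept.
        element.tag = "a"
        element.set("href", href)
    elif a_inside:
        # The a element wraps all content inside the instance element.
        link = ET.Element("a", {"href": href})
        link.text, element.text = element.text, None
        for child in list(element):
            element.remove(child)
            link.append(child)
        element.append(link)
    else:
        # The a element wraps the instance element itself.
        link = ET.Element("a", {"href": href})
        index = list(parent).index(element)
        parent.remove(element)
        link.tail, element.tail = element.tail, None
        link.append(element)
        parent.insert(index, link)
```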
To create a table of contents, and to number the sections of the document, an outline is created. This is a list of sections, which can each contain more sections, where a section represents a part of the document, and often has a heading associated with it (for more detailed definitions see HTML). This means that not only are the h1–h6 elements supported, but elements such as section are also used to create the outline. After creating the outline, every section with a depth between those provided by --min-depth and --max-depth (defaulting to two and six respectively), and which has a heading, is numbered if it does not have no-num as a class, and is added to the table of contents if it does not have no-toc as a class. Sections without a heading are treated as if they did not exist, unless they have children, in which case they appear to exist while not existing all at once (e.g., they increment the section numbering, though that is not output anywhere; and they get a list item in the table of contents, with only the children within it, and no link to the section itself).
The format of section numbers should comply with ISO 2145:1978, Numbering of divisions and subdivisions in written documents. This means that each section number is given by Arabic numerals, separated by a single U+002E FULL STOP character, and there is no trailing U+002E FULL STOP character.
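As an illustration of this number format, the sketch below numbers a simplified outline given as nested (heading, children) pairs. The data structure and the function name are assumptions; depth limits and the no-num class are deliberately left out.

```python
def number_sections(outline, prefix=()):
    """Return (section number, heading) pairs in document order.

    Numbers are Arabic numerals joined by single full stops, with no
    trailing full stop.
    """
    numbered = []
    for index, (heading, children) in enumerate(outline, start=1):
        number = prefix + (index,)
        numbered.append((".".join(str(part) for part in number), heading))
        numbered.extend(number_sections(children, number))
    return numbered
```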
The section number is inserted as the first child node of the section heading as a span element with the class attribute set to secno: this is copied into the table of contents.
Pre-existing span elements with a class of secno are removed from all section headings, regardless of whether their depth falls within the range given by --min-depth and --max-depth.
The table of contents is built up as an ordered list (an ol element), with each section marked up as a li element, and child sections are marked up with an ol within that li (and this continues recursively, ad infinitum). By default, the root element of the table of contents (an ol element) is given a class attribute set to toc; however, with the --w3c-compat-class-toc option this is placed on every ol within the table of contents. The entire section heading is copied to be the content of the list item, with all interactive content elements and id attributes removed.
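The recursive list structure can be sketched as follows. This is a simplification under stated assumptions: sections are given as nested (heading text, children) pairs, headings are copied as plain text without links, and the class_on_every_ol parameter is a hypothetical stand-in for --w3c-compat-class-toc.

```python
import xml.etree.ElementTree as ET


def build_toc(sections, is_root=True, class_on_every_ol=False):
    """Build a nested ol/li tree for the given sections."""
    ol = ET.Element("ol")
    if is_root or class_on_every_ol:
        ol.set("class", "toc")
    for heading, children in sections:
        li = ET.SubElement(ol, "li")
        li.text = heading
        if children:
            # Child sections become an ol nested inside this li.
            li.append(build_toc(children, False, class_on_every_ol))
    return ol
```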
A normal comment substitution is done with sub_identifier equal to toc, and the table of contents as the replacement.
Various strings are replaced in magic ways: a normal string substitution takes the form of [xxx] where xxx is case-sensitively the replacement, which may be followed by any characters apart from U+005D RIGHT SQUARE BRACKET (]) before the final U+005D RIGHT SQUARE BRACKET character. These extra characters are effectively a comment, carry absolutely no meaning, and vanish into some as-of-yet unknown abyss when the string replacement is done. The entire string must be contained within a single text node.
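A normal string substitution on the contents of one text node might look like the sketch below. The function name and the sample date are made up for illustration; only the bracket syntax comes from the description above.

```python
import re


def string_substitution(text, name, replacement):
    """Replace [NAME...] in a single text node's content.

    Anything between the name and the closing bracket is treated as a
    comment and discarded.
    """
    pattern = re.compile(r"\[" + re.escape(name) + r"[^\]]*\]")
    # A function is used so the replacement is inserted literally.
    return pattern.sub(lambda match: replacement, text)
```

Because the match is case-sensitive, [date] is left alone when substituting DATE.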
A normal comment substitution is one where there is a string, sub_identifier, that identifies the comment for the substitution, and the replacement. All nodes between a comment with a value equal to (with leading and trailing ASCII whitespace removed) begin- followed by sub_identifier and one with a value equal to (with leading and trailing ASCII whitespace removed) end- followed by sub_identifier are removed, and replaced with the replacement. Additionally, any comment (with leading and trailing ASCII whitespace removed) with a value equal to sub_identifier is replaced with a comment with a value of begin- followed by sub_identifier, the replacement, and then a comment with a value of end- followed by sub_identifier.
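The two forms of comment substitution can be sketched on a flat list of sibling nodes. This is an illustration under assumptions: the Comment class and the list-of-nodes model are hypothetical stand-ins for a real document tree, and both marker comments are assumed to be siblings.

```python
from dataclasses import dataclass

ASCII_WHITESPACE = " \t\n\f\r"


@dataclass
class Comment:
    value: str


def comment_substitution(nodes, sub_identifier, replacement):
    """Return a new node list with the substitution applied."""
    begin = "begin-" + sub_identifier
    end = "end-" + sub_identifier
    result = []
    skipping = False
    for node in nodes:
        value = node.value.strip(ASCII_WHITESPACE) if isinstance(node, Comment) else None
        if skipping:
            # Drop old content until the matching end comment.
            if value == end:
                result.append(node)
                skipping = False
        elif value == begin:
            result.append(node)
            result.extend(replacement)
            skipping = True
        elif value == sub_identifier:
            # A bare marker expands to begin, replacement, end.
            result.append(Comment(begin))
            result.extend(replacement)
            result.append(Comment(end))
        else:
            result.append(node)
    return result
```

Running the substitution twice yields the same result, which is what allows a generated document to be fed back in as input.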
The W3C status is given by the --w3c-status=status argument.
If that argument isn't given, it is found, when needed by one of the substitutions, by iterating all text nodes in document order (i.e., attribute values and comments have no effect), and for each node, the following is done (in this order):
A side-effect of doing it in this order is that if a node contains both of these possible strings, the latter is ignored, meaning that the default (ED) is used.
There is also a long W3C status, which correlates to the W3C status under the following mapping:
W3C Status | Long W3C Status |
---|---|
MO | W3C Member-only Draft |
ED | Editor's Draft |
WD | W3C Working Draft |
CR | W3C Candidate Recommendation |
PR | W3C Proposed Recommendation |
REC | W3C Recommendation |
PER | W3C Proposed Edited Recommendation |
NOTE | W3C Working Group Note |
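The mapping in the table above is a plain lookup. A minimal sketch, with hypothetical names, could be:

```python
# W3C status -> long W3C status, as listed in the table above.
LONG_STATUS = {
    "MO": "W3C Member-only Draft",
    "ED": "Editor's Draft",
    "WD": "W3C Working Draft",
    "CR": "W3C Candidate Recommendation",
    "PR": "W3C Proposed Recommendation",
    "REC": "W3C Recommendation",
    "PER": "W3C Proposed Edited Recommendation",
    "NOTE": "W3C Working Group Note",
}


def long_status(status):
    """Return the long form of a W3C status code."""
    return LONG_STATUS[status]
```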
By default, the normal string substitutions are:
[DATE]
[CDATE]
[YEAR]
[TITLE]
title element which is within the first head of the document, or an empty string if such a title element does not exist.
There is one comment substitution by default. Any nodes between a comment with a value equal to (with leading and trailing ASCII whitespace removed) begin-link and one with a value equal to end-link, with interactive content elements removed (though children of those elements preserved), are effectively wrapped in an a element which has a href attribute equal to the textContent of all the nodes between the two comments concatenated in document order. The two comments must have the same parent, otherwise a fatal error occurs.
If --w3c-compat-substitutions is enabled, the following normal string substitutions are done in addition to those above:
[STATUS]
[LONGSTATUS]
Additionally, the following normal comment substitutions are done:
logo
<p><a href="http://www.w3.org/"><img alt="W3C" src="http://www.w3.org/Icons/w3c_home"/></a></p> (parsed as an XML fragment, and serialized into the output document in the needed format).
copyright
<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © [YEAR] <a href="http://www.w3.org/"><abbr title="World Wide Web Consortium">W3C</abbr></a><sup>®</sup> (<a href="http://www.csail.mit.edu/"><abbr title="Massachusetts Institute of Technology">MIT</abbr></a>, <a href="http://www.ercim.org/"><abbr title="European Research Consortium for Informatics and Mathematics">ERCIM</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>, <a href="http://ev.buaa.edu.cn/">Beihang</a>), All Rights Reserved. W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>, <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a> and <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a> rules apply.</p> (parsed as an XML fragment, and serialized into the output document in the needed format).
There is one further string substitution, and this is only done when --w3c-compat-crazy-substitutions is enabled (note that this is not included in --w3c-compat). A string of http://www.w3.org/StyleSheets/TR/W3C- followed by one or more characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z) is replaced with whatever http://www.w3.org/StyleSheets/TR/W3C-[STATUS] would evaluate to be. Like the normal string substitutions, this string must be contained in a single text node.
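This stylesheet rewrite amounts to one regular expression applied to a text node's content. The sketch below assumes the W3C status has already been determined; the function name is hypothetical.

```python
import re

_PREFIX = "http://www.w3.org/StyleSheets/TR/W3C-"
# The prefix followed by one or more ASCII capital letters.
_STYLESHEET = re.compile(re.escape(_PREFIX) + r"[A-Z]+")


def crazy_substitution(text, status):
    """Point W3C stylesheet URLs in `text` at the given status."""
    return _STYLESHEET.sub(lambda match: _PREFIX + status, text)
```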
A "references" section can be generated automagically by enabling the refs process. There are two kinds of elements relevant to this process: reference instances and reference sections. Reference data is also required.
Reference data is to be given as a references.json file, in the data folder. The expected data structure is a set of name-value pairs, with a pair for each possible reference. The name is referred to as the reference abbreviation, and the value is to be a set with the following name-value pairs:
title: the title of the referenced document;
href: the URL of the referenced document;
authors: an array of authors or editors of the referenced document;
publisher: the organisation publishing the referenced document (optional);
isbn: for dead tree publications, their ISBN (optional).
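To make the expected shape concrete, the sketch below parses a sample references.json document and checks each entry against the names listed above. The EXAMPLE entry and its values are invented for illustration, and check_references is a hypothetical helper rather than part of Anolis.

```python
import json

REQUIRED = {"title", "href", "authors"}
OPTIONAL = {"publisher", "isbn"}

# Invented sample data; a real file maps each reference abbreviation
# to an object of this shape.
SAMPLE = json.loads("""
{
  "EXAMPLE": {
    "title": "An Example Specification",
    "href": "https://example.org/spec/",
    "authors": ["A. N. Editor"],
    "publisher": "Example Organisation"
  }
}
""")


def check_references(data):
    """Return a list of problems found in reference data."""
    problems = []
    for abbreviation, entry in data.items():
        missing = REQUIRED - set(entry)
        unknown = set(entry) - REQUIRED - OPTIONAL
        if missing:
            problems.append("%s: missing %s" % (abbreviation, ", ".join(sorted(missing))))
        if unknown:
            problems.append("%s: unknown %s" % (abbreviation, ", ".join(sorted(unknown))))
    return problems
```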
Reference instances are defined by span elements with a data-anolis-ref attribute. Their textContent must be the reference abbreviation of a reference defined in the reference data. A reference instance is either informative or normative: it is informative if the element has an informative class and normative otherwise.
Reference sections are div elements, wherein the reference lists will be constructed.
In compatibility mode, two distinct reference sections are used: one with anolis-references-normative as its id, for those references with at least one normative instance, and one with anolis-references-informative as its id, for the others. When Anolis is not in compatibility mode, just one reference section is used, with an anolis-references id. In this case, the references for which all instances are informative are prefixed with "(Non-normative) ".
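The rule that a reference is normative as soon as one of its instances is normative can be sketched as follows. The input format, a sequence of (abbreviation, is_informative) pairs, and the function name are assumptions made for this example.

```python
def split_references(instances):
    """Split reference abbreviations into (normative, informative) lists.

    A reference is informative only if every one of its instances is
    informative. Order of first appearance is preserved.
    """
    seen = []
    normative = set()
    for abbreviation, is_informative in instances:
        if abbreviation not in seen:
            seen.append(abbreviation)
        if not is_informative:
            normative.add(abbreviation)
    return (
        [name for name in seen if name in normative],
        [name for name in seen if name not in normative],
    )
```

In compatibility mode the two lists would feed the two reference sections; otherwise the second list identifies the entries that get the "(Non-normative) " prefix.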
The xspecxref process makes it possible to reference terms defined in other documents, with or without the cooperation of their authors. This process works in a way very similar to the xref process.
The essential parts of this process are instances of terms, which are defined in documents. These documents are referenced by their respective shortnames. The elements used to mark up the instances must have a data-anolis-spec attribute whose value is a shortname.
In order to generate links to the other document, a JSON file with the list of defined terms for each document must be supplied in the xrefs subfolder of the data folder, and must be listed in the specs.json file. This file must contain a set of name-value mappings where the name is a shortname and the value is the URL of its document's list of defined terms, relative to the xrefs folder.
For example, the specs.json file could look as follows:
{ "webidl": "webidl.json", "domcore": "dom/webdomcore.json", "html": "dom/html5.json", "csscascade": "css/css3cascade.json", "cssbackgrounds": "css/css4backgrounds.json" }
Each list of defined terms must contain a set of name-value pairs, with the following names:
url: the URL of the document, with a trailing "#";
definitions: a set of name-value mappings where the names are the terms in the document, converted to lowercase, and the values are the IDs of the defining element in the document.
For example, the webdomcore.json file could look as follows:
{ "url": "http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#", "definitions": { "converted to lowercase": "converted-to-lowercase", "concept-id": "concept-id" } }
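Given a list of defined terms in this shape, building the link target for an instance is a lookup plus a concatenation. The following sketch uses the example data above; resolve_term is a hypothetical helper name.

```python
TERM_LIST = {
    "url": "http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#",
    "definitions": {
        "converted to lowercase": "converted-to-lowercase",
        "concept-id": "concept-id",
    },
}


def resolve_term(term_list, term):
    """Return the URL of a term's definition, or None if it is unknown.

    Terms are stored in lowercase, so the lookup lowercases its input.
    """
    fragment = term_list["definitions"].get(term.lower())
    if fragment is None:
        return None
    # The url value already ends in "#", so the ID is simply appended.
    return term_list["url"] + fragment
```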
In order to create such a file, the --dump-xrefs option can be used. Calling Anolis with the --dump-xrefs=path argument will update the list of defined terms at path with the terms currently defined in the input document. Before the first use, a JSON file must be provided at path, with the url property already present.
Thanks to Andrew Sidwell, Anne van Kesteren, Henri Sivonen, Ian Hickson, James Graham, Lachlan Hunt, Magnus Kristiansen, Michael(tm) Smith, and Philip Taylor for their ever needed help.
Special thanks to Geoffrey Sneddon, who created this tool.
Special thanks to Bert Bos for creating the CSS3 Module Postprocessor, on which this is partially based, and with which (with --w3c-compat) it claims to be partially compatible. Further special thanks to Bert Bos for creating a number of things (especially the algorithm for finding the W3C status) that took the author of Anolis many hours to reverse engineer.