tacl package

Subpackages

Submodules
tacl.catalogue module

class tacl.catalogue.Catalogue
    Bases: dict

    generate(path, label)
        Creates default data from the corpus at path, marking all works with
        label.

        Parameters:
            path (str) – path to a corpus directory
            label (str) – label to categorise each work as

    labels
        Returns the distinct labels defined in the catalogue.

        Return type: list
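For illustration, a catalogue can be thought of as a mapping from work names to labels. The following sketch is not tacl's implementation; the property-style labels accessor and the work names are assumptions made for the example.

```python
# Illustrative sketch only: a minimal stand-in for tacl.catalogue.Catalogue,
# assuming it maps work names to labels as a plain dict subclass.
class CatalogueSketch(dict):
    @property
    def labels(self):
        # Distinct labels, sorted for reproducibility.
        return sorted(set(self.values()))

catalogue = CatalogueSketch()
catalogue['T0001'] = 'Maybe'
catalogue['T0002'] = 'Maybe'
catalogue['T0026'] = 'No'
print(catalogue.labels)  # ['Maybe', 'No']
```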
tacl.colour module

Module containing functions to generate distinct colours.

tacl.colour.generate_colours(n)
    Returns a list of n distinct colours, each represented as an RGB string
    suitable for use in CSS.

    Based on the code at
    http://martin.ankerl.com/2009/12/09/how-to-create-random-colors-programmatically/

    Parameters:
        n (int) – number of colours to generate

    Return type: list of str

tacl.colour.hsv_to_rgb(h, s, v)
    Converts a colour specified in HSV (hue, saturation, value) to an RGB
    string.

    Based on the algorithm at
    https://en.wikipedia.org/wiki/HSL_and_HSV#Converting_to_RGB

    Parameters:
        h (float) – hue, a value between 0 and 1
        s (float) – saturation, a value between 0 and 1
        v (float) – value, a value between 0 and 1

    Return type: str
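For illustration, the two functions can be sketched with the standard library's colorsys module. This is not tacl's code: evenly spaced hues are a simpler scheme than the golden-ratio method in the linked article, and colorsys stands in for the documented hsv_to_rgb.

```python
import colorsys

# Sketch: convert HSV (all components in [0, 1]) to a CSS hex RGB string.
def hsv_to_rgb_css(h, s, v):
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return '#{:02x}{:02x}{:02x}'.format(
        round(r * 255), round(g * 255), round(b * 255))

# Sketch: n visually distinct colours from evenly spaced hues.
def generate_colours_sketch(n):
    return [hsv_to_rgb_css(i / n, 0.7, 0.95) for i in range(n)]

print(hsv_to_rgb_css(0, 1, 1))           # '#ff0000'
print(len(generate_colours_sketch(5)))   # 5
```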
tacl.constants module

Module containing constants.

tacl.corpus module

Module containing the Corpus class.
class tacl.corpus.Corpus(path, tokenizer)
    Bases: object

    A Corpus represents a collection of WitnessTexts.

    A Corpus is built from a directory that contains the text files that
    become WitnessText objects.

    get_sigla(work)
        Returns a list of all of the sigla for work.

        Parameters:
            work (str) – name of work

        Return type: list of str

    get_witness(work, siglum, text_class=<class 'tacl.text.WitnessText'>)
        Returns a WitnessText representing the file associated with work and
        siglum.

        Combined, work and siglum form the basis of a filename for
        retrieving the text.

        Parameters:
            work (str) – name of work
            siglum (str) – siglum of witness

        Return type: WitnessText
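For illustration only, assuming a layout in which each work is a directory of <siglum>.txt files (the exact filename convention is an assumption, not specified here), get_sigla might be sketched as:

```python
import os
import tempfile

# Sketch of one plausible corpus layout; the real Corpus may differ.
def get_sigla_sketch(corpus_path, work):
    work_dir = os.path.join(corpus_path, work)
    return sorted(os.path.splitext(name)[0] for name in os.listdir(work_dir))

with tempfile.TemporaryDirectory() as corpus_path:
    os.makedirs(os.path.join(corpus_path, 'T0001'))
    for siglum in ('base', '元', '明'):
        with open(os.path.join(corpus_path, 'T0001', siglum + '.txt'),
                  'w') as fh:
            fh.write('text')
    print(get_sigla_sketch(corpus_path, 'T0001'))  # ['base', '元', '明']
```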
tacl.data_store module

Module containing the DataStore class.

class tacl.data_store.DataStore(db_name, use_memory=True, ram=0)
    Bases: object

    Class representing the data store for text data.

    It provides an interface to the underlying database, with methods to add
    and query data.

    add_ngrams(corpus, minimum, maximum, catalogue=None)
        Adds n-gram data from corpus to the data store.

        Parameters:
            corpus (Corpus) – corpus of works
            minimum (int) – minimum n-gram size
            maximum (int) – maximum n-gram size
            catalogue (Catalogue) – optional catalogue to limit the corpus to

    counts(catalogue, output_fh)
        Returns output_fh populated with CSV results giving the n-gram
        counts of the witnesses of the works in catalogue.

        Parameters:
            catalogue (Catalogue) – catalogue matching filenames to labels
            output_fh (file-like object) – object to output results to

        Return type: file-like object

    diff(catalogue, tokenizer, output_fh)
        Returns output_fh populated with CSV results giving the n-grams that
        are unique to the witnesses of each labelled set of works in
        catalogue.

        Note that this is not the same as the symmetric difference of these
        sets, except in the case where there are only two labels.

        Parameters:
            catalogue (Catalogue) – catalogue matching filenames to labels
            tokenizer (Tokenizer) – tokenizer for the n-grams
            output_fh (file-like object) – object to output results to

        Return type: file-like object
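For illustration of the diff semantics (the set logic only, none of the CSV or database machinery): for each label, diff keeps the n-grams found under that label and no other. With three labels an n-gram shared by two labels is dropped, whereas a symmetric difference would keep any n-gram found under an odd number of labels.

```python
# Made-up n-grams grouped by label, for the example only.
ngrams_by_label = {
    'A': {'一二', '三四', '五六'},
    'B': {'一二', '七八'},
    'C': {'一二', '三四', '九十'},
}

# For each label, keep the n-grams that occur under no other label.
def diff_sketch(ngrams_by_label):
    result = {}
    for label, ngrams in ngrams_by_label.items():
        others = set().union(
            *(v for k, v in ngrams_by_label.items() if k != label))
        result[label] = ngrams - others
    return result

print(diff_sketch(ngrams_by_label))
# {'A': {'五六'}, 'B': {'七八'}, 'C': {'九十'}}
```

Note that '三四' (shared by A and C) and '一二' (shared by all three) are dropped entirely; with only two labels, diff and symmetric difference coincide.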
    diff_asymmetric(catalogue, prime_label, tokenizer, output_fh)
        Returns output_fh populated with CSV results giving the difference
        in n-grams between the witnesses of the labelled sets of works in
        catalogue, limited to those works labelled with prime_label.

        Parameters:
            catalogue (Catalogue) – catalogue matching filenames to labels
            prime_label (str) – label to limit results to
            tokenizer (Tokenizer) – tokenizer for the n-grams
            output_fh (file-like object) – object to output results to

        Return type: file-like object

    diff_supplied(results_filenames, labels, tokenizer, output_fh)
        Returns output_fh populated with CSV results giving the n-grams that
        are unique to the witnesses in each set of works represented by
        results_filenames, using the labels in labels.

        Note that this is not the same as the symmetric difference of these
        sets, except in the case where there are only two labels.

        Parameters:
            results_filenames (list of str) – results filenames to be diffed
            labels (list) – labels to be applied to the results sets
            tokenizer (Tokenizer) – tokenizer for the n-grams
            output_fh (file-like object) – object to output results to

        Return type: file-like object

    intersection(catalogue, output_fh)
        Returns output_fh populated with CSV results giving the intersection
        in n-grams of the witnesses of the labelled sets of works in
        catalogue.

        Parameters:
            catalogue (Catalogue) – catalogue matching filenames to labels
            output_fh (file-like object) – object to output results to

        Return type: file-like object

    intersection_supplied(results_filenames, labels, output_fh)
        Returns output_fh populated with CSV results giving the n-grams that
        are common to the witnesses in every set of works represented by
        results_filenames, using the labels in labels.

        Parameters:
            results_filenames (list of str) – results filenames to be
                intersected
            labels (list) – labels to be applied to the results sets
            output_fh (file-like object) – object to output results to

        Return type: file-like object

    search(catalogue, ngrams, output_fh)
        Returns output_fh populated with CSV results for each witness that
        contains at least one of the n-grams in ngrams.

        Parameters:
            catalogue (Catalogue) – catalogue matching filenames to labels
            ngrams (list) – n-grams to search for
            output_fh (file-like object) – object to write results to

        Return type: file-like object
tacl.exceptions module

exception tacl.exceptions.MalformedCatalogueError(msg)
    Bases: tacl.exceptions.TACLError

exception tacl.exceptions.MalformedQueryError(msg)
    Bases: tacl.exceptions.TACLError
tacl.highlighter module

Module containing the Highlighter class.

class tacl.highlighter.HighlightReport(corpus, tokenizer)
    Bases: tacl.report.Report

class tacl.highlighter.NgramHighlightReport(corpus, tokenizer)
    Bases: tacl.highlighter.HighlightReport

    generate(output_dir, work, ngrams, labels, minus_ngrams)
        Generates HTML reports for each witness to work, showing its text
        with the n-grams in ngrams highlighted.

        Any n-grams in minus_ngrams have any highlighting of them (or of
        subsets of them) removed.

        Parameters:
            output_dir (str) – directory to write the report to
            work (str) – name of work to highlight
            ngrams (list of list of str) – groups of n-grams to highlight
            labels (list of str) – labels for the groups of n-grams
            minus_ngrams (list of str) – n-grams to remove highlighting from

        Return type: str

class tacl.highlighter.ResultsHighlightReport(corpus, tokenizer)
    Bases: tacl.highlighter.HighlightReport

    generate(output_dir, work, matches_filename)
        Generates HTML reports showing the text of each witness to work with
        the matches in matches_filename highlighted.

        Parameters:
            output_dir (str) – directory to write the report to
            work (str) – name of work to highlight
            matches_filename (str) – file containing the matches to highlight

        Return type: str
tacl.jitc module

class tacl.jitc.JitCReport(store, corpus, tokenizer)
    Bases: tacl.report.Report

    Generates statistics to list the works from one corpus (referred to
    below as “Maybe” and defined in a catalogue file) in order of similarity
    to each work in that corpus. Takes into account a second corpus of works
    (referred to below as “No” and defined in the same catalogue file) that
    are similar to those in the first, but not in the way(s) that are the
    subject of the investigation.

    Given the two corpora, Maybe and No, the script performs the following
    actions:

    1. For each work Y in Maybe:
       1. Run an intersection between Y and No.
       2. For each work M in Maybe (excluding Y):
          1. Run an intersection between Y and M.
          2. Run a supplied diff between the results from [1.2.1] and the
             results from [1.1].
          3. Get the number of tokens in M.
       3. Rank and list the works in Maybe in descending order of the ratio
          of matching tokens (n-gram size × count), from [1.2.2], to total
          tokens, from [1.2.3].
    2. Concatenate all results from [1.2.2] and present them in an HTML
       report.

    Note that in the above, when a work is treated as Y, its witnesses are
    not treated separately. The statistics derived from queries including it
    are those that treat all of its witnesses together; e.g., if two n-grams
    in a witness of M are found only in two different witnesses of Y, both
    will be counted as shared.
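The ranking step can be illustrated with made-up numbers; similarity_ratio and the (size, count) pairs are names invented for the example, not tacl's API.

```python
# Each match contributes (n-gram size × count) tokens; works are ranked by
# the ratio of matching tokens to the total tokens in M.
def similarity_ratio(matches, total_tokens):
    matching_tokens = sum(size * count for size, count in matches)
    return matching_tokens / total_tokens

# Invented data: per work, a list of (size, count) matches and a token total.
works = {
    'M1': ([(2, 10), (3, 4)], 1000),  # (20 + 12) / 1000 = 0.032
    'M2': ([(2, 50)], 2000),          # 100 / 2000     = 0.05
}
ranking = sorted(works, key=lambda w: similarity_ratio(*works[w]),
                 reverse=True)
print(ranking)  # ['M2', 'M1']
```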
tacl.report module

class tacl.report.Report
    Bases: object

    Base class for HTML reports.

    Subclasses should implement the generate method, used to generate and
    output the report into the supplied directory. The calling code is
    responsible for ensuring that the directory exists and is writable.

    The intention is that the same Report object can be used to produce
    multiple reports in a given context, such as a single corpus. The
    contextual data should be supplied to the __init__ method, and the data
    for a single specific report passed to the generate method.
tacl.results module

Module containing the Results class.

class tacl.results.Results(matches, tokenizer)
    Bases: object

    Class representing a set of n-gram results.

    Provides methods for manipulating those results in ways that maintain
    the same structure (CSV, same field names). Those methods that modify
    the fields do so in relatively minor ways that often allow the other
    methods to still operate on the results.

    A method’s modifications to the field names, if any, are specified in
    that method’s docstring.

    add_label_count()
        Adds to each result row a count of the number of occurrences of
        that n-gram across all works within the label.

        This count uses the highest witness count for each work.

    bifurcated_extend(corpus, max_size)
        Replaces the results with those n-grams that contain any of the
        original n-grams, and that represent points at which an n-gram is a
        constituent of multiple larger n-grams with a lower label count.

        Parameters:
            corpus (Corpus) – corpus of works to which the results belong
            max_size (int) – maximum size of n-gram results to include

    collapse_witnesses()
        Groups together witnesses of the same work that have the same count
        for an n-gram, and outputs a single row for each group.

        This output replaces the siglum field with a sigla field that
        provides a space-separated list of the witness sigla. Due to this,
        it is not necessarily possible to run other Results methods on
        results that have had their witnesses collapsed.
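The grouping performed by collapse_witnesses can be sketched on plain dictionaries. The field names follow the documentation; the row structure is simplified for the example and is not tacl's internal representation.

```python
from collections import defaultdict

# Merge rows sharing n-gram, work and count; the siglum field becomes a
# space-separated sigla field.
def collapse_witnesses_sketch(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[(row['ngram'], row['work'], row['count'])].append(
            row['siglum'])
    return [
        {'ngram': n, 'work': w, 'count': c, 'sigla': ' '.join(sorted(sigla))}
        for (n, w, c), sigla in groups.items()
    ]

rows = [
    {'ngram': '一二', 'work': 'T0001', 'siglum': 'base', 'count': 3},
    {'ngram': '一二', 'work': 'T0001', 'siglum': '元', 'count': 3},
    {'ngram': '一二', 'work': 'T0001', 'siglum': '明', 'count': 2},
]
# The two count-3 witnesses merge into one row with sigla 'base 元'.
print(collapse_witnesses_sketch(rows))
```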
    csv(fh)
        Writes the report data to fh in CSV format and returns it.

        Parameters:
            fh (file object) – file to write data to

        Return type: file object

    extend(corpus)
        Adds rows for all longer forms of the n-grams in the results that
        are present in the witnesses.

        This works with both diff and intersect results.

        Parameters:
            corpus (Corpus) – corpus of works to which the results belong

    prune_by_ngram(ngrams)
        Removes result rows whose n-gram is in ngrams.

        Parameters:
            ngrams (list of str) – n-grams to remove

    prune_by_ngram_count(minimum=None, maximum=None)
        Removes result rows whose total n-gram count (across all works
        bearing the n-gram) is outside the range specified by minimum and
        maximum.

        For each work, the count used as part of the sum across all works
        is the maximum count across the witnesses of that work.

        Parameters:
            minimum (int) – minimum n-gram count
            maximum (int) – maximum n-gram count
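The counting rule of prune_by_ngram_count — the maximum count across witnesses per work, summed across works — can be sketched as follows (simplified row dictionaries, not tacl's internal representation):

```python
from collections import defaultdict

# Keep rows whose n-gram's total count — per work the maximum over its
# witnesses, summed over works — lies within [minimum, maximum].
def prune_by_ngram_count_sketch(rows, minimum=None, maximum=None):
    per_work = defaultdict(dict)
    for row in rows:
        work_counts = per_work[row['ngram']]
        work_counts[row['work']] = max(
            work_counts.get(row['work'], 0), row['count'])
    totals = {ngram: sum(counts.values())
              for ngram, counts in per_work.items()}

    def keep(total):
        return ((minimum is None or total >= minimum) and
                (maximum is None or total <= maximum))

    return [row for row in rows if keep(totals[row['ngram']])]

rows = [
    {'ngram': '一二', 'work': 'T0001', 'siglum': 'base', 'count': 2},
    {'ngram': '一二', 'work': 'T0001', 'siglum': '元', 'count': 5},
    {'ngram': '一二', 'work': 'T0002', 'siglum': 'base', 'count': 1},
]
# Total for '一二' is max(2, 5) + 1 = 6.
print(len(prune_by_ngram_count_sketch(rows, minimum=7)))  # 0
print(len(prune_by_ngram_count_sketch(rows, minimum=6)))  # 3
```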
    prune_by_ngram_count_per_work(minimum=None, maximum=None)
        Removes result rows for an n-gram unless at least one witness of at
        least one work has a count for that n-gram within the range
        specified by minimum and maximum.

        That is, if a single witness of a single work has an n-gram count
        that falls within the specified range, all result rows for that
        n-gram are kept.

        Parameters:
            minimum (int) – minimum n-gram count
            maximum (int) – maximum n-gram count

    prune_by_ngram_size(minimum=None, maximum=None)
        Removes result rows whose n-gram size is outside the range specified
        by minimum and maximum.

        Parameters:
            minimum (int) – minimum n-gram size
            maximum (int) – maximum n-gram size

    prune_by_work_count(minimum=None, maximum=None)
        Removes result rows for n-grams that are not attested in a number of
        works within the range specified by minimum and maximum.

        A work here encompasses all of its witnesses, so the same n-gram
        appearing in multiple witnesses of the same work is counted as
        appearing in a single work.

        Parameters:
            minimum (int) – minimum number of works
            maximum (int) – maximum number of works

    reciprocal_remove()
        Removes result rows for which the n-gram is not present in at least
        one text in each labelled set of texts.

    remove_label(label)
        Removes all result rows associated with label.

        Parameters:
            label (str) – label to filter results on

    sort()
        Sorts all result rows.

        Sorts by: size (descending), n-gram, count (descending), label,
        text name, siglum.

    zero_fill(corpus, catalogue)
        Adds rows to the results to ensure that, for every n-gram that is
        attested in at least one witness of a work, every witness of that
        work has a row, with added rows having a count of zero.

        Parameters:
            corpus (Corpus) – corpus containing the texts appearing in the
                results
            catalogue (Catalogue) – catalogue used in the generation of the
                results
tacl.sequence module

Module containing the Sequence and SequenceReport classes.

class tacl.sequence.Sequence(alignment, substitutes, start_index)
    Bases: object

    Class to format supplied sequences using simple HTML span markup.

    start_index

class tacl.sequence.SequenceReport(corpus, tokenizer, results)
    Bases: tacl.report.Report
tacl.statistics_report module

Module containing the StatisticsReport class.

tacl.stripper module

Module containing the Stripper class.

class tacl.stripper.Stripper(input_dir, output_dir)
    Bases: object

    Class used to preprocess a corpus of texts by stripping out all
    material that is not the textual material proper, and generating plain
    text witness files for each witness attested.

    The intention is to keep the stripped text as close in formatting to
    the original as possible, including whitespace.
tacl.tei_corpus module

Module containing the TEICorpus class.

class tacl.tei_corpus.TEICorpus(input_dir, output_dir)
    Bases: object

    A TEICorpus represents a collection of TEI XML documents.

    The CBETA works are TEI XML files that have certain quirks that make
    them difficult to use directly in TACL’s stripping process. This class
    provides a tidy method to deal with these quirks; in particular, it
    consolidates multiple XML files for a single work into one XML file.

    This class must not be instantiated directly; rather, a subclass
    appropriate to the source should be used.

    get_witnesses(source_tree)
        Returns a sorted list of all witnesses of variant readings in
        source_tree, and the elements that bear @wit attributes.

        Parameters:
            source_tree (etree._ElementTree) – XML tree of the source
                document

        Return type: tuple of lists

    xslt = ''
class tacl.tei_corpus.TEICorpusCBETA2011(input_dir, output_dir)
    Bases: tacl.tei_corpus.TEICorpus

    A TEICorpus subclass for source files formatted as per the CBETA 2011
    DVD release (TEI P4).

    work_pattern = re.compile('^(?P<prefix>[A-Z]{1,2})\\d+n(?P<work>[^_\\.]+)_(?P<part>\\d+)$')

    xslt = 'prepare_tei_cbeta_2011.xsl'
class tacl.tei_corpus.TEICorpusCBETAGitHub(input_dir, output_dir)
    Bases: tacl.tei_corpus.TEICorpus

    A TEICorpus subclass for source files formatted as per the CBETA GitHub
    repository at https://github.com/cbeta-org/xml-p5.git (TEI P5).

    get_resps(source_tree)
        Returns a sorted list of all resps in source_tree, and the elements
        that bear @resp attributes.

        Parameters:
            source_tree (etree._ElementTree) – XML tree of the source
                document

        Return type: tuple of lists

    work_pattern = re.compile('^(?P<prefix>[A-Z]{1,2})\\d+n(?P<work>[A-Z]?\\d+)(?P<part>[A-Za-z]?)$')

    xslt = 'prepare_tei_cbeta_github.xsl'
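The two work_pattern attributes quoted above can be exercised directly; the filenames here ('T01n0001_001', 'T01n0001a') are illustrative examples chosen to match the documented patterns.

```python
import re

# The documented patterns, verbatim.
pattern_2011 = re.compile(
    r'^(?P<prefix>[A-Z]{1,2})\d+n(?P<work>[^_\.]+)_(?P<part>\d+)$')
pattern_github = re.compile(
    r'^(?P<prefix>[A-Z]{1,2})\d+n(?P<work>[A-Z]?\d+)(?P<part>[A-Za-z]?)$')

# CBETA 2011 DVD style: underscore-separated numeric part.
m = pattern_2011.match('T01n0001_001')
print(m.group('prefix'), m.group('work'), m.group('part'))  # T 0001 001

# CBETA GitHub style: optional single-letter part suffix.
m = pattern_github.match('T01n0001a')
print(m.group('prefix'), m.group('work'), m.group('part'))  # T 0001 a
```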
tacl.text module

Module containing the Text and WitnessText classes.

class tacl.text.FilteredWitnessText(name, siglum, content, tokenizer)
    Bases: tacl.text.WitnessText

    Class for the text of a witness that supplies only those n-grams that
    contain one of a supplied list of n-grams.

    static get_filter_ngrams_pattern(filter_ngrams)
        Returns a compiled regular expression matching any of the n-grams
        in filter_ngrams.

        Parameters:
            filter_ngrams (list of str) – n-grams to use in the regular
                expression

        Return type: _sre.SRE_Pattern
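A plausible sketch of building such a pattern, assuming a simple re.escape-d alternation (the real method's pattern may differ in detail):

```python
import re

# Sketch: alternation over the escaped n-grams, longest first so that
# overlapping alternatives prefer the longer match.
def get_filter_ngrams_pattern_sketch(filter_ngrams):
    return re.compile('|'.join(
        re.escape(ngram)
        for ngram in sorted(filter_ngrams, key=len, reverse=True)))

pattern = get_filter_ngrams_pattern_sketch(['一二', '三四五'])
print(bool(pattern.search('零一二三')))  # True
print(bool(pattern.search('六七八')))    # False
```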
    get_ngrams(minimum, maximum, filter_ngrams)
        Returns a generator supplying the n-grams (minimum <= n <= maximum)
        for this text.

        Each iteration of the generator supplies a tuple consisting of the
        size of the n-grams and a collections.Counter of the n-grams.

        Parameters:
            minimum (int) – minimum n-gram size
            maximum (int) – maximum n-gram size
            filter_ngrams (list) – n-grams that must be contained by the
                generated n-grams

        Return type: generator
class tacl.text.Text(content, tokenizer)
    Bases: object

    Class for base text functionality (getting tokens, generating n-grams).

    Used for (snippets of) texts that are not witnesses.

    get_ngrams(minimum, maximum, skip_sizes=None)
        Returns a generator supplying the n-grams (minimum <= n <= maximum)
        for this text.

        Each iteration of the generator supplies a tuple consisting of the
        size of the n-grams and a collections.Counter of the n-grams.

        Parameters:
            minimum (int) – minimum n-gram size
            maximum (int) – maximum n-gram size

        Return type: generator
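The shape of the generator's output can be sketched as follows; per-character tokenization is an assumption made for the example, and the code is illustrative rather than tacl's implementation.

```python
from collections import Counter

# Per size, yield a (size, Counter) tuple of n-grams formed from
# consecutive tokens.
def get_ngrams_sketch(tokens, minimum, maximum):
    for size in range(minimum, maximum + 1):
        counter = Counter(
            ''.join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1))
        yield size, counter

tokens = list('一二一二三')  # per-character tokens, for the example
for size, counter in get_ngrams_sketch(tokens, 2, 3):
    print(size, counter.most_common(1))
# 2 [('一二', 2)]
```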
class tacl.text.WitnessText(name, siglum, content, tokenizer)
    Bases: tacl.text.Text

    Class for the text of a witness. A witness has a work name and a
    siglum, and has a corresponding filename.
tacl.tokenizer module

Module containing the Tokenizer class.

class tacl.tokenizer.Tokenizer(pattern, joiner, flags=56)
    Bases: object

    A tokenizer that splits a string using a regular expression.

    Based on the RegexpTokenizer from the Natural Language Toolkit.

    joiner
        The string used to join tokens together when reconstructing a text.

    pattern
        The regular expression pattern used to divide the text into tokens.
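For illustration, the behaviour described above can be sketched with re.findall, assuming (as with NLTK's RegexpTokenizer) that pattern matches the tokens themselves rather than the separators. This is a sketch, not tacl's implementation.

```python
import re

# Minimal regex tokenizer: pattern matches tokens; joiner reassembles them.
class TokenizerSketch:
    def __init__(self, pattern, joiner):
        self.pattern = re.compile(pattern)
        self.joiner = joiner

    def tokenize(self, text):
        return self.pattern.findall(text)

tokenizer = TokenizerSketch(r'\w+', ' ')
tokens = tokenizer.tokenize('one, two; three')
print(tokens)                         # ['one', 'two', 'three']
print(tokenizer.joiner.join(tokens))  # 'one two three'
```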