tacl package

Submodules

tacl.catalogue module

class tacl.catalogue.Catalogue[source]

Bases: dict

generate(path, label)[source]

Creates default data from the corpus at path, marking all works with label.

Parameters:
  • path (str) – path to a corpus directory
  • label (str) – label to categorise each work as
labels()[source]

Returns the distinct labels defined in the catalogue.

Return type:list
load(path)[source]

Loads the data from path into the catalogue.

Parameters:path (str) – path to catalogue file
save(path)[source]

Saves this catalogue’s data to path.

Parameters:path (str) – file path to save catalogue data to
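
A minimal usage sketch (the corpus directory and catalogue file paths are hypothetical):

    from tacl.catalogue import Catalogue

    # Build a catalogue from a corpus directory, labelling every work "base".
    catalogue = Catalogue()
    catalogue.generate('corpus/', 'base')
    catalogue.save('catalogue.txt')

    # Later, reload the catalogue and inspect its distinct labels.
    catalogue = Catalogue()
    catalogue.load('catalogue.txt')
    print(catalogue.labels())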

tacl.colour module

Module containing functions to generate distinct colours.

tacl.colour.generate_colours(n)[source]

Return a list of n distinct colours, each represented as an RGB string suitable for use in CSS.

Based on the code at http://martin.ankerl.com/2009/12/09/how-to-create-random-colors-programmatically/

Parameters:n (int) – number of colours to generate
Return type:list of str
tacl.colour.hsv_to_rgb(h, s, v)[source]

Convert a colour specified in HSV (hue, saturation, value) to an RGB string.

Based on the algorithm at https://en.wikipedia.org/wiki/HSL_and_HSV#Converting_to_RGB

Parameters:
  • h (float) – hue, a value between 0 and 1
  • s (float) – saturation, a value between 0 and 1
  • v (float) – value, a value between 0 and 1
Return type:

str
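
A short sketch of both functions (output formats are as described above):

    from tacl.colour import generate_colours, hsv_to_rgb

    # Five distinct colours, e.g. to colour-code five labels in a report.
    for colour in generate_colours(5):
        print(colour)

    # Pure red, specified in HSV.
    print(hsv_to_rgb(0.0, 1.0, 1.0))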

tacl.constants module

Module containing constants.

tacl.corpus module

Module containing the Corpus class.

class tacl.corpus.Corpus(path, tokenizer)[source]

Bases: object

A Corpus represents a collection of WitnessTexts.

A Corpus is built from a directory that contains the text files that become WitnessText objects.

get_sigla(work)[source]

Returns a list of all of the sigla for work.

Parameters:work (str) – name of work
Return type:list of str
get_witness(work, siglum, text_class=<class 'tacl.text.WitnessText'>)[source]

Returns a WitnessText representing the file associated with work and siglum.

Combined, work and siglum form the basis of a filename for retrieving the text.

Parameters:
  • work (str) – name of work
  • siglum (str) – siglum of witness
  • text_class (class) – class to use for the returned text
Return type:

WitnessText

get_witnesses(name='*')[source]

Returns a generator supplying WitnessText objects for each file in the corpus.

Return type:generator of WitnessText
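
A usage sketch (the directory, work name, and tokenizer settings are illustrative, not package defaults):

    from tacl.corpus import Corpus
    from tacl.tokenizer import Tokenizer

    # A tokenizer that treats each word character as a token.
    tokenizer = Tokenizer(r'\w', '')
    corpus = Corpus('corpus/', tokenizer)

    print(corpus.get_sigla('T0001'))
    for witness in corpus.get_witnesses():
        name, siglum = witness.get_names()
        print(name, siglum)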

tacl.data_store module

Module containing the DataStore class.

class tacl.data_store.DataStore(db_name, use_memory=True, ram=0)[source]

Bases: object

Class representing the data store for text data.

It provides an interface to the underlying database, with methods to add and query data.

add_ngrams(corpus, minimum, maximum, catalogue=None)[source]

Adds n-gram data from corpus to the data store.

Parameters:
  • corpus (Corpus) – corpus of works
  • minimum (int) – minimum n-gram size
  • maximum (int) – maximum n-gram size
  • catalogue (Catalogue) – optional catalogue to limit corpus to
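
A sketch of populating a new data store (paths and tokenizer settings are illustrative):

    from tacl.corpus import Corpus
    from tacl.data_store import DataStore
    from tacl.tokenizer import Tokenizer

    tokenizer = Tokenizer(r'\w', '')
    corpus = Corpus('corpus/', tokenizer)

    # Create an on-disk database and index all 2- and 3-grams.
    store = DataStore('corpus.db', use_memory=False)
    store.add_ngrams(corpus, 2, 3)
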
counts(catalogue, output_fh)[source]

Returns output_fh populated with CSV results giving n-gram counts of the witnesses of the works in catalogue.

Parameters:
  • catalogue (Catalogue) – catalogue matching filenames to labels
  • output_fh (file-like object) – object to output results to
Return type:

file-like object

diff(catalogue, tokenizer, output_fh)[source]

Returns output_fh populated with CSV results giving the n-grams that are unique to the witnesses of each labelled set of works in catalogue.

Note that this is not the same as the symmetric difference of these sets, except in the case where there are only two labels.

Parameters:
  • catalogue (Catalogue) – catalogue matching filenames to labels
  • tokenizer (Tokenizer) – tokenizer for the n-grams
  • output_fh (file-like object) – object to output results to
Return type:

file-like object

diff_asymmetric(catalogue, prime_label, tokenizer, output_fh)[source]

Returns output_fh populated with CSV results giving the difference in n-grams between the witnesses of labelled sets of works in catalogue, limited to those works labelled with prime_label.

Parameters:
  • catalogue (Catalogue) – catalogue matching filenames to labels
  • prime_label (str) – label to limit results to
  • tokenizer (Tokenizer) – tokenizer for the n-grams
  • output_fh (file-like object) – object to output results to
Return type:

file-like object

diff_supplied(results_filenames, labels, tokenizer, output_fh)[source]

Returns output_fh populated with CSV results giving the n-grams that are unique to the witnesses in each set of results supplied via results_filenames, using the labels in labels.

Note that this is not the same as the symmetric difference of these sets, except in the case where there are only two labels.

Parameters:
  • results_filenames (list of str) – list of results filenames to be diffed
  • labels (list) – labels to be applied to the results sets
  • tokenizer (Tokenizer) – tokenizer for the n-grams
  • output_fh (file-like object) – object to output results to
Return type:

file-like object

intersection(catalogue, output_fh)[source]

Returns output_fh populated with CSV results giving the intersection in n-grams of the witnesses of labelled sets of works in catalogue.

Parameters:
  • catalogue (Catalogue) – catalogue matching filenames to labels
  • output_fh (file-like object) – object to output results to
Return type:

file-like object
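
Continuing the add_ngrams sketch above, an intersection query might look like this (file names are illustrative):

    from tacl.catalogue import Catalogue

    catalogue = Catalogue()
    catalogue.load('catalogue.txt')
    with open('intersect_results.csv', 'w', newline='') as output_fh:
        store.intersection(catalogue, output_fh)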

intersection_supplied(results_filenames, labels, output_fh)[source]

Returns output_fh populated with CSV results giving the n-grams that are common to the witnesses in every set of results supplied via results_filenames, using the labels in labels.

Parameters:
  • results_filenames (list of str) – list of results filenames to be intersected
  • labels (list) – labels to be applied to the results sets
  • output_fh (file-like object) – object to output results to
Return type:

file-like object

search(catalogue, ngrams, output_fh)[source]

Returns output_fh populated with CSV results for each witness that contains at least one of the n-grams in ngrams.

Parameters:
  • catalogue (Catalogue) – catalogue matching filenames to labels
  • ngrams (list) – n-grams to search for
  • output_fh (file-like object) – object to write results to
Return type:

file-like object

validate(corpus, catalogue)[source]

Returns True if all of the files labelled in catalogue are up-to-date in the database.

Parameters:
  • corpus (Corpus) – corpus of works
  • catalogue (Catalogue) – catalogue matching filenames to labels
Return type:

bool

tacl.exceptions module

exception tacl.exceptions.MalformedCatalogueError(msg)[source]

Bases: tacl.exceptions.TACLError

exception tacl.exceptions.MalformedQueryError(msg)[source]

Bases: tacl.exceptions.TACLError

exception tacl.exceptions.TACLError(msg)[source]

Bases: Exception

tacl.highlighter module

Module containing the Highlighter class.

class tacl.highlighter.HighlightReport(corpus, tokenizer)[source]

Bases: tacl.report.Report

generate(output_dir, work, *args)[source]
class tacl.highlighter.NgramHighlightReport(corpus, tokenizer)[source]

Bases: tacl.highlighter.HighlightReport

generate(output_dir, work, ngrams, labels, minus_ngrams)[source]

Generates HTML reports for each witness to work, showing its text with the n-grams in ngrams highlighted.

Any highlighting of the n-grams in minus_ngrams (or of n-grams contained within them) is removed.

Parameters:
  • output_dir (str) – directory to write report to
  • work (str) – name of work to highlight
  • ngrams (list of list of str) – groups of n-grams to highlight
  • labels (list of str) – labels for the groups of n-grams
  • minus_ngrams (list of str) – n-grams to remove highlighting from
Return type:

str
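
A sketch (the paths, work name, n-grams, and label are hypothetical):

    from tacl.corpus import Corpus
    from tacl.highlighter import NgramHighlightReport
    from tacl.tokenizer import Tokenizer

    tokenizer = Tokenizer(r'\w', '')
    corpus = Corpus('corpus/', tokenizer)
    report = NgramHighlightReport(corpus, tokenizer)

    # Highlight one group of n-grams, labelled "base"; remove no highlighting.
    report.generate('reports/', 'T0001', [['ABC', 'BCD']], ['base'], [])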

class tacl.highlighter.ResultsHighlightReport(corpus, tokenizer)[source]

Bases: tacl.highlighter.HighlightReport

generate(output_dir, work, matches_filename)[source]

Generates HTML reports showing the text of each witness to work with the matches from matches_filename highlighted.

Parameters:
  • output_dir (str) – directory to write report to
  • work (str) – name of work to highlight
  • matches_filename (str) – file containing matches to highlight
Return type:

str

tacl.jitc module

class tacl.jitc.JitCReport(store, corpus, tokenizer)[source]

Bases: tacl.report.Report

Generates statistics to list the works of one corpus (referred to below as “Maybe” and defined in a catalogue file) in order of similarity to each work in that corpus. It takes into account a second corpus of works (referred to below as “No” and defined in the same catalogue file) that are similar to those in the first, but not in the way(s) that are the subject of the investigation.

Given the two corpora, Maybe and No, the script performs the following actions:

  1. For each work Y in Maybe:
     1. Run an intersection between Y and No.
     2. For each work M in Maybe (excluding Y):
        1. Run an intersection between Y and M.
        2. Run a supplied diff between the results from [1.2.1] and the results from [1.1].
        3. Get the number of tokens in M.
  2. Rank and list the works in Maybe in descending order of the ratio, from [1.2.2], of matching tokens (n-gram size × count) to total tokens [1.2.3].
  3. Concatenate all results from [1.2.2] and present them in an HTML report.

Note that in the above, when a work is treated as Y, its witnesses are not handled separately. The statistics derived from queries involving it treat all of its witnesses together; e.g., if two n-grams in a witness of M are found only in two different witnesses of Y, both are counted as shared.

generate(output_dir, catalogue, maybe_label)[source]
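
A sketch of running the report, assuming a populated data store and a catalogue whose works of interest carry the label passed as maybe_label (all names are illustrative):

    from tacl.catalogue import Catalogue
    from tacl.corpus import Corpus
    from tacl.data_store import DataStore
    from tacl.jitc import JitCReport
    from tacl.tokenizer import Tokenizer

    tokenizer = Tokenizer(r'\w', '')
    corpus = Corpus('corpus/', tokenizer)
    store = DataStore('corpus.db', use_memory=False)
    catalogue = Catalogue()
    catalogue.load('catalogue.txt')

    report = JitCReport(store, corpus, tokenizer)
    report.generate('output/', catalogue, 'Maybe')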

tacl.report module

class tacl.report.Report[source]

Bases: object

Base class for HTML reports.

Subclasses should implement the generate method, used to generate and output the report into the supplied directory. The calling code is responsible for ensuring that the directory exists and is writable.

The intention is that the same Report object can be used to produce multiple reports in a given context, such as a single corpus. The contextual data should be supplied to the __init__ method, and the data for a single specific report passed to the generate method.

generate(output_dir)[source]

Generate the report, writing it to output_dir.

tacl.results module

Module containing the Results class.

class tacl.results.Results(matches, tokenizer)[source]

Bases: object

Class representing a set of n-gram results.

Provides methods for manipulating those results in ways that maintain the same structure (CSV, same field names). Those methods that modify the fields do so in relatively minor ways that often allow for the other methods to still operate on the results.

A method’s modifications to the field names, if any, are specified in that method’s docstring.

add_label_count()[source]

Adds to each result row a count of the number of occurrences of that n-gram across all works within the label.

This count uses the highest witness count for each work.

bifurcated_extend(corpus, max_size)[source]

Replaces the results with those n-grams that contain any of the original n-grams, and that represent points at which an n-gram is a constituent of multiple larger n-grams with a lower label count.

Parameters:
  • corpus (Corpus) – corpus of works to which results belong
  • max_size (int) – maximum size of n-gram results to include
collapse_witnesses()[source]

Groups together witnesses of the same n-gram and work that have the same count, and outputs a single row for each group.

This output replaces the siglum field with a sigla field that provides a space-separated list of the witness sigla. Due to this, it is not necessarily possible to run other Results methods on results that have had their witnesses collapsed.

csv(fh)[source]

Writes the report data to fh in CSV format and returns it.

Parameters:fh (file object) – file to write data to
Return type:file object
extend(corpus)[source]

Adds rows for all longer forms of n-grams in the results that are present in the witnesses.

This works with both diff and intersect results.

Parameters:corpus (Corpus) – corpus of works to which results belong
prune_by_ngram(ngrams)[source]

Removes results rows whose n-gram is in ngrams.

Parameters:ngrams (list of str) – n-grams to remove
prune_by_ngram_count(minimum=None, maximum=None)[source]

Removes results rows whose total n-gram count (across all works bearing this n-gram) is outside the range specified by minimum and maximum.

For each work, the count used as part of the sum across all works is the maximum count across the witnesses of that work.

Parameters:
  • minimum (int) – minimum n-gram count
  • maximum (int) – maximum n-gram count
prune_by_ngram_count_per_work(minimum=None, maximum=None)[source]

Removes results rows if the n-gram count for all works bearing that n-gram is outside the range specified by minimum and maximum.

That is, if a single witness of a single work has an n-gram count that falls within the specified range, all result rows for that n-gram are kept.

Parameters:
  • minimum (int) – minimum n-gram count
  • maximum (int) – maximum n-gram count
prune_by_ngram_size(minimum=None, maximum=None)[source]

Removes results rows whose n-gram size is outside the range specified by minimum and maximum.

Parameters:
  • minimum (int) – minimum n-gram size
  • maximum (int) – maximum n-gram size
prune_by_work_count(minimum=None, maximum=None)[source]

Removes results rows for n-grams that are not attested in a number of works in the range specified by minimum and maximum.

Work here encompasses all witnesses, so that the same n-gram appearing in multiple witnesses of the same work is counted as attesting a single work.

Parameters:
  • minimum (int) – minimum number of works
  • maximum (int) – maximum number of works
reciprocal_remove()[source]

Removes results rows for which the n-gram is not present in at least one text in each labelled set of texts.

reduce()[source]

Removes results rows whose n-grams are contained in larger n-grams.

remove_label(label)[source]

Removes all results rows associated with label.

Parameters:label (str) – label to filter results on
sort()[source]

Sorts all results rows.

Sorts by: size (descending), n-gram, count (descending), label, work name, siglum.

zero_fill(corpus, catalogue)[source]

Adds rows to the results to ensure that, for every n-gram that is attested in at least one witness, every witness of that work has a row, with added rows having a count of zero.

Parameters:
  • corpus (Corpus) – corpus containing the texts appearing in the results
  • catalogue (Catalogue) – catalogue used in the generation of the results
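
A sketch of a typical post-processing pipeline over a results file (file names and tokenizer settings are illustrative; this assumes the results are read in full when the Results object is created):

    from tacl.results import Results
    from tacl.tokenizer import Tokenizer

    tokenizer = Tokenizer(r'\w', '')
    with open('intersect_results.csv', newline='') as matches:
        results = Results(matches, tokenizer)

    results.prune_by_ngram_size(minimum=3)
    results.reduce()
    results.sort()
    with open('pruned_results.csv', 'w', newline='') as fh:
        results.csv(fh)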

tacl.sequence module

Module containing the Sequence and SequenceReport classes.

class tacl.sequence.Sequence(alignment, substitutes, start_index)[source]

Bases: object

Class to format supplied sequences using simple HTML span markup.

render()[source]

Returns a tuple of HTML fragments rendering each element of the sequence.

start_index
class tacl.sequence.SequenceReport(corpus, tokenizer, results)[source]

Bases: tacl.report.Report

generate(output_dir, minimum_size)[source]

Generates sequence reports and writes them to the output directory.

Parameters:
  • output_dir (str) – directory to output reports to
  • minimum_size (int) – minimum size of n-grams to create sequences for

tacl.statistics_report module

Module containing the StatisticsReport class.

class tacl.statistics_report.StatisticsReport(corpus, tokenizer, matches)[source]

Bases: object

csv(fh)[source]

Writes the report data to fh in CSV format and returns it.

Parameters:fh (file object) – file to write data to
Return type:file object
generate_statistics()[source]

Replaces result rows with summary statistics about the results.

These statistics give the filename, total matching tokens, percentage of matching tokens and label for each witness in the results.
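
A sketch (paths and tokenizer settings are illustrative):

    from tacl.corpus import Corpus
    from tacl.statistics_report import StatisticsReport
    from tacl.tokenizer import Tokenizer

    tokenizer = Tokenizer(r'\w', '')
    corpus = Corpus('corpus/', tokenizer)
    with open('intersect_results.csv', newline='') as matches:
        report = StatisticsReport(corpus, tokenizer, matches)

    report.generate_statistics()
    with open('stats.csv', 'w', newline='') as fh:
        report.csv(fh)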

tacl.stripper module

Module containing the Stripper class.

class tacl.stripper.Stripper(input_dir, output_dir)[source]

Bases: object

Class used for preprocessing a corpus of texts by stripping out all material that is not part of the text proper, and generating a plain text file for each attested witness.

The intention is to keep the stripped text as close in formatting to the original as possible, including whitespace.

get_witnesses(source_tree)[source]

Returns a list of all witnesses of variant readings in source_tree along with their XML ids.

Parameters:source_tree (etree._ElementTree) – XML tree of source document
Return type:list of tuple
strip_file(filename)[source]
strip_files()[source]
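
A sketch (directory names are hypothetical; the input is expected to be prepared TEI XML, e.g. the output of TEICorpus.tidy below):

    from tacl.stripper import Stripper

    stripper = Stripper('tidied/', 'stripped/')
    stripper.strip_files()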

tacl.tei_corpus module

Module containing the TEICorpus class.

class tacl.tei_corpus.TEICorpus(input_dir, output_dir)[source]

Bases: object

A TEICorpus represents a collection of TEI XML documents.

The CBETA works are TEI XML documents with certain quirks that make them difficult to use directly in TACL’s stripping process. This class provides a tidy method to deal with these quirks; in particular, it consolidates multiple XML files for a single work into one XML file.

This class must not be instantiated directly; rather a subclass appropriate to the source should be used.

get_witnesses(source_tree)[source]

Returns a sorted list of all witnesses of variant readings in source_tree, and the elements that bear @wit attributes.

Parameters:source_tree (etree._ElementTree) – XML tree of source document
Return type:tuple of lists
tidy()[source]
xslt = ''
class tacl.tei_corpus.TEICorpusCBETA2011(input_dir, output_dir)[source]

Bases: tacl.tei_corpus.TEICorpus

A TEICorpus subclass where the source files are formatted as per the CBETA 2011 DVD release (TEI P4).

work_pattern = re.compile('^(?P<prefix>[A-Z]{1,2})\\d+n(?P<work>[^_\\.]+)_(?P<part>\\d+)$')
xslt = 'prepare_tei_cbeta_2011.xsl'
class tacl.tei_corpus.TEICorpusCBETAGitHub(input_dir, output_dir)[source]

Bases: tacl.tei_corpus.TEICorpus

A TEICorpus subclass where the source files are formatted as per the CBETA GitHub repository at https://github.com/cbeta-org/xml-p5.git (TEI P5).

get_resps(source_tree)[source]

Returns a sorted list of all resps in source_tree, and the elements that bear @resp attributes.

Parameters:source_tree (etree._ElementTree) – XML tree of source document
Return type:tuple of lists
work_pattern = re.compile('^(?P<prefix>[A-Z]{1,2})\\d+n(?P<work>[A-Z]?\\d+)(?P<part>[A-Za-z]?)$')
xslt = 'prepare_tei_cbeta_github.xsl'
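
A sketch of tidying a local copy of the CBETA GitHub corpus (directory names are hypothetical):

    from tacl.tei_corpus import TEICorpusCBETAGitHub

    corpus = TEICorpusCBETAGitHub('xml-p5/', 'tidied/')
    corpus.tidy()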

tacl.text module

Module containing the Text and WitnessText classes.

class tacl.text.FilteredWitnessText(name, siglum, content, tokenizer)[source]

Bases: tacl.text.WitnessText

Class for the text of a witness that supplies only those n-grams that contain one of a supplied list of n-grams.

static get_filter_ngrams_pattern(filter_ngrams)[source]

Returns a compiled regular expression matching on any of the n-grams in filter_ngrams.

Parameters:filter_ngrams (list of str) – n-grams to use in regular expression
Return type:_sre.SRE_Pattern
get_ngrams(minimum, maximum, filter_ngrams)[source]

Returns a generator supplying the n-grams (minimum <= n <= maximum) for this text.

Each iteration of the generator supplies a tuple consisting of the size of the n-grams and a collections.Counter of the n-grams.

Parameters:
  • minimum (int) – minimum n-gram size
  • maximum (int) – maximum n-gram size
  • filter_ngrams (list) – n-grams that must be contained by the generated n-grams
Return type:

generator

class tacl.text.Text(content, tokenizer)[source]

Bases: object

Class for base text functionality (getting tokens, generating n-grams).

Used for (snippets of) texts that are not witnesses.

get_content()[source]

Returns the content of this text.

Return type:str
get_ngrams(minimum, maximum, skip_sizes=None)[source]

Returns a generator supplying the n-grams (minimum <= n <= maximum) for this text.

Each iteration of the generator supplies a tuple consisting of the size of the n-grams and a collections.Counter of the n-grams.

Parameters:
  • minimum (int) – minimum n-gram size
  • maximum (int) – maximum n-gram size
  • skip_sizes (list of int) – sizes of n-grams to skip generating
Return type:

generator

get_token_content()[source]

Returns a string of the tokens in this text joined using the tokenizer's joiner string.

Return type:str
get_tokens()[source]

Returns a list of tokens in this text.

Return type:list of str
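
A sketch using a text snippet (the tokenizer settings are illustrative):

    from tacl.text import Text
    from tacl.tokenizer import Tokenizer

    # One token per word character.
    tokenizer = Tokenizer(r'\w', '')
    text = Text('abcde', tokenizer)

    print(text.get_tokens())
    for size, counter in text.get_ngrams(2, 3):
        # counter is a collections.Counter of the n-grams of this size.
        print(size, counter.most_common(3))
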
class tacl.text.WitnessText(name, siglum, content, tokenizer)[source]

Bases: tacl.text.Text

Class for the text of a witness. A witness has a work name and a siglum, and has a corresponding filename.

static assemble_filename(name, siglum)[source]
get_checksum()[source]

Returns the checksum for the content of this text.

Return type:str
get_filename()[source]

Returns the filename of this text.

Return type:str
get_names()[source]

Returns the name and siglum of this text.

Return type:tuple

tacl.tokenizer module

Module containing the Tokenizer class.

class tacl.tokenizer.Tokenizer(pattern, joiner, flags=56)[source]

Bases: object

A tokenizer that splits a string using a regular expression.

Based on the RegexpTokenizer from the Natural Language Toolkit.

joiner

The string used to join tokens together when reconstructing a text.

pattern

The regular expression pattern used to divide the text into tokens.

tokenize(text)[source]

Returns all tokens in text.

Parameters:text (str) – text to be tokenized
Return type:list of str
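
A sketch (the pattern and joiner are illustrative, not package defaults):

    from tacl.tokenizer import Tokenizer

    # Treat each run of word characters as a token; join tokens with a space.
    tokenizer = Tokenizer(r'\w+', ' ')
    tokens = tokenizer.tokenize('an example text')
    print(tokens)                         # expected: ['an', 'example', 'text']
    print(tokenizer.joiner.join(tokens))  # reconstructs 'an example text'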

Module contents