usage: tacl highlight [-h] [-v] [-m NGRAMS] (-n NGRAMS | -r RESULTS)
[-l LABEL] [-t {cbeta,pagel}]
CORPUS BASE_NAME OUTPUT
Output an HTML report for each witness to a work, showing the text of that
witness with supplied n-grams visually highlighted.
positional arguments:
CORPUS Path to corpus.
BASE_NAME Name of work to display.
OUTPUT Directory to output reports to.
optional arguments:
-h, --help show this help message and exit
-v, --verbose Display debug information; multiple -v options
increase the verbosity. (default: None)
-m NGRAMS, --minus-ngrams NGRAMS
Path to file containing n-grams (one per line) to
remove highlighting from. This applies only when -n is
used. (default: None)
-n NGRAMS, --ngrams NGRAMS
Path to file containing n-grams (one per line) to
highlight. This option may be specified multiple
times; the n-grams in each file will be displayed in a
distinct colour. (default: None)
-r RESULTS, --results RESULTS
Path to CSV results; creates heatmap highlighting.
(default: None)
-l LABEL, --label LABEL
Label used to identify the n-grams from a file
specified by -n/--ngrams. This option may be specified
multiple times, and must be provided as many times as the
-n/--ngrams option. (default: None)
-t {cbeta,pagel}, --tokenizer {cbeta,pagel}
Type of tokenizer to use. The "cbeta" tokenizer is
suitable for the Chinese CBETA corpus (tokens are
single characters or workaround clusters within square
brackets). The "pagel" tokenizer is for use with the
transliterated Tibetan corpus (tokens are sets of word
characters plus some punctuation used to transliterate
characters). (default: cbeta)
There are two possible outputs available, depending on whether the -n or -r
option is specified.
If n-grams are supplied via the -n/--ngrams option, the resulting HTML reports
show the specified work's witness texts with those n-grams highlighted. Any
n-grams that are specified via the -m/--minus-ngrams option will have their
constituent tokens unhighlighted. The -n/--ngrams option may be specified
multiple times; each file's n-grams will be highlighted in a distinct colour.
The -l/--label option can be used with -n/--ngrams in order to provide labels
for groups of n-grams. There must be as many instances of -l/--label as there
are of -n/--ngrams. The order of the labels matches the order of the n-grams
files.
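Because labels are matched to n-gram files purely by position, it can help to
keep each file and its label together when building the command. The following
is a minimal sketch only: highlight_cmd is a hypothetical helper (not part of
tacl), and the FILE=LABEL pair syntax is an assumption made for illustration.
It prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of tacl): pair each n-gram file with its
# label and print the resulting "tacl highlight" command. Nothing is run.
# Usage: highlight_cmd CORPUS BASE_NAME OUTPUT FILE=LABEL [FILE=LABEL ...]
highlight_cmd() {
    corpus=$1
    base_name=$2
    output=$3
    shift 3
    ngram_opts=""
    label_opts=""
    for pair in "$@"; do
        file=${pair%%=*}    # text before the first "="
        label=${pair#*=}    # text after the first "="
        # Appending in the same loop keeps the -n and -l order aligned.
        ngram_opts="$ngram_opts -n $file"
        label_opts="$label_opts -l '$label'"
    done
    echo "tacl highlight$ngram_opts$label_opts $corpus $base_name $output"
}

highlight_cmd corpus/stripped/ T0474 report_dir \
    Dhr_markers.csv=Dharmaraksa "ZQ_markers.csv=Zhi Qian"
```

Since every pair contributes exactly one -n and one -l, the two options are
always given the same number of times and in the same order.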
If results are supplied via the -r/--results option, the resulting HTML
reports contain an interactive heatmap of the results, allowing the user to
select which witnesses' matches should be highlighted in the text. Multiple
selections are possible, and the colour of the highlight of a token reflects
how many witnesses have matches containing that token.
examples:
tacl highlight -r intersect.csv corpus/stripped/ T0001 report_dir
tacl highlight -n author_markers.csv corpus/stripped/ T0001 report_dir
tacl highlight -n Dhr_markers.csv -n ZQ_markers.csv corpus/stripped/ -l Dharmaraksa -l "Zhi Qian" T0474 report_dir
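None of the examples above uses -m/--minus-ngrams. The sketch below shows one
way it might be combined with -n. The file names, the n-grams themselves, and
the use of a temporary directory are all made up for illustration, and the
n-grams are assumed to be written as unbroken character sequences for the
cbeta tokenizer. The command is only run if tacl and the corpus directory are
actually present.

```shell
#!/bin/sh
# Sketch: write n-gram files (one n-gram per line), then highlight the
# n-grams in markers.txt while removing highlighting from those in
# common.txt. All file names and n-grams here are illustrative only.
workdir=$(mktemp -d)
printf '%s\n' '如是我聞' '一時佛在' > "$workdir/markers.txt"
printf '%s\n' '一時佛在' > "$workdir/common.txt"

if command -v tacl >/dev/null 2>&1 && [ -d corpus/stripped ]; then
    # Assumption: corpus/stripped/ contains the work T0001.
    tacl highlight -n "$workdir/markers.txt" -m "$workdir/common.txt" \
        corpus/stripped/ T0001 "$workdir/report"
else
    echo "tacl or corpus not found; n-gram files written to $workdir"
fi
```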