usage: tacl results [-h] [-v] [-b CORPUS] [--max-be-count COUNT]
[-c CATALOGUE] [-e CORPUS] [--min-count COUNT]
[--max-count COUNT] [--min-count-work COUNT]
[--max-count-work COUNT] [--min-size SIZE]
[--max-size SIZE] [--min-works COUNT] [--max-works COUNT]
[--ngrams NGRAMS] [--reciprocal] [--reduce]
[--remove LABEL] [--sort] [-t {cbeta,pagel}] [-z CORPUS]
RESULTS
Modify a query results file by adding, removing or otherwise manipulating
result rows. Outputs the new set of results.
positional arguments:
RESULTS Path to CSV results; use - for stdin.
optional arguments:
-h, --help show this help message and exit
-v, --verbose Display debug information; multiple -v options
increase the verbosity. (default: None)
-c CATALOGUE, --catalogue CATALOGUE
Path to the catalogue file used to generate the
results (default: None)
-e CORPUS, --extend CORPUS
Extend the results to list the largest n-grams that
also count as matches, going beyond the maximum size
recorded in the database. This has no effect if the
results contain only 1-grams. (default: None)
--min-count COUNT Minimum total count per n-gram to include. (default:
None)
--max-count COUNT Maximum total count per n-gram to include. (default:
None)
--min-count-work COUNT
Minimum count per n-gram per work to include; if a
single witness meets this criterion for an n-gram, all
instances of that n-gram are kept. (default: None)
--max-count-work COUNT
Maximum count per n-gram per work to include; if a
single witness meets this criterion for an n-gram, all
instances of that n-gram are kept. (default: None)
--min-size SIZE Minimum size of n-grams to include. (default: None)
--max-size SIZE Maximum size of n-grams to include. (default: None)
--min-works COUNT Minimum count of works containing n-gram to include.
(default: None)
--max-works COUNT Maximum count of works containing n-gram to include.
(default: None)
--ngrams NGRAMS Path to file containing n-grams (one per line) to
exclude. (default: None)
--reciprocal Remove n-grams that are not attested by at least one
work in each labelled set of works. This can be useful
after reducing a set of intersection results.
(default: False)
--reduce Remove n-grams that are contained in larger n-grams.
(default: False)
--remove LABEL Remove labelled results. (default: None)
--sort Sort the results. (default: False)
-t {cbeta,pagel}, --tokenizer {cbeta,pagel}
Type of tokenizer to use. The "cbeta" tokenizer is
suitable for the Chinese CBETA corpus (tokens are
single characters or workaround clusters within square
brackets). The "pagel" tokenizer is for use with the
transliterated Tibetan corpus (tokens are sets of word
characters plus some punctuation used to transliterate
characters). (default: cbeta)
-z CORPUS, --zero-fill CORPUS
Add rows with a count of 0 for each n-gram in each
witness of a work that has at least one witness
bearing that n-gram. The catalogue used to generate
the results must also be specified with the -c option.
(default: None)
bifurcated extend:
-b CORPUS, --bifurcated-extend CORPUS
Extend results to bifurcation points. Generates
results containing those n-grams, derived from the
original n-grams, that have a label count higher than
their containing (n+1)-grams, or that have a label
count of one and the constituent (n-1)-grams have a
higher label count. (default: None)
--max-be-count COUNT Maximum size of n-gram to extend to (default: None)
If more than one modifier is specified, they are applied in the following
order: --extend, --bifurcated-extend, --reduce, --reciprocal, --zero-fill,
--ngrams, --min/max-works, --min/max-size, --min/max-count,
--min/max-count-work, --remove, --sort.
Take care when using --reduce. Coupled with --max-size, it may discard many
results without trace (since the reduce occurs first). Note too that
performing --reduce on a set of results more than once will make the results
inaccurate!
--extend applies before --reduce because it may generate results that are also
amenable to reduction.
--extend applies before --remove because it needs at least two labels in the
results in order to give correct results.
--min-count and --max-count set the range within which the total count of each
n-gram, across all works, must fall. For each work, its count is taken as the
highest count among its witnesses.
--min-works and --max-works count works rather than witnesses.
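The two counting rules above can be illustrated with a minimal sketch. It is
not tacl's own code: the column names (ngram, work, siglum, count) and the
choice to ignore zero-count rows when counting works are assumptions made for
this example, and the function names are hypothetical.

```python
from collections import defaultdict


def total_counts(rows):
    """Return {ngram: total}, taking each work's count as the highest
    count among its witnesses."""
    per_work = defaultdict(int)  # (ngram, work) -> highest witness count
    for row in rows:
        key = (row["ngram"], row["work"])
        per_work[key] = max(per_work[key], int(row["count"]))
    totals = defaultdict(int)
    for (ngram, _work), count in per_work.items():
        totals[ngram] += count
    return dict(totals)


def work_counts(rows):
    """Return {ngram: number of distinct works bearing that n-gram}."""
    works = defaultdict(set)
    for row in rows:
        # Assumption: a zero count (e.g. a zero-fill row) does not count.
        if int(row["count"]) > 0:
            works[row["ngram"]].add(row["work"])
    return {ngram: len(found) for ngram, found in works.items()}


# Hypothetical rows: work T0001 has two witnesses, work T0002 has one.
rows = [
    {"ngram": "ab", "work": "T0001", "siglum": "base", "count": "3"},
    {"ngram": "ab", "work": "T0001", "siglum": "alt", "count": "5"},
    {"ngram": "ab", "work": "T0002", "siglum": "base", "count": "2"},
]
print(total_counts(rows))  # {'ab': 7}: max(3, 5) for T0001 plus 2 for T0002
print(work_counts(rows))   # {'ab': 2}: three witnesses, but only two works
```

Under these assumptions, --min-count 8 would drop "ab" while --min-works 2
would keep it.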
If both --min-count-work and --max-count-work are specified, only those
n-grams are kept that have at least one witness whose count falls within that
range.
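As a rough sketch of the per-work count rule (again not tacl's own code), the
filter below keeps every row of an n-gram as soon as one witness count falls
within the range. The column names, the function name and the inclusive
bounds are assumptions made for this example.

```python
def prune_by_count_work(rows, minimum=None, maximum=None):
    """Keep all rows of each n-gram that has at least one witness whose
    count lies within [minimum, maximum] (bounds assumed inclusive)."""
    def in_range(count):
        if minimum is not None and count < minimum:
            return False
        if maximum is not None and count > maximum:
            return False
        return True

    # An n-gram qualifies if any single witness meets the criterion.
    keep = {row["ngram"] for row in rows if in_range(int(row["count"]))}
    return [row for row in rows if row["ngram"] in keep]


# Hypothetical rows: "ab" has one witness in range, "cd" has none.
sample = [
    {"ngram": "ab", "work": "T0001", "siglum": "base", "count": "1"},
    {"ngram": "ab", "work": "T0001", "siglum": "alt", "count": "6"},
    {"ngram": "cd", "work": "T0002", "siglum": "base", "count": "9"},
]
kept = prune_by_count_work(sample, minimum=5, maximum=7)
print([row["ngram"] for row in kept])  # ['ab', 'ab']
```

Both "ab" rows survive, including the witness with a count of 1, because a
single witness within the range is enough to keep all instances of the n-gram.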
Since this command always outputs a valid results file, its output can be used
as input for a subsequent tacl results command. To chain commands together
without creating an intermediate file, pipe the commands together and use -
instead of a filename, as:
tacl results --reciprocal results.csv | tacl results --reduce -
examples:
Extend CBETA results and set a minimum total count.
tacl results -e corpus/cbeta/ --min-count 9 output.csv > mod-output.csv
Zero-fill CBETA results.
tacl results -c dhr-vs-rest.txt -z corpus/cbeta/ output.csv > mod-output.csv
Reduce Pagel results.
tacl results --reduce -t pagel output.csv > mod-output.csv
Due to encoding issues, you may need to set the environment variable
PYTHONIOENCODING to "utf-8".
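When driving tacl from a Python script rather than a shell, the piping and
the PYTHONIOENCODING setting described above might be combined as in the
following sketch. The helper name run_pipeline is hypothetical; it uses only
the standard library and accepts any two commands given as argument lists.

```python
import os
import subprocess


def run_pipeline(first, second):
    """Run `first | second` with PYTHONIOENCODING=utf-8 set for both
    commands, and return the second command's output as text."""
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    p1 = subprocess.Popen(first, stdout=subprocess.PIPE, env=env)
    p2 = subprocess.Popen(second, stdin=p1.stdout,
                          stdout=subprocess.PIPE, env=env)
    p1.stdout.close()  # so the first command sees a closed pipe if p2 exits
    output, _ = p2.communicate()
    p1.wait()
    return output.decode("utf-8")


# Example use, assuming tacl is installed and results.csv exists:
# text = run_pipeline(
#     ["tacl", "results", "--reciprocal", "results.csv"],
#     ["tacl", "results", "--reduce", "-"],
# )
```

The second command reads the first command's results from standard input
through the - argument, so no intermediate file is written.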