usage: tacl diff [-h] [-a LABEL] [-v] [-m] [-r RAM] [-t {cbeta,pagel}]
                 DATABASE CORPUS CATALOGUE

List n-grams unique to each sub-corpus (as defined by the labels in the
specified catalogue file).

positional arguments:
  DATABASE              Path to database file.
  CORPUS                Path to corpus.
  CATALOGUE             Path to catalogue file.

optional arguments:
  -h, --help            show this help message and exit
  -a LABEL, --asymmetric LABEL
                        Label of sub-corpus to restrict results to. (default:
                        None)
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -m, --memory          Use RAM for temporary database storage.
                        This may cause an out of memory error, in which case
                        run the command without this switch. (default: False)
  -r RAM, --ram RAM     Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,pagel}, --tokenizer {cbeta,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)
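
As a rough illustration of the "cbeta" tokenizer described above, the
following Python sketch splits text into single characters and bracketed
clusters, then builds n-grams from the resulting tokens. The regular
expression and the names CBETA_TOKEN, tokenize_cbeta and ngrams are
assumptions made for this example; they are not taken from tacl itself.

```python
import re

# Assumed pattern (not tacl's own): either a cluster enclosed in square
# brackets, such as "[木*南]", or a single word character.
CBETA_TOKEN = re.compile(r"\[[^\]]*\]|\w")


def tokenize_cbeta(text):
    """Return the tokens in text; punctuation and whitespace are skipped."""
    return CBETA_TOKEN.findall(text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize_cbeta("如是[木*南]聞")
bigrams = ngrams(tokens, 2)
```

Here the bracketed cluster counts as one token, so the sample text yields
four tokens and three 2-grams.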

Many of the n-grams that are distinct to each sub-corpus are uninteresting:
if a 2-gram is distinct, then so is every larger n-gram that contains that
2-gram. Therefore the results output by this command are filtered to keep
only the most distinctive n-grams, according to the following rules (which
apply within the context of a given witness):

* If an n-gram is not composed of any (n-1)-grams found in the
  results, it is kept.

* If both of the (n-1)-grams that comprise an n-gram are found in
  the results, that n-gram is kept.

* Otherwise, the n-gram is removed from the results.
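
The rules above can be sketched in Python as follows. This is a minimal
illustration, not tacl's implementation: it assumes that the results for one
witness are available as a collection of token tuples, and that "found in the
results" refers to that unfiltered collection. The function name
filter_distinct_ngrams is hypothetical.

```python
def filter_distinct_ngrams(ngrams):
    """Filter the distinct n-grams of a single witness.

    ngrams is an iterable of token tuples (an assumed input format).
    """
    results = set(ngrams)
    kept = set()
    for gram in results:
        if len(gram) < 2:
            # A 1-gram has no constituent (n-1)-grams, so it is kept.
            kept.add(gram)
            continue
        # The two (n-1)-grams that make up an n-gram: drop the last
        # token, or drop the first token.
        parts = (gram[:-1], gram[1:])
        found = sum(part in results for part in parts)
        # Keep when neither part or both parts are in the results;
        # remove when exactly one part is.
        if found != 1:
            kept.add(gram)
    return kept


sample = [
    ("a", "b"),
    ("b", "c"),
    ("a", "b", "c"),  # both parts present: kept
    ("b", "c", "d"),  # only ("b", "c") present: removed
    ("x", "y", "z"),  # neither part present: kept
]
filtered = filter_distinct_ngrams(sample)
```

In this sample only ("b", "c", "d") is dropped, because exactly one of its
two constituent 2-grams appears in the results.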

examples:

  Make a diff query against a CBETA corpus.
    tacl diff cbeta2-10.db corpus/cbeta/ dhr-vs-rest.txt > output.csv

  Make an asymmetrical diff query against a CBETA corpus.
    tacl diff -a Dhr cbeta2-10.db corpus/cbeta/ dhr-vs-rest.txt > output.csv

  Make a diff query against a Pagel corpus.
    tacl diff -t pagel pagel1-7.db corpus/pagel/ by-author.txt > output.csv

Due to encoding issues, you may need to set the environment variable
PYTHONIOENCODING to "utf-8" before running the command (for example,
export PYTHONIOENCODING=utf-8 in a POSIX shell).