tacl results

usage: tacl results [-h] [-v] [-b CORPUS] [--max-be-count COUNT]
                    [-c CATALOGUE] [-e CORPUS] [--min-count COUNT]
                    [--max-count COUNT] [--min-count-work COUNT]
                    [--max-count-work COUNT] [--min-size SIZE]
                    [--max-size SIZE] [--min-works COUNT] [--max-works COUNT]
                    [--ngrams NGRAMS] [--reciprocal] [--reduce]
                    [--remove LABEL] [--sort] [-t {cbeta,pagel}] [-z CORPUS]
                    RESULTS

Modify a query results file by adding, removing or otherwise manipulating
result rows. Outputs the new set of results.

positional arguments:
  RESULTS               Path to CSV results; use - for stdin.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -c CATALOGUE, --catalogue CATALOGUE
                        Path to the catalogue file used to generate the
                        results. (default: None)
  -e CORPUS, --extend CORPUS
                        Extend the results to list the largest n-grams that
                        also count as matches, going beyond the maximum size
                        recorded in the database. This has no effect if the
                        results contain only 1-grams. (default: None)
  --min-count COUNT     Minimum total count per n-gram to include. (default:
                        None)
  --max-count COUNT     Maximum total count per n-gram to include. (default:
                        None)
  --min-count-work COUNT
                        Minimum count per n-gram per work to include; if a
                        single witness meets this criterion for an n-gram, all
                        instances of that n-gram are kept. (default: None)
  --max-count-work COUNT
                        Maximum count per n-gram per work to include; if a
                        single witness meets this criterion for an n-gram, all
                        instances of that n-gram are kept. (default: None)
  --min-size SIZE       Minimum size of n-grams to include. (default: None)
  --max-size SIZE       Maximum size of n-grams to include. (default: None)
  --min-works COUNT     Minimum count of works containing n-gram to include.
                        (default: None)
  --max-works COUNT     Maximum count of works containing n-gram to include.
                        (default: None)
  --ngrams NGRAMS       Path to file containing n-grams (one per line) to
                        exclude. (default: None)
  --reciprocal          Remove n-grams that are not attested by at least one
                        work in each labelled set of works. This can be useful
                        after reducing a set of intersection results.
                        (default: False)
  --reduce              Remove n-grams that are contained in larger n-grams.
                        (default: False)
  --remove LABEL        Remove labelled results. (default: None)
  --sort                Sort the results. (default: False)
  -t {cbeta,pagel}, --tokenizer {cbeta,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)
  -z CORPUS, --zero-fill CORPUS
                        Add rows with a count of 0 for each n-gram in each
                        witness of a work that has at least one witness
                        bearing that n-gram. The catalogue used to generate
                        the results must also be specified with the -c option.
                        (default: None)

bifurcated extend:
  -b CORPUS, --bifurcated-extend CORPUS
                        Extend results to bifurcation points. Generates
                        results containing those n-grams, derived from the
                        original n-grams, that have a label count higher than
                        their containing (n+1)-grams, or that have a label
                        count of one and the constituent (n-1)-grams have a
                        higher label count. (default: None)
  --max-be-count COUNT  Maximum size of n-gram to extend to. (default: None)

If more than one modifier is specified, they are applied in the following
order: --extend, --bifurcated-extend, --reduce, --reciprocal, --zero-fill,
--ngrams, --min/max-works, --min/max-size, --min/max-count,
--min/max-count-work, --remove, --sort.
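
The fixed ordering can be pictured with a short Python sketch. It is
illustrative only and not part of tacl: STAGES and application_order are
hypothetical names, and flags that the text groups as "min/max" are treated
here as a single stage because no order is given within a pair.

```python
# Stages in the documented order. Flags within one stage are listed together
# because the help text does not order them relative to each other.
STAGES = [
    ("--extend",),
    ("--bifurcated-extend",),
    ("--reduce",),
    ("--reciprocal",),
    ("--zero-fill",),
    ("--ngrams",),
    ("--min-works", "--max-works"),
    ("--min-size", "--max-size"),
    ("--min-count", "--max-count"),
    ("--min-count-work", "--max-count-work"),
    ("--remove",),
    ("--sort",),
]


def application_order(flags):
    """Return the given flags rearranged into the documented stage order."""
    given = set(flags)
    return [flag for stage in STAGES for flag in stage if flag in given]


# The order on the command line does not matter: --reduce comes first.
print(application_order(["--max-size", "--sort", "--reduce"]))
```

Here --reduce is placed ahead of --max-size however the options are written
on the command line.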

Use --reduce with care. Coupled with --max-size, it may discard many results
without trace (since the reduce occurs first). Note too that performing
"reduce" on a set of results more than once will make the results inaccurate!

--extend applies before --reduce because it may generate results that are also
amenable to reduction.

--extend applies before --remove because it depends on there being at least
two labels in the results in order to give correct results.

--min-count and --max-count set the range within which the total count of each
n-gram, across all works, must fall. For each work, its count is taken as the
highest count among its witnesses.
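
As a rough illustration of that rule, the Python sketch below computes a
total count per n-gram from in-memory rows. It is not tacl's implementation:
the row keys ("ngram", "work", "siglum", "count") and the function names are
assumptions made for the example, and the bounds are assumed to be inclusive.

```python
from collections import defaultdict


def total_counts(rows):
    """Map each n-gram to its total count across all works.

    A work contributes the highest count found among its witnesses.
    """
    per_work = defaultdict(int)
    for row in rows:
        key = (row["ngram"], row["work"])
        per_work[key] = max(per_work[key], int(row["count"]))
    totals = defaultdict(int)
    for (ngram, _work), count in per_work.items():
        totals[ngram] += count
    return dict(totals)


def filter_by_total_count(rows, min_count=None, max_count=None):
    """Keep rows whose n-gram total lies in the (assumed inclusive) range."""
    totals = total_counts(rows)

    def keep(ngram):
        total = totals[ngram]
        if min_count is not None and total < min_count:
            return False
        if max_count is not None and total > max_count:
            return False
        return True

    return [row for row in rows if keep(row["ngram"])]


# Hypothetical sample rows: work T1 has two witnesses, a and b.
rows = [
    {"ngram": "AB", "work": "T1", "siglum": "a", "count": "3"},
    {"ngram": "AB", "work": "T1", "siglum": "b", "count": "5"},
    {"ngram": "AB", "work": "T2", "siglum": "a", "count": "2"},
    {"ngram": "CD", "work": "T1", "siglum": "a", "count": "1"},
]
# AB: max(3, 5) for T1 plus 2 for T2; CD: 1.
print(total_counts(rows))
```

With these sample rows, a minimum of 2 keeps all three "AB" rows and drops
the "CD" row.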

--min-works and --max-works count works rather than witnesses.

If both --min-count-work and --max-count-work are specified, only those
n-grams are kept that have at least one witness whose count falls within that
range.
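
A minimal Python sketch of this behaviour follows. It is an illustration
rather than tacl's own code: the row keys and the function name
filter_by_witness_count are hypothetical, and the bounds are assumed to be
inclusive.

```python
def filter_by_witness_count(rows, min_count=None, max_count=None):
    """Keep every row of an n-gram if any one witness count is in range."""

    def in_range(count):
        if min_count is not None and count < min_count:
            return False
        if max_count is not None and count > max_count:
            return False
        return True

    # N-grams for which at least one witness satisfies the range.
    qualifying = {row["ngram"] for row in rows if in_range(int(row["count"]))}
    # All instances of a qualifying n-gram are kept, including witnesses
    # whose own count falls outside the range.
    return [row for row in rows if row["ngram"] in qualifying]


# Hypothetical sample rows.
sample = [
    {"ngram": "AB", "work": "T1", "siglum": "a", "count": "1"},
    {"ngram": "AB", "work": "T1", "siglum": "b", "count": "4"},
    {"ngram": "CD", "work": "T2", "siglum": "a", "count": "9"},
]
kept = filter_by_witness_count(sample, min_count=3, max_count=5)
print([(row["ngram"], row["siglum"]) for row in kept])
```

In this sample only witness b of "AB" has a count between 3 and 5, yet both
"AB" rows are kept; "CD" is dropped because its only witness has a count
of 9.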

Since this command always outputs a valid results file, its output can be used
as input for a subsequent tacl results command. To chain commands together
without creating an intermediate file, pipe the commands together and use -
instead of a filename, as:

    tacl results --reciprocal results.csv | tacl results --reduce -

examples:

  Extend CBETA results and set a minimum total count.
    tacl results -e corpus/cbeta/ --min-count 9 output.csv > mod-output.csv

  Zero-fill CBETA results.
    tacl results -c dhr-vs-rest.txt -z corpus/cbeta/ output.csv > mod-output.csv

  Reduce Pagel results.
    tacl results --reduce -t pagel output.csv > mod-output.csv

Due to encoding issues, you may need to set the environment variable
PYTHONIOENCODING to "utf-8".