tacl ngrams

usage: tacl ngrams [-h] [-v] [-c CATALOGUE] [-m] [-r RAM] [-t {cbeta,pagel}]
                   DATABASE CORPUS MINIMUM MAXIMUM

Generate n-grams from a corpus.

positional arguments:
  DATABASE              Path to database file.
  CORPUS                Path to corpus.
  MINIMUM               Minimum size of n-gram to generate (integer).
  MAXIMUM               Maximum size of n-gram to generate (integer).

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -c CATALOGUE, --catalogue CATALOGUE
                        Path to a catalogue file used to restrict which works
                        in the corpus are added. (default: None)
  -m, --memory          Use RAM for temporary database storage.
                        
                        This may cause an out of memory error, in which case
                        run the command without this switch. (default: False)
  -r RAM, --ram RAM     Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,pagel}, --tokenizer {cbeta,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)

This command can be safely interrupted and subsequently rerun; witnesses that
have already had their n-grams added will be skipped.

If new witnesses need to be added after a database was generated, this command
can be run again. However, n-grams from these new witnesses will be added much
more slowly than they would be to a new database, because of the existing
indices.

If a witness has changed since a database was generated, this command will not
update the database. In this case, generate a new database or manipulate the
existing database directly to remove the witness and its associated n-grams.

examples:

  Create a database of 2 to 10-grams from a CBETA corpus.
    tacl ngrams cbeta2-10.db corpus/cbeta/ 2 10

  Create a database of 1 to 7-grams from a Pagel corpus.
    tacl ngrams pagel1-7.db corpus/pagel/ 1 7

  Create a database of 1 to 7-grams from a subset of the CBETA corpus.
    tacl ngrams -c dhr-texts.txt cbeta-dhr1-7.db corpus/cbeta/ 1 7
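
To make the MINIMUM and MAXIMUM arguments and the "cbeta" tokenizer description
more concrete, here is a minimal Python sketch of what generating n-grams of a
range of sizes could look like. It is an illustration only, not tacl's
implementation: the regular expression is an assumption written from the help
text above (a token is a bracketed cluster or a single word character), and the
function names and the choice to join tokens with no separator are hypothetical.

```python
import re

# Assumed pattern, derived from the help text: a token is either a
# "workaround cluster" in square brackets or a single word character.
# This is not taken from tacl's source.
CBETA_PATTERN = re.compile(r"\[[^\]]*\]|\w")


def tokenize(text, pattern=CBETA_PATTERN):
    """Return the list of tokens in text; whitespace and punctuation are skipped."""
    return pattern.findall(text)


def ngrams(tokens, minimum, maximum):
    """Yield every n-gram of each size from minimum to maximum, inclusive."""
    for size in range(minimum, maximum + 1):
        # A sequence of k tokens contains k - size + 1 n-grams of this size.
        for start in range(len(tokens) - size + 1):
            # Joining with an empty string is an assumption made for display.
            yield "".join(tokens[start:start + size])


if __name__ == "__main__":
    tokens = tokenize("諸比丘[口*梨]言")
    print(tokens)
    # Equivalent in spirit to MINIMUM=2, MAXIMUM=3 on the command line.
    for ngram in ngrams(tokens, 2, 3):
        print(ngram)
```

Each n-gram size is generated over the whole token sequence, so the number of
n-grams grows with every additional size in the range. This is one reason to
choose MINIMUM and MAXIMUM no wider than the analysis requires.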