usage: tacl ngrams [-h] [-v] [-c CATALOGUE] [-m] [-r RAM] [-t {cbeta,pagel}]
                   DATABASE CORPUS MINIMUM MAXIMUM

Generate n-grams from a corpus.

positional arguments:
  DATABASE              Path to database file.
  CORPUS                Path to corpus.
  MINIMUM               Minimum size of n-gram to generate (integer).
  MAXIMUM               Maximum size of n-gram to generate (integer).
optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -c CATALOGUE, --catalogue CATALOGUE
                        Path to a catalogue file used to restrict which works
                        in the corpus are added (default: None)
  -m, --memory          Use RAM for temporary database storage.
                        This may cause an out of memory error, in which case
                        run the command without this switch. (default: False)
  -r RAM, --ram RAM     Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,pagel}, --tokenizer {cbeta,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)
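As a rough illustration of what MINIMUM, MAXIMUM and the "cbeta" tokenizer
description amount to, the following Python sketch tokenizes a string and
counts every n-gram in the requested size range. It is not tacl's own code:
the regular expression is only an approximation of the rule stated above
(a token is a bracketed cluster or a single character), and the names
CBETA_TOKEN, tokenize and count_ngrams are hypothetical.

```python
import re
from collections import Counter

# Assumption: approximates the "cbeta" rule from the help text. A token is
# either a cluster in square brackets or one non-whitespace character.
CBETA_TOKEN = re.compile(r"\[[^\]]*\]|\S")


def tokenize(text):
    """Split text into tokens according to the approximate rule above."""
    return CBETA_TOKEN.findall(text)


def count_ngrams(tokens, minimum, maximum):
    """Count every run of `size` consecutive tokens, minimum <= size <= maximum."""
    counts = Counter()
    for size in range(minimum, maximum + 1):
        for start in range(len(tokens) - size + 1):
            # Tokens are joined without a separator, as suits single characters.
            counts["".join(tokens[start:start + size])] += 1
    return counts


tokens = tokenize("ab[c*d]ab")
counts = count_ngrams(tokens, 1, 2)
```

Here the bracketed cluster counts as one token, so the 2-grams include
"b[c*d]" and "[c*d]a". A sketch of the "pagel" tokenizer would need a different
pattern, which the help text does not specify precisely enough to reproduce.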
This command can be safely interrupted and subsequently rerun; witnesses that
have already had their n-grams added will be skipped.

If new witnesses need to be added after a database was generated, this command
can be run again. However, n-grams from these new witnesses will be added much
more slowly than to a new database, due to the existing indices.

If a witness has changed since a database was generated, this command will not
update the database. In this case, generate a new database or manipulate the
existing database directly to remove the witness and its associated n-grams.
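Removing a changed witness by hand might look something like the Python sketch
below. It rests on assumptions that the help text does not confirm: that the
database is an SQLite file, and that it has tables named Text (columns id,
work, siglum) and TextNGram (column text referring to Text.id). These names
and the function remove_witness are hypothetical; inspect the real schema,
which may hold witness data in further tables, and back up the file before
trying anything similar. The demonstration runs against a throwaway in-memory
database, not a real tacl database.

```python
import sqlite3


def remove_witness(conn, work, siglum):
    """Delete one witness and its n-gram rows; return how many n-gram rows went."""
    row = conn.execute(
        "SELECT id FROM Text WHERE work = ? AND siglum = ?", (work, siglum)
    ).fetchone()
    if row is None:
        return 0  # no such witness, nothing to do
    with conn:  # commits on success, rolls back if an exception is raised
        removed = conn.execute(
            "DELETE FROM TextNGram WHERE text = ?", (row[0],)
        ).rowcount
        conn.execute("DELETE FROM Text WHERE id = ?", (row[0],))
    return removed


# Throwaway database using the ASSUMED (hypothetical) schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Text (id INTEGER PRIMARY KEY, work TEXT, siglum TEXT);
    CREATE TABLE TextNGram (text INTEGER, ngram TEXT, size INTEGER, count INTEGER);
    INSERT INTO Text VALUES (1, 'T0001', 'base'), (2, 'T0002', 'base');
    INSERT INTO TextNGram VALUES (1, 'ab', 2, 3), (1, 'bc', 2, 1), (2, 'ab', 2, 5);
""")
removed = remove_witness(conn, "T0001", "base")
```

Once the witness is removed, rerunning the ngrams command should treat it as
new and add its n-grams again, since only witnesses already in the database
are skipped.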
examples:

  Create a database of 2 to 10-grams from a CBETA corpus.
    tacl ngrams cbeta2-10.db corpus/cbeta/ 2 10

  Create a database of 1 to 7-grams from a Pagel corpus.
    tacl ngrams pagel1-7.db corpus/pagel/ 1 7

  Create a database of 1 to 7-grams from a subset of the CBETA corpus.
    tacl ngrams -c dhr-texts.txt cbeta-dhr1-7.db corpus/cbeta/ 1 7