CWB.CL: The Low-level Interface

The Cython module CWB.CL is modeled after the Perl low-level interface for CQP, CWB::CL, and allows direct access to the attributes of a corpus.

CQP organizes access to a corpus in terms of corpus positions, which number all tokens in the corpus from 0 to max_cpos. Sentences and texts are modeled as sequences of non-overlapping spans over these corpus positions.
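As a rough illustration of this data model, the following plain-Python sketch numbers a toy token list and represents sentences as (first, last) spans. It does not use CWB.CL: the token list, the spans and the helper spans_are_valid are invented for the example, and treating the end position as inclusive is an assumption.

```python
# Toy model of corpus positions and spans; not part of CWB.CL.
tokens = ["The", "cat", "sleeps", ".", "It", "purrs", "."]
max_cpos = len(tokens) - 1  # corpus positions run from 0 to max_cpos

# Sentences as (first, last) spans; inclusive end positions are an assumption.
sentences = [(0, 3), (4, 6)]


def spans_are_valid(spans, n_tokens):
    """Check that spans are ordered, non-overlapping and inside the corpus."""
    previous_last = -1
    for first, last in spans:
        if not (previous_last < first <= last < n_tokens):
            return False
        previous_last = last
    return True


print(spans_are_valid(sentences, len(tokens)))
```

Two spans that share a corpus position, such as (0, 3) and (3, 6), are rejected by this check.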

class CWB.CL.Corpus

The Corpus class represents one CQP corpus and its (positional and span) attributes.

__init__(self, cname, encoding="ISO-8859-15", registry_dir=None)

Loads the corpus named cname. Passing a non-None value for the registry_dir parameter makes it possible to use corpus description files in a location other than the default /usr/local/share/cwb/registry.

attribute(self, name, atype)

Retrieves the corpus attribute named name, which can either be a positional attribute (atype='p') with information for each token, or a structural attribute (atype='s') with spans of tokens (which can optionally be labeled). Positional attributes usually include word forms, POS tags and such, whereas structural attributes are typically used to represent sentence or text boundaries.

Depending on the type of attribute, this returns either a PosAttrib, an AttStruc or an AlignAttrib object.

class CWB.CL.PosAttrib

This class represents a positional attribute. It can be accessed like a sequence of strings: len(attr) returns the length of the corpus in tokens, and attr[idx] returns the attribute value for the idx-th token.
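Because only len(attr) and attr[idx] are needed, code written against this sequence behavior can be tried out on an ordinary list. The helper context_window below is a hypothetical example, not part of CWB.CL; in real use, attr would be the object returned by Corpus.attribute for a positional attribute.

```python
def context_window(attr, cpos, width=2):
    """Return the values from cpos - width to cpos + width, clipped to the corpus."""
    start = max(0, cpos - width)
    end = min(len(attr), cpos + width + 1)
    return [attr[i] for i in range(start, end)]


# A plain list stands in for a positional attribute here.
words = ["The", "cat", "sleeps", ".", "It", "purrs", "."]
print(context_window(words, 1))
```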

getName(self)

returns the name of the attribute

getDictionary(self)

returns the AttrDictionary object related to this attribute, which contains the string-to-number mapping and frequency information.

get_encoding(self)

returns a string describing the encoding of the corpus (based on the information in the CQP registry)

to_unicode(self, s)

if s is a raw string, it will be decoded to a Unicode object

__getitem__(self, offset)

returns the attribute value at position offset as a string.

cpos2id(self, offset)

gives the number-coded attribute value at position offset.

find(self, tag)

returns an IDList with the occurrence positions of the value tag. Raises a KeyError if the value is not present in the corpus at all.

find_list(self, tags)

returns an IDList with the positions of all tokens that have any of the attribute values in tags.

find_pattern(self, pat, flags=0)

Matches the regular expression pat against the attribute values and returns an IDList with the positions of all matching tokens.
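One way to picture a pattern lookup: the expression is tested once against each distinct value, and the positions of the matching values are collected. The sketch below is plain Python built on the standard re module and says nothing about how CWB.CL implements find_pattern. The name toy_find_pattern is hypothetical, requiring the whole value to match (re.fullmatch) is an assumption, and the flags are Python re flags, which need not correspond to the flags CWB.CL accepts.

```python
import re


def toy_find_pattern(values, pat, flags=0):
    """Return the sorted positions whose value matches pat as a whole."""
    regex = re.compile(pat, flags)
    # Test each distinct value once, then collect the positions.
    matching = {v for v in set(values) if regex.fullmatch(v)}
    return [cpos for cpos, v in enumerate(values) if v in matching]


pos_tags = ["DT", "NN", "VBZ", ".", "PRP", "VBZ", "."]
verbs = toy_find_pattern(pos_tags, r"VB.*")
print(verbs)
```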

frequency(self, tag)

returns the frequency of the attribute value tag in the corpus.
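The relation between cpos2id, find, find_list and frequency can be pictured with a small inverted index. ToyPosAttrib below is a plain-Python stand-in invented for illustration, not the CWB.CL class: it returns ordinary lists instead of IDList objects, skips unknown values in find_list, and raises KeyError from frequency for unknown values, all of which are assumptions.

```python
from collections import defaultdict


class ToyPosAttrib:
    """Plain-Python stand-in for a positional attribute; not the CWB.CL class."""

    def __init__(self, values):
        self.values = list(values)
        self.ids = {}                       # value -> numeric ID
        self.positions = defaultdict(list)  # numeric ID -> sorted positions
        for cpos, value in enumerate(self.values):
            vid = self.ids.setdefault(value, len(self.ids))
            self.positions[vid].append(cpos)

    def cpos2id(self, offset):
        """Numeric ID of the value at a corpus position."""
        return self.ids[self.values[offset]]

    def find(self, tag):
        """Positions of one value; KeyError if the value never occurs."""
        return list(self.positions[self.ids[tag]])

    def find_list(self, tags):
        """Positions carrying any of the values (unknown values are skipped)."""
        found = set()
        for tag in tags:
            if tag in self.ids:
                found.update(self.positions[self.ids[tag]])
        return sorted(found)

    def frequency(self, tag):
        """Number of occurrences of one value."""
        return len(self.find(tag))


attr = ToyPosAttrib(["the", "cat", "saw", "the", "dog"])
print(attr.find("the"), attr.frequency("the"))
```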

__len__(self)

returns the size of the attribute (i.e., the size of the corpus in tokens).

class CWB.CL.AttStruc

Represents a structural attribute. Such an attribute behaves like a sequence of tuples: either pairs of (first,last) positions or, for attributes with a string value, triples of (first,last,val).

getName(self)

returns the name of the attribute

find_all(self, tags)

For structural attributes with a string value, returns an IDList with the indices of all structures whose string values match tags.

find_pos(self, offset)

returns the start/end tuple for the structure spanning the corpus position offset.

__getitem__(self, idx)

returns the start/end tuple for the idx-th structure
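Combined with the sequence behavior of a positional attribute, the start/end tuple is enough to recover the tokens of one structure. The helper structure_tokens is a hypothetical example that only relies on indexing, so plain lists stand in for the two attributes; treating the end position as inclusive is an assumption.

```python
def structure_tokens(words, strucs, idx):
    """Return the tokens inside the idx-th structure."""
    # Slice to two elements so that (first, last, val) triples work as well.
    first, last = strucs[idx][:2]
    # The end position is treated as inclusive; this is an assumption.
    return [words[cpos] for cpos in range(first, last + 1)]


# Plain lists stand in for a positional and a structural attribute.
words = ["The", "cat", "sleeps", ".", "It", "purrs", "."]
sentences = [(0, 3), (4, 6)]
print(structure_tokens(words, sentences, 1))
```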

cpos2struc(self, offset)

returns the structure number for the structure spanning the corpus position offset (e.g., matches a word position to its sentence number).

map_idlist(self, IDList lst not None)

maps an IDList with corpus positions to an IDList with the corresponding structure offsets, removing duplicates.
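The mapping from corpus positions to structure numbers can be sketched with a binary search over the span start positions. The functions toy_cpos2struc and toy_map_idlist are plain-Python models invented for illustration, not the CWB.CL implementation; returning None for uncovered positions and silently dropping them from the mapped list are assumptions.

```python
from bisect import bisect_right


def toy_cpos2struc(spans, cpos):
    """Return the index of the span covering cpos, or None if there is none."""
    starts = [first for first, _last in spans]
    idx = bisect_right(starts, cpos) - 1
    if idx >= 0 and spans[idx][1] >= cpos:
        return idx
    return None


def toy_map_idlist(spans, positions):
    """Map sorted corpus positions to sorted, duplicate-free span indices."""
    result = []
    for cpos in positions:
        idx = toy_cpos2struc(spans, cpos)
        # Positions outside every span are dropped; this is an assumption.
        if idx is not None and (not result or result[-1] != idx):
            result.append(idx)
    return result


spans = [(0, 3), (4, 6), (9, 11)]
print(toy_map_idlist(spans, [1, 2, 5, 7, 10]))
```

Because both the positions and the spans are sorted, duplicates can only be adjacent, so comparing with the last emitted index is sufficient.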

__len__(self)

returns the size of the attribute (here: the number of annotated spans in the corpus).

class CWB.CL.AlignAttrib

For aligned parallel corpora, an alignment attribute contains spans (a1,a2,b1,b2) that correspond to an alignment between positions a1..a2 of the source corpus with positions b1..b2 of the aligned corpus.

getName(self)

returns the name of the attribute

cpos2alg(self, cpos)

finds the aligned span that corresponds to this corpus position. Raises a KeyError if the corpus position is unaligned.
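The lookup can be modeled in plain Python as a search over the source-side ranges. The function toy_cpos2alg is a hypothetical stand-in, not the CWB.CL method: whether the real method returns the tuple itself or an index into the attribute is not stated here, and the sketch returns the tuple. Inclusive end positions are also an assumption.

```python
from bisect import bisect_right


def toy_cpos2alg(alignments, cpos):
    """Return the (a1, a2, b1, b2) tuple whose source range covers cpos."""
    starts = [a1 for a1, _a2, _b1, _b2 in alignments]
    idx = bisect_right(starts, cpos) - 1
    if idx >= 0 and alignments[idx][1] >= cpos:
        return alignments[idx]
    # Unaligned corpus positions raise KeyError, as described above.
    raise KeyError(cpos)


# Toy alignment: source positions 7..9 are left unaligned.
alignments = [(0, 3, 0, 4), (4, 6, 5, 9), (10, 12, 10, 11)]
print(toy_cpos2alg(alignments, 5))
```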

__len__(self)

returns the size of the attribute (here: the number of aligned spans in the corpus).

class CWB.CL.IDList

An IDList corresponds to a set of corpus positions, or a set of structure indices. An IDList behaves like a sorted sequence of numbers (i.e., lst[1] yields the second position, and len(lst) yields the size of the set). Boolean operations, such as lst1+lst2 to get the union of the corpus positions, or lst1&lst2 to get the intersection of corpus positions, are supported.
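Since the positions are kept sorted, union and intersection can both be computed in a single merge pass. The functions below are a plain-Python sketch of that idea on ordinary lists; they are illustrative only and are not how the lst1+lst2 and lst1&lst2 operators of IDList are implemented.

```python
def sorted_union(a, b):
    """Merge two sorted, duplicate-free lists into their sorted union."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        elif a[i] > b[j]:
            result.append(b[j])
            j += 1
        else:
            result.append(a[i])
            i += 1
            j += 1
    # At most one of the two tails is non-empty.
    return result + a[i:] + b[j:]


def sorted_intersection(a, b):
    """Return the sorted elements present in both sorted lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            result.append(a[i])
            i += 1
            j += 1
    return result


print(sorted_union([1, 4, 7], [2, 4, 9]), sorted_intersection([1, 4, 7], [2, 4, 9]))
```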

join(self, other, offset)

returns the intersection of this IDList with other, shifted by offset. This can be used to find sequences of one word following another, or of one sentence containing a match for X and the next containing a match for Y.
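Under one reading of this description, the result contains every position p of the first list for which p + offset occurs in the second list. The sketch below models that reading on plain lists. The name toy_join is hypothetical, and reporting the result in terms of the first list's positions is an assumption rather than documented behavior.

```python
def toy_join(first, second, offset):
    """Return the positions p in first for which p + offset occurs in second."""
    shifted = {p - offset for p in second}
    return [p for p in first if p in shifted]


# Toy positions of two words; with offset 1 the join finds adjacent pairs.
the_positions = [0, 5, 9]
cat_positions = [1, 10, 14]
print(toy_join(the_positions, cat_positions, 1))
```

With an offset of 1, position 5 is dropped because position 6 does not occur in the second list.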

class CWB.CL.AttrDictionary

An AttrDictionary corresponds to the set of values that an attribute can take. It is useful to retrieve IDs or frequencies for the possible values of that attribute.

__len__(self)

returns the number of possible values. This and the __getitem__ method make it possible to iterate over the attribute dictionary.

get_word(self, n)

returns the attribute value corresponding to the numeric ID n.
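Together with __len__, get_word allows all values to be listed by numeric ID. The sketch below assumes that IDs run from 0 to len(dictionary) - 1. The helper dictionary_values and the FakeDictionary stand-in are invented so that the example runs without a corpus; in real use, the object would come from PosAttrib.getDictionary.

```python
def dictionary_values(dictionary):
    """Yield (numeric ID, value) pairs for every entry in the dictionary."""
    # Assumes IDs are the integers 0 .. len(dictionary) - 1.
    for n in range(len(dictionary)):
        yield n, dictionary.get_word(n)


class FakeDictionary:
    """Minimal stand-in offering only __len__ and get_word."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def get_word(self, n):
        return self._values[n]


pairs = list(dictionary_values(FakeDictionary(["the", "cat", "dog"])))
print(pairs)
```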

get_matching(self, pat)

returns an IDList containing the numerical IDs of the values matching the pattern pat.

expand_pattern(self, pat, flags=0)

returns a list of strings for values matching the pattern.
