The pytextseg package provides functions to wrap plain text: fill() and wrap() are Unicode-aware alternatives to those of the standard textwrap module; fold() and unfold() are functions mainly aimed at plain-text messages such as e-mail.
It also provides lower level interfaces for text segmentation: LineBreak class for line breaking; GCStr class for grapheme cluster segmentation.
If you are impatient, see “Functions”.
fold(string[, method, options...]) -> unicode
Fold lines of the string string to fit in lines of no more than width columns, and return the result.
The following options may be specified for the method argument.
Surplus SPACEs and horizontal tabs at the end of each line are removed, newline sequences are replaced by the one specified by the optional newline argument, and a newline is appended at the end of the text if it is missing. Horizontal tabs are treated as tab stops according to the tabsize argument.
charset or language is used to determine the language/region context: East Asian or not.
For other named arguments see instance attributes of LineBreak class.
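The clean-up steps described for fold() can be approximated with the standard library alone. The following is a minimal sketch, not the code pytextseg uses: the helper name normalize_line_ends is hypothetical, tabs are expanded with str.expandtabs(), and no actual folding to width is performed.

```python
import re

def normalize_line_ends(text, newline="\n", tabsize=8):
    """Hypothetical helper: mimic the clean-up described for fold()."""
    # Split on the common newline sequences only.
    lines = re.split(r"\r\n|\r|\n", text)
    # A trailing newline produces an empty last item; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    # Expand tabs to tab stops, then strip trailing SPACEs and tabs.
    cleaned = [line.expandtabs(tabsize).rstrip(" \t") for line in lines]
    # Terminate every line, including the last, with the chosen newline.
    return "".join(line + newline for line in cleaned)

result = normalize_line_ends("a \t\r\nb\tc")
```

Here the trailing SPACE and tab after "a" are dropped, the CRLF is replaced by "\n", and a final newline is appended.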
unfold(text[, method]) -> unicode
Join the folded paragraphs of the string text and return the result. The following options may be specified for the method argument.
fill(text[, options...]) -> unicode
Reformat the single paragraph in text to fit in lines of no more than width columns, and return a new string containing the entire wrapped paragraph. Optional named arguments will be passed to wrap function.
wrap(text[, options...]) -> [unicode]
Wrap paragraphs of a text then return a list of wrapped lines.
Reformat each paragraph in text so that it fits in lines of no more than width columns if possible, and return a list of wrapped lines. By default, tabs in text are expanded and all other whitespace characters (including newline) are converted to space.
See textwrap about options.
Note
Some options have no effect in this module: fix_sentence_endings, break_on_hyphens, drop_whitespace.
For other named arguments see instance attributes of LineBreak class.
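Since fill() and wrap() are described as alternatives to their textwrap counterparts, the relationship between the two can be illustrated with the standard module itself. This sketch uses only textwrap; it assumes, but does not verify, that the pytextseg functions relate to each other in the same way. Note that textwrap counts code points rather than columns.

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog."

# wrap() returns a list of lines, each at most `width` characters long.
lines = textwrap.wrap(text, width=20)

# fill() returns one string: the wrapped lines joined by newlines.
filled = textwrap.fill(text, width=20)
```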
GCStr class treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 ([UAX29]).
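To see why grapheme clusters matter, compare code point count with a rough cluster count. The sketch below is a simplified approximation using only the standard library: the helper simple_clusters is hypothetical and merely attaches combining marks to the preceding character, whereas the full rules of [UAX29] also cover Hangul jamo, CR LF pairs and other cases.

```python
import unicodedata

def simple_clusters(text):
    """Hypothetical helper: group each base character with its combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Nonzero combining class: attach to the previous cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301a"                  # "e" + COMBINING ACUTE ACCENT + "a"
code_points = len(s)            # counts Unicode characters
clusters = simple_clusters(s)   # counts user-perceived characters
```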
GCStr(string[, lb]) -> GCStr
Create new grapheme cluster string (GCStr object) from Unicode string string.
Optional LineBreak object lb controls breaking features. Following attributes of LineBreak object affect new GCStr object.
S.center(width[, fillchar]) -> GCStr
Return S centered in a string of width columns. Padding is done using the specified fill character (default is a space).
S.endswith(suffix[, start[, end]]) -> bool
Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.
S.expandtabs([tabsize]) -> GCStr
Return a copy of S where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 columns is assumed.
S.join(iterable) -> GCStr
Return a grapheme cluster string which is the concatenation of the strings in the iterable. The separator between elements is S.
S.ljust(width[, fillchar]) -> GCStr
Return S left-justified in a grapheme cluster string of width columns. Padding is done using the specified fill character (default is a space).
S.rjust(width[, fillchar]) -> GCStr
Return S right-justified in a string of width columns. Padding is done using the specified fill character (default is a space).
S.splitlines([keepends]) -> [GCStr]
Return a list of the lines in S, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.
Note
U+001C, U+001D and U+001E are not included in linebreak characters.
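For comparison, the built-in str.splitlines() does treat U+001C, U+001D and U+001E as line boundaries. The sketch below shows that behavior next to a hypothetical stand-in that splits on CR, LF and CR LF only; the stand-in is an assumption for illustration and does not reproduce the exact set of boundaries GCStr uses.

```python
import re

text = "a\x1cb\nc"

# Built-in str: U+001C (FILE SEPARATOR) counts as a line boundary.
str_lines = text.splitlines()

# Hypothetical stand-in: split on CR LF, CR and LF only.
basic_lines = re.split(r"\r\n|\r|\n", text)
```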
S.startswith(prefix[, start[, end]]) -> bool
Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.
Deprecated since version 0.1.0: See “Methods not Supported”.
String Operations
Most operations on string objects are also available on GCStr objects.
Operation | Result | Notes |
---|---|---|
x in s | True if s contains a grapheme cluster x, else False | (1) |
x not in s | False if s contains a grapheme cluster x, else True | (1) |
s + t | the concatenation of s and t | (2) (3) |
s * n, n * s | n copies of s concatenated | (3) |
s[i] | ith grapheme cluster of s, origin 0 | |
s[i:j] | slice of s from i to j | |
s[i:j:k] | slice of s from i to j with step k | |
len(s) | number of grapheme clusters s contains | (4) |
min(s) | smallest grapheme cluster of s | |
max(s) | largest grapheme cluster of s | |
s < t | strictly less than | (5) |
s <= t | less than or equal | (5) |
s > t | strictly greater than | (5) |
s >= t | greater than or equal | (5) |
s == t | equal | (5) |
s != t | not equal | (5) |
str(s), unicode(s) | string representation of the object; unicode() is used on Python 2.x | |
Notes:
A GCStr object cannot be an operand of re regular expression operations.
Methods not Supported
Some string methods are not supported since they break grapheme cluster boundaries. Instead, use methods of stringified objects. For example:
# For Python 3
result = gcs * 0 + str(gcs).translate(table)
# For Python 2
result = gcs * 0 + unicode(gcs).translate(table)
gcs * 0 + ... is a convenient way to recalculate grapheme clusters.
Instance Attributes
These attributes are read-only.
Number of Unicode characters grapheme cluster string includes, i.e. length as Unicode string.
Total number of columns of the grapheme clusters, as defined by the built-in character database. For more details see the documentation of the LineBreak class.
Line breaking class of the first character of first grapheme cluster.
Line breaking class of last grapheme extender of last grapheme cluster. If there are no grapheme extenders or its class is CM, value of last grapheme base will be returned.
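The difference between the number of Unicode characters and the number of columns can be sketched with the standard library. This is an illustrative approximation, not pytextseg's sizing method: the helper columns is hypothetical, it counts wide and fullwidth characters as two columns, combining marks as zero and everything else as one, and it ignores the East Asian context.

```python
import unicodedata

def columns(text):
    """Hypothetical helper: rough column count of a string."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        # East_Asian_Width W (wide) and F (fullwidth) take two columns.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

ascii_cols = columns("abc")
cjk_cols = columns("日本語")
accent_cols = columns("e\u0301")
```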
The LineBreak class performs the Line Breaking Algorithm described in Unicode Standard Annex #14 ([UAX14]). The East_Asian_Width informative properties defined by Annex #11 ([UAX11]) are taken into account when determining breaking positions.
LineBreak([options...]) -> LineBreak
Create a new LineBreak object. Optional named arguments may specify initial attribute values. See the documentation of the instance attributes. Initial defaults are:
break_indent=False, charmax=998, eastasian_context=False, eaw=None, format="SIMPLE", hangul_as_al=False, lbc=None, legacy_cm=True, minwidth=0, newline="\n", prep=[None], sizing="UAX11", urgent=None, virama_as_joiner=True, width=70
S.rule(before, after) -> int
Get possible line breaking behavior between strings before and after. Returned value is one of:
The following instance attributes of the LineBreak object S affect the result.
Note
This method gives only an approximate description of line breaking behavior. Use the wrap method or other functions to fold actual texts.
S.wrap(text) -> [GCStr]
Break the Unicode string text and return a list of the resulting lines. Each item of the list is a grapheme cluster string (GCStr object).
Class Attributes
Dictionary containing default values of instance attributes.
Four values to specify line breaking behaviors: Mandatory break; Both direct break and indirect break are allowed; Indirect break is allowed but direct break is prohibited; Prohibited break.
Instance Attributes
For the default values of these attributes, see __init__().
Always allows break after SPACEs at beginning of line, a.k.a. indent. [UAX14] does not take account of such usage of SPACE.
Possible maximum number of characters in one line, not counting trailing SPACEs and newline sequence. Note that number of characters generally doesn’t represent length of line. 0 means unlimited.
Performs heuristic breaking on South East Asian complex context. If word segmentation for South East Asian writing systems is not enabled, this does not have any effect.
Enable East Asian language/region context. If it is true, characters assigned to line breaking class AI will be treated as ideographic characters (ID) and East_Asian_Width A (ambiguous) will be treated as F (fullwidth). Otherwise, they are treated as alphabetic characters (AL) and N (neutral), respectively.
Tailor the classification of the East_Asian_Width property defined by [UAX11]. The value may be a dictionary whose keys are Unicode strings or UCS scalar values and whose values are any of the East_Asian_Width properties (see the documentation of the textseg.Consts module). If None is specified, all tailoring assigned before will be canceled. By default, no tailoring is in effect. See also “Tailoring Character Properties”.
Specify the method to format broken lines.
Treat hangul syllables and conjoining jamo as alphabetic characters (AL).
Tailor the classification of the line breaking property defined by [UAX14]. The value may be a dictionary whose keys are Unicode strings or UCS scalar values and whose values are any of the line breaking classes (see the Consts module). If None is specified, all tailoring assigned before will be canceled. By default, no tailoring is in effect. See also “Tailoring Character Properties”.
Treat combining characters led by a SPACE as an isolated combining character (ID). As of Unicode 5.0, such use of SPACE is not recommended.
Minimum number of columns that an arbitrarily broken line may include, not counting trailing spaces and newline sequences.
Unicode string to be used for newline sequence. It may be None.
Add user-defined line breaking behavior(s). Value shall be list of items described below.
Specify method to calculate size of string. Following options are available.
See also eaw attribute.
Specify the method to handle excessively long lines. The following options are available.
Virama sign (“halant” in Hindi, “coeng” in Khmer) and its succeeding letter are not broken. “Default” grapheme cluster defined by [UAX29] does not contain this feature.
Maximum number of columns line may include not counting trailing spaces and newline sequence. In other words, recommended maximum length of line.
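The interplay between the East Asian context and the East_Asian_Width tailoring described above can be sketched as follows. This is a minimal illustration built on the standard unicodedata module, not the package's implementation: the helper resolve_eaw and its tailoring parameter are hypothetical, and the property values are plain strings rather than the index constants of the Consts module.

```python
import unicodedata

def resolve_eaw(ch, eastasian_context=False, tailoring=None):
    """Hypothetical helper: East_Asian_Width value of ch after resolution."""
    # A tailoring dictionary overrides the built-in classification.
    if tailoring and ch in tailoring:
        return tailoring[ch]
    prop = unicodedata.east_asian_width(ch)
    # Ambiguous (A) resolves to fullwidth in East Asian context, else neutral.
    if prop == "A":
        return "F" if eastasian_context else "N"
    return prop

alpha = "\u03b1"  # GREEK SMALL LETTER ALPHA, East_Asian_Width A
default_width = resolve_eaw(alpha)
east_asian_width = resolve_eaw(alpha, eastasian_context=True)
tailored_width = resolve_eaw("a", tailoring={"a": "W"})
```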
Constants for textseg package.
Index values to specify the six East_Asian_Width properties defined by [UAX11], and eawZ to specify nonspacing.
Note
Property value Z is non-standard.
Index values to specify the 39 line breaking properties (classes) defined by [UAX14].
Note
Property value CP was introduced by Unicode 5.2.0. Property value HL and CJ were introduced by Unicode 6.1.0.
Flag to determine whether word segmentation for South East Asian writing systems is enabled. If this feature is enabled, it is set to a non-empty string. Otherwise, it is set to None.
Note
Current release supports Thai script of modern Thai language only.
A string specifying the version of the Unicode Standard this module refers to.
See also “Tailoring Character Properties”.