Package Contents¶

textseg module¶

The pytextseg package provides functions to wrap plain text: fill() and wrap() are Unicode-aware alternatives to those of the textwrap standard module; fold() and unfold() are intended mainly for plain text messages such as e-mail.

It also provides lower level interfaces for text segmentation: LineBreak class for line breaking; GCStr class for grapheme cluster segmentation.

If you are impatient, see “Functions”.

Functions¶

textseg.fold(string, method='plain', tabsize=8, charset=None, language=None, **kwds)[source]¶

fold(string[, method, options...]) -> unicode

Fold lines of string string to fit in lines of no more than width columns, and return it.

The following options may be specified for the method argument.

"fixed": Lines preceded by “>” won’t be folded. Paragraphs are separated by an empty line.
"flowed": “Format=Flowed; DelSp=Yes” formatting defined by RFC 3676.
"plain": Default method. All lines are folded.

Surplus SPACEs and horizontal tabs at the end of each line are removed, newline sequences are replaced by the one specified by the optional newline argument, and a newline is appended at the end of the text if it does not already end with one. Horizontal tabs are treated as tab stops according to the tabsize argument.

charset or language is used to determine language/region context: East Asian or not.

For other named arguments see instance attributes of LineBreak class.
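
As a rough illustration of the "fixed" method described above, the sketch below uses only the standard textwrap module, not textseg itself. The helper name fold_fixed_sketch is hypothetical, each unquoted line is folded on its own, and width counts characters rather than columns, so the output may differ from what fold() produces, especially for East Asian text.

```python
import textwrap


def fold_fixed_sketch(text, width=70):
    """Fold unquoted lines; keep lines starting with ">" and empty lines as they are.

    Hypothetical helper for illustration only; it is not part of textseg.
    """
    out = []
    for line in text.splitlines():
        # Surplus SPACEs and horizontal tabs at end of line are removed.
        stripped = line.rstrip(" \t")
        if stripped.startswith(">") or not stripped:
            out.append(stripped)
        else:
            out.extend(textwrap.wrap(stripped, width=width))
    # A newline is appended at the end of the text.
    return "\n".join(out) + "\n"


sample = "> quoted line that is rather long and stays as is\nshort words to fold here\n"
folded = fold_fixed_sketch(sample, width=10)
```

Here the quoted first line is left untouched while the second line is split into pieces of at most 10 characters.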

textseg.unfold(string, method='fixed', newline='\n', **kwds)[source]¶

unfold(text[, method]) -> unicode

Join the folded paragraphs of string and return the result. The following options may be specified for the method argument.

"fixed": Default method. Lines preceded by “>” won’t be joined. An empty line is treated as a paragraph separator.
"flowed": Unfold “Format=Flowed; DelSp=Yes” formatting defined by RFC 3676.
"flowedsp": Unfold “Format=Flowed; DelSp=No” formatting defined by RFC 3676.

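The two flowed variants differ only in what happens to the space that marks a soft line break. The sketch below shows that idea in plain Python, without textseg. The names unflow_sketch and delsp are hypothetical, and the quoting, space-stuffing and signature separator rules that RFC 3676 also defines are deliberately ignored.

```python
def unflow_sketch(text, delsp=False):
    """Join soft-broken lines, i.e. lines that end with a SPACE.

    delsp=True drops the trailing SPACE before joining (DelSp=Yes);
    delsp=False keeps it (DelSp=No). Illustration only, not textseg code.
    """
    paragraphs = []
    current = ""
    for line in text.split("\n"):
        if line.endswith(" "):
            # Soft break: the paragraph continues on the next line.
            current += line[:-1] if delsp else line
        else:
            # Hard break: the paragraph ends here.
            paragraphs.append(current + line)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


english = unflow_sketch("Hello \nworld\nnext", delsp=False)
japanese = unflow_sketch("\u65e5\u672c \n\u8a9e", delsp=True)
```

Dropping the space matters for scripts written without spaces between words, which is why the second call uses delsp=True.
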
textwrap Style Functions¶

textseg.fill(text, **kwds)[source]¶

fill(text[, options...]) -> unicode

Reformat the single paragraph in text to fit in lines of no more than width columns, and return a new string containing the entire wrapped paragraph. Optional named arguments will be passed to wrap function.

textseg.wrap(text, width=70, initial_indent='', subsequent_indent='', expand_tabs=True, replace_whitespace=True, fix_sentence_endings=False, break_long_words=True, break_on_hyphens=True, drop_whitespace=True, **kwds)[source]¶

wrap(text[, options...]) -> [unicode]

Wrap paragraphs of a text then return a list of wrapped lines.

Reformat each paragraph in text so that it fits in lines of no more than width columns if possible, and return a list of wrapped lines. By default, tabs in text are expanded and all other whitespace characters (including newline) are converted to space.

See textwrap about options.

Note

Some options have no effect in this module: fix_sentence_endings, break_on_hyphens, drop_whitespace.

For other named arguments see instance attributes of LineBreak class.
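
Since the options are those of textwrap, the calling convention can be shown with the standard module alone. This sketch does not import textseg; it assumes that textseg.wrap() and textseg.fill() accept the same keyword arguments, as their signatures above suggest.

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog."

# wrap() returns a list of lines, fill() returns one string.
lines = textwrap.wrap(text, width=20, initial_indent="* ", subsequent_indent="  ")
paragraph = textwrap.fill(text, width=20, initial_indent="* ", subsequent_indent="  ")
```

In the standard module, fill() is equivalent to joining the result of wrap() with newlines. Note that textwrap measures width in characters, whereas this package is described as measuring it in columns.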

GCStr class¶

class textseg.GCStr[source]¶

GCStr class treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 ([UAX29]).
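
To see why a grapheme cluster view differs from a plain Unicode string, the sketch below groups combining marks with the preceding base character using only the standard unicodedata module. This is a much simpler rule than the extended grapheme clusters of [UAX29] (it ignores Hangul jamo sequences, for instance), and simple_clusters is a hypothetical helper, not part of textseg.

```python
import unicodedata


def simple_clusters(text):
    """Attach each combining mark to the character before it.

    A simplified approximation for illustration; real grapheme cluster
    boundaries follow the rules of UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # "e" followed by COMBINING ACUTE ACCENT
clusters = simple_clusters(word)
```

The string holds five Unicode characters but only four user-perceived characters, which is the distinction the chars attribute and len() of a GCStr object draw.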

static __new__(string, lb=None)[source]¶

GCStr(string[, lb]) -> GCStr

Create new grapheme cluster string (GCStr object) from Unicode string string.

Optional LineBreak object lb controls breaking features. Following attributes of LineBreak object affect new GCStr object.

eastasian_context
eaw
lbc
legacy_cm
virama_as_joiner

center(width, fillchar=' ')[source]¶

S.center(width[, fillchar]) -> GCStr

Return S centered in a string of width columns. Padding is done using the specified fill character (default is a space).

endswith(suffix, start=0, end=None)[source]¶

S.endswith(suffix[, start[, end]]) -> bool

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs(tabsize=8)[source]¶

S.expandtabs([tabsize]) -> GCStr

Return a copy of S where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 columns is assumed.

join(iterable)[source]¶

S.join(iterable) -> GCStr

Return a grapheme cluster string which is the concatenation of the strings in the iterable. The separator between elements is S.

ljust(width, fillchar=' ')[source]¶

S.ljust(width[, fillchar]) -> GCStr

Return S left-justified in a grapheme cluster string of width columns. Padding is done using the specified fill character (default is a space).

rjust(width, fillchar=' ')[source]¶

S.rjust(width[, fillchar]) -> GCStr

Return S right-justified in a string of width columns. Padding is done using the specified fill character (default is a space).

splitlines(keepends=False)[source]¶

S.splitlines([keepends]) -> [GCStr]

Return a list of the lines in S, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.

Note

U+001C, U+001D and U+001E are not included in linebreak characters.
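
The note matters because the built-in str.splitlines() does split on these three separators. The sketch below contrasts the two behaviors with the standard library only. The helper splitlines_sketch is hypothetical, and the assumption that every other boundary recognised by str.splitlines() is kept is not confirmed by this documentation.

```python
import re

# Boundaries recognised by str.splitlines(), minus U+001C, U+001D and U+001E.
# Assumed set, for illustration only.
_BOUNDARY = re.compile(r"\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def splitlines_sketch(text):
    """Split text into lines without treating U+001C..U+001E as line breaks."""
    lines = _BOUNDARY.split(text)
    if lines and lines[-1] == "":
        lines.pop()  # no empty item after a final line break
    return lines


builtin_result = "a\x1cb\nc".splitlines()
sketch_result = splitlines_sketch("a\x1cb\nc")
```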

startswith(prefix, start=0, end=None)[source]¶

S.startswith(prefix[, start[, end]]) -> bool

Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

translate(table)¶

Deprecated since version 0.1.0: See “Methods not Supported”.

String Operations

Most operations on string objects are also available on GCStr objects.

Operation	Result	Notes
`x in s`	`True` if s contains a grapheme cluster x, else `False`	(1)
`x not in s`	`False` if s contains a grapheme cluster x, else `True`	(1)
`s + t`	the concatenation of s and t	(2) (3)
`s * n, n * s`	n copies of s concatenated	(3)
`s[i]`	ith grapheme cluster of s, origin 0
`s[i:j]`	slice of s from i to j
`s[i:j:k]`	slice of s from i to j with step k
`len(s)`	number of grapheme clusters s contains	(4)
`min(s)`	smallest grapheme cluster of s
`max(s)`	largest grapheme cluster of s
`s < t`	strictly less than	(5)
`s <= t`	less than or equal	(5)
`s > t`	strictly greater than	(5)
`s >= t`	greater than or equal	(5)
`s == t`	equal	(5)
`s != t`	not equal	(5)
`str(s)`, `unicode(s)`	string representation of object. unicode() is used by Python 2.x.

Notes:

(1) x may be a Unicode string.
(2) One of the operands may be a Unicode string.
(3) Note that the number of columns (see cols) or grapheme clusters (see len()) of the resulting grapheme cluster string is not always equal to the sum of those of both strings.
(4) See also the chars and cols attributes.
(5) Comparisons are performed on the Unicode string value, without regard to grapheme cluster boundaries.

A GCStr object cannot be an operand of re regular expression operations.

Methods not Supported

Some string methods are not supported since they break grapheme cluster boundaries. Instead, use methods of stringified objects. For example:

# For Python 3
result = gcs * 0 + str(gcs).translate(table)
# For Python 2
result = gcs * 0 + unicode(gcs).translate(table)

gcs * 0 + ... is a convenient way to recalculate grapheme clusters.

Instance Attributes

These attributes are read-only.

chars¶: Number of Unicode characters the grapheme cluster string contains, i.e. its length as a Unicode string.

cols¶: Total number of columns of the grapheme clusters, as defined by the built-in character database. For more details see the documentation of the LineBreak class.

lbc¶: Line breaking class of the first character of the first grapheme cluster.

lbcext¶: Line breaking class of the last grapheme extender of the last grapheme cluster. If there are no grapheme extenders, or if its class is CM, the class of the last grapheme base is returned.

LineBreak class¶

class textseg.LineBreak(**kwds)[source]¶

The LineBreak class performs the Line Breaking Algorithm described in Unicode Standard Annex #14 ([UAX14]). The East_Asian_Width informative properties defined by Annex #11 ([UAX11]) are taken into account when determining breaking positions.

__init__(**kwds)[source]¶

LineBreak([options...]) -> LineBreak

Create a new LineBreak object. Optional named arguments may specify initial attribute values. See the documentation of the instance attributes. Initial defaults are:

break_indent=False, charmax=998, eastasian_context=False, eaw=None, format="SIMPLE", hangul_as_al=False, lbc=None, legacy_cm=True, minwidth=0, newline="\n", prep=[None], sizing="UAX11", urgent=None, virama_as_joiner=True, width=70

breakingRule(before, after)¶

S.breakingRule(before, after) -> int

Get possible line breaking behavior between strings before and after. Returned value is one of:

MANDATORY: Mandatory break.
DIRECT: Both direct break and indirect break are allowed.
INDIRECT: Indirect break is allowed but direct break is prohibited.
PROHIBITED: Breaking is prohibited.

The following instance attributes of the LineBreak object S affect the result.

eastasian_context
hangul_as_al
lbc
legacy_cm

Note

This method gives only an approximate description of line breaking behavior. Use the wrap method or other functions to fold actual texts.
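
The difference between direct and indirect breaks follows the pair-table convention of [UAX14]: an indirect break is available only when spaces separate the two strings. The sketch below spells that out. The Action enum and may_break() are hypothetical stand-ins written for this illustration; they are not the values of LineBreak.MANDATORY and friends, and no textseg call is made.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical stand-ins for the four line breaking behaviors."""
    MANDATORY = "mandatory"
    DIRECT = "direct"
    INDIRECT = "indirect"
    PROHIBITED = "prohibited"


def may_break(action, spaces_between):
    """Return True if a line break may occur between two strings.

    spaces_between tells whether one or more SPACEs separate them.
    """
    if action in (Action.MANDATORY, Action.DIRECT):
        return True
    if action is Action.INDIRECT:
        # Allowed only across intervening SPACEs.
        return spaces_between
    return False  # PROHIBITED: no break, even across SPACEs


allowed_without_space = may_break(Action.INDIRECT, spaces_between=False)
allowed_with_space = may_break(Action.INDIRECT, spaces_between=True)
```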

wrap(text)¶

S.wrap(text) -> [GCStr]

Break the Unicode string text and return a list of the resulting lines. Each item of the list is a grapheme cluster string (GCStr object).

Class Attributes

DEFAULTS¶: Dictionary containing default values of instance attributes.

MANDATORY¶

DIRECT¶

INDIRECT¶

PROHIBITED¶: Four values to specify line breaking behaviors: Mandatory break; Both direct break and indirect break are allowed; Indirect break is allowed but direct break is prohibited; Prohibited break.

Instance Attributes

For the default values of these attributes, see __init__().

break_indent¶: Always allow a break after SPACEs at the beginning of a line, a.k.a. indent. [UAX14] does not take such usage of SPACE into account.

charmax¶: Maximum possible number of characters in one line, not counting trailing SPACEs and the newline sequence. Note that the number of characters generally does not represent the length of a line. 0 means unlimited.

complex_breaking¶: Performs heuristic breaking on South East Asian complex context. If word segmentation for South East Asian writing systems is not enabled, this does not have any effect.

eastasian_context¶: Enable East Asian language/region context. If it is true, characters assigned to line breaking class AI will be treated as ideographic characters (ID) and East_Asian_Width A (ambiguous) will be treated as F (fullwidth). Otherwise, they are treated as alphabetic characters (AL) and N (neutral), respectively.
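
The effect of this attribute on widths can be approximated with the standard unicodedata module. In the sketch below, columns() is a hypothetical helper, not a textseg function: wide and fullwidth characters count as two columns, combining marks as zero, and ambiguous characters as two only in an East Asian context. The database built into this package may classify individual characters differently.

```python
import unicodedata


def columns(text, eastasian_context=False):
    """Estimate the number of columns text occupies (illustration only)."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # nonspacing marks occupy no column in this sketch
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and eastasian_context):
            total += 2
        else:
            total += 1
    return total


latin = columns("abc")
kanji = columns("\u65e5\u672c\u8a9e")
ambiguous_default = columns("\u00b1")  # PLUS-MINUS SIGN, East_Asian_Width A
ambiguous_east_asian = columns("\u00b1", eastasian_context=True)
```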

eaw¶: Tailor classification of the East_Asian_Width property defined by [UAX11]. The value may be a dictionary whose keys are Unicode strings or UCS scalar values and whose values are any of the East_Asian_Width properties (see the documentation of the textseg.Consts module). If None is specified, all tailoring assigned before is canceled. By default, no tailoring is applied. See also “Tailoring Character Properties”.

format¶

Specify the method to format broken lines.

"SIMPLE": Only insert a newline at arbitrary breaking positions.
"NEWLINE": Insert or replace newline sequences with the one specified by the newline option, and remove SPACEs preceding newline sequences or end-of-text. Then append a newline at the end of the text if it does not already end with one.
"TRIM": Insert a newline at arbitrary breaking positions. Remove SPACEs preceding newline sequences.
None: Do nothing, not even inserting newlines.
callable object: See “Formatting Lines”.
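
The differences between the named methods can be sketched on a list of already broken pieces. The function format_pieces() below is hypothetical and only mirrors the descriptions above; it assumes that the pieces contain no newline sequences of their own and says nothing about how the library implements formatting.

```python
def format_pieces(pieces, method="SIMPLE", newline="\n"):
    """Join broken pieces according to a named formatting method."""
    if method is None:
        return "".join(pieces)  # nothing is inserted
    if method == "SIMPLE":
        return newline.join(pieces)
    if method == "TRIM":
        # SPACEs are removed only where a newline follows.
        head = [p.rstrip(" ") for p in pieces[:-1]]
        return newline.join(head + pieces[-1:])
    if method == "NEWLINE":
        # SPACEs before end-of-text are removed too, and a final newline is added.
        return "".join(p.rstrip(" ") + newline for p in pieces)
    raise ValueError("unknown method: %r" % (method,))


pieces = ["foo ", "bar ", "baz "]
simple = format_pieces(pieces, "SIMPLE")
trimmed = format_pieces(pieces, "TRIM")
with_newline = format_pieces(pieces, "NEWLINE")
```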

hangul_as_al¶: Treat hangul syllables and conjoining jamo as alphabetic characters (AL).

lbc¶: Tailor classification of the line breaking property defined by [UAX14]. The value may be a dictionary whose keys are Unicode strings or UCS scalar values and whose values are any of the line breaking classes (see the Consts module). If None is specified, all tailoring assigned before is canceled. By default, no tailoring is applied. See also “Tailoring Character Properties”.

legacy_cm¶: Treat combining characters led by a SPACE as an isolated combining character (ID). As of Unicode 5.0, such use of SPACE is not recommended.

minwidth¶: Minimum number of columns that a line broken at an arbitrary position may contain, not counting trailing SPACEs and newline sequences.

newline¶: Unicode string to be used for newline sequence. It may be None.

prep¶

Add user-defined line breaking behavior(s). The value shall be a list of the items described below.

"NONBREAKURI": Won’t break URIs.
"BREAKURI": Break URIs according to a rule suitable for printed materials. For more details see [CMOS], sections 6.17 and 17.11.
(regex, callable object[, flags]): The sequences matching regex will be broken by callable object. If regex is a string, not a regex object, flags may be specified. For more details see “User-Defined Breaking Behaviors”.
None: Cancel all methods assigned before.

sizing¶

Specify the method used to calculate the size of a string. The following options are available.

"UAX11": Sizes are computed from the columns of each character.
None: Number of grapheme clusters (see the documentation of the GCStr class) contained in the string.
callable object: See “Calculating String Size”.

Exception¶

exception textseg.LineBreakException[source]¶: See urgent attribute of LineBreak class.

textseg.Consts module¶

Constants for textseg package.

textseg.Consts.eawNa¶

textseg.Consts.eawN¶

textseg.Consts.eawA¶

textseg.Consts.eawW¶

textseg.Consts.eawH¶

textseg.Consts.eawF¶

textseg.Consts.eawZ¶: Index values to specify six East_Asian_Width properties defined by [UAX11], and eawZ to specify nonspacing.

Note

Property value Z is non-standard.

textseg.Consts.lbcBK¶

textseg.Consts.lbcCR¶

textseg.Consts.lbcLF¶

textseg.Consts.lbcNL¶

textseg.Consts.lbcSP¶

textseg.Consts.lbcOP¶

textseg.Consts.lbcCL¶

textseg.Consts.lbcCP¶

textseg.Consts.lbcQU¶

textseg.Consts.lbcGL¶

textseg.Consts.lbcNS¶

textseg.Consts.lbcEX¶

textseg.Consts.lbcSY¶

textseg.Consts.lbcIS¶

textseg.Consts.lbcPR¶

textseg.Consts.lbcPO¶

textseg.Consts.lbcNU¶

textseg.Consts.lbcAL¶

textseg.Consts.lbcHL¶

textseg.Consts.lbcID¶

textseg.Consts.lbcIN¶

textseg.Consts.lbcHY¶

textseg.Consts.lbcBA¶

textseg.Consts.lbcBB¶

textseg.Consts.lbcB2¶

textseg.Consts.lbcCB¶

textseg.Consts.lbcZW¶

textseg.Consts.lbcCM¶

textseg.Consts.lbcWJ¶

textseg.Consts.lbcH2¶

textseg.Consts.lbcH3¶

textseg.Consts.lbcJL¶

textseg.Consts.lbcJV¶

textseg.Consts.lbcJT¶

textseg.Consts.lbcSG¶

textseg.Consts.lbcAI¶

textseg.Consts.lbcCJ¶

textseg.Consts.lbcSA¶

textseg.Consts.lbcXX¶: Index values to specify 39 line breaking properties (classes) defined by [UAX14].

Note

Property value CP was introduced by Unicode 5.2.0. Property values HL and CJ were introduced by Unicode 6.1.0.

textseg.Consts.sea_support¶: Flag to determine whether word segmentation for South East Asian writing systems is enabled. If this feature is enabled, a non-empty string is set. Otherwise, None is set.

Note

Current release supports Thai script of modern Thai language only.

textseg.Consts.unicode_version¶: A string specifying the version of the Unicode Standard this module refers to.

See also “Tailoring Character Properties”.
