If you specify a callable object as the format property of a LineBreak object, it should accept three arguments:
callable_object(self, context, string) -> text_or_None
self is a LineBreak object, context is a string identifying the context in which the callable was invoked, and string is the fragment of the Unicode string that precedes or follows the breaking position.
context   When                     Value of string
-------   ----------------------   -------------------------------------
"sot"     Beginning of text        Fragment of first line
"sop"     After mandatory break    Fragment of next line
"sol"     After arbitrary break    Fragment on sequel of line
""        Just before any breaks   Complete line without trailing SPACEs
"eol"     Arbitrary break          SPACEs leading breaking position
"eop"     Mandatory break          Newline and its leading SPACEs
"eot"     End of text              SPACEs (and newline) at end of text
The callable object should return the modified text fragment, or may return None to indicate that no modification occurred. Note that a modification in the context "sot", "sop" or "sol" may affect the choice of subsequent breaking positions, while a modification in the other contexts will not.
Note
String arguments are actually sequences of grapheme clusters. See documentation of GCStr class.
For example, the following code folds lines, removing trailing spaces:
from textseg import LineBreak

def format(self, event, string):
    if event.startswith('eo'):
        return "\n"
    return None

lb = LineBreak(format = format)
output = ''.join([str(s) for s in lb.wrap(text)])
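As a further illustration of the context values, the sketch below prefixes every output line with "> " in addition to trimming trailing spaces. The name quote_format and the prefix are made up for this illustration, and the function has only been exercised with plain str values rather than through textseg, so treat it as a sketch of the calling convention under those assumptions.

```python
def quote_format(self, context, string):
    """Hypothetical format callable: quote each line with "> "."""
    if context in ("sot", "sop", "sol"):
        # Start-of-line contexts: prepend the prefix to the fragment.
        # The prefix occupies columns, so it may change where the
        # following breaks fall.
        return "> " + string
    if context in ("eol", "eop", "eot"):
        # End-of-line contexts: replace the spaces (and newline)
        # with a single newline.
        return "\n"
    # "" context (complete line): report no modification.
    return None
```

It would be passed as LineBreak(format = quote_format), in the same way as the example above.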
When a line produced by an arbitrary break is expected to exceed the limit set by charmax, width or minwidth, an urgent break may be performed on the following string. If you specify a callable object as the value of the urgent attribute, it should accept two arguments:
callable_object(self, string) -> [text, ...]
self is a LineBreak object and string is the Unicode string to be broken.
The callable object should return a list of the pieces into which string is broken.
Note
String argument is actually a sequence of grapheme clusters. See GCStr class.
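To make the contract concrete, here is a minimal sketch of an urgent callable that cuts the string into pieces of at most a fixed number of items. The name chunk_urgent and the size parameter are hypothetical, and the sketch has only been exercised with plain str values; it relies on nothing beyond len() and slicing.

```python
def chunk_urgent(self, string, size=10):
    """Hypothetical urgent callable: split string into pieces of
    at most `size` items (assumed limit, not a textseg setting)."""
    # Slice at every multiple of `size`; the last piece may be shorter.
    return [string[i:i + size] for i in range(0, len(string), size)]
```

Concatenating the returned pieces gives back the original string, so no text is lost or added by the break.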
For example, the following code inserts hyphens into the names of several chemical substances (such as Titin) so that they may be folded:
# Example not yet written
If you specify a (regular expression, callable object[, flags]) tuple as an item of the prep option, the callable object should accept two arguments:
callable_object(self, string) -> [text, ...]
self is a LineBreak object and string is the Unicode string matched by the regular expression.
The callable object should return a list of the pieces into which string is broken.
For example, the following code breaks HTTP URLs using the [CMOS] rule:
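import re
from textseg import fill

# Match HTTP URLs, optionally prefixed with "url:".
urire = re.compile(r'\b(?:url:)?http://[\x21-\x7E]+',
                   re.I | re.U)

def breakURI(self, s):
    r = ''    # piece being accumulated
    ret = []  # completed pieces
    b = ''    # previous character
    for c in s:
        if b == '':
            r = c
        elif r.lower().endswith('url:'):
            # Break after the "url:" prefix.
            ret.append(r)
            r = c
        elif (b in '/' and c not in '/') or \
             (b not in '-.' and c in '-~.,_?#%=&') or \
             b in '=&' or c in '=&':
            # Break after slashes, before certain punctuation, and
            # before or after "=" and "&".
            if r != '':
                ret.append(r)
            r = c
        else:
            r += c
        b = c
    if r != '':
        ret.append(r)
    return ret

output = fill(text, prep = [(urire, breakURI)])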
Changed in version 0.1.1: The prep attribute accepts tuples with a third item, flags.
If you specify a callable object as the value of the sizing property, it will be called with five arguments:
callable_object(self, length, pre, spc, string) -> number_of_columns
self is a LineBreak object, length is the size of the preceding string, pre is the preceding Unicode string, spc is the additional SPACEs and string is the Unicode string to be processed.
The callable object should return the calculated number of columns of pre + spc + string. The number of columns need not be an integer: its unit may be chosen freely, but it must be the same as that of the minwidth and width properties.
Note
String arguments are actually sequences of grapheme clusters. See GCStr class.
For example, the following code processes lines with tab stops every eight columns:
from textseg import fill
from textseg.Consts import lbcSP

def sizing(self, cols, pre, spc, string):
    spcstr = spc + string
    i = 0
    for c in spcstr:
        if c.lbc != lbcSP:
            cols += spcstr[i:].cols
            break
        if c == "\t":
            cols += 8 - (cols % 8)
        else:
            cols += c.cols
        i = i + 1
    return cols

output = fill(text, lbc = {ord("\t"): lbcSP}, sizing = sizing,
              expand_tabs = False)
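The tab-stop arithmetic used above, cols += 8 - (cols % 8), can be checked in isolation. The sketch below is plain Python and does not use textseg; the function names are invented for this illustration, and every non-tab character is assumed to occupy exactly one column.

```python
def next_tab_stop(cols, tab_width=8):
    """Return the column of the first tab stop strictly after cols."""
    return cols + tab_width - (cols % tab_width)

def columns_with_tabs(text, cols=0, tab_width=8):
    """Count columns of text starting at column cols, assuming one
    column per character and tab stops every tab_width columns."""
    for ch in text:
        if ch == "\t":
            cols = next_tab_stop(cols, tab_width)
        else:
            cols += 1
    return cols
```

A tab at column 0, 3 or 7 advances to column 8, while a tab at column 8 advances to column 16, so a tab always moves the position forward by at least one column.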
Character properties may be tailored by the lbc and eaw options. Some constants are defined for convenience of tailoring.
By default, several hiragana, katakana and characters corresponding to kana are treated as non-starters (NS or CJ). When the lbc attribute is updated with the following items, these characters are treated as normal ideographic characters (ID).
Ideographic iteration marks. U+3005 々 IDEOGRAPHIC ITERATION MARK, U+303B 〻 VERTICAL IDEOGRAPHIC ITERATION MARK, U+309D ゝ HIRAGANA ITERATION MARK, U+309E ゞ HIRAGANA VOICED ITERATION MARK, U+30FD ヽ KATAKANA ITERATION MARK and U+30FE ヾ KATAKANA VOICED ITERATION MARK.
Note
Some of them are neither hiragana nor katakana.
{ KANA_SMALL_LETTERS: lbcID }
Hiragana or katakana small letters.
Hiragana small letters: U+3041 ぁ “A”, U+3043 ぃ “I”, U+3045 ぅ “U”, U+3047 ぇ “E”, U+3049 ぉ “O”, U+3063 っ “TU”, U+3083 ゃ “YA”, U+3085 ゅ “YU”, U+3087 ょ “YO”, U+308E ゎ “WA”, U+3095 ゕ “KA”, U+3096 ゖ “KE”.
Katakana small letters: U+30A1 ァ “A”, U+30A3 ィ “I”, U+30A5 ゥ “U”, U+30A7 ェ “E”, U+30A9 ォ “O”, U+30C3 ッ “TU”, U+30E3 ャ “YA”, U+30E5 ュ “YU”, U+30E7 ョ “YO”, U+30EE ヮ “WA”, U+30F5 ヵ “KA”, U+30F6 ヶ “KE”.
Katakana phonetic extensions: U+31F0 ㇰ “KU” - U+31FF ㇿ “RO”.
Halfwidth katakana small letters: U+FF67 ァ “A” - U+FF6F ッ “TU”.
Note
These letters and the prolonged sound marks below may optionally be treated either as non-starters or as normal ideographic characters. See [JISX4051] 6.1.1, [JLREQ] 3.1.7 or [UAX14].
Note
U+3095 ゕ “KA”, U+3096 ゖ “KE”, U+30F5 ヵ “KA” and U+30F6 ヶ “KE” are considered to be neither hiragana nor katakana.
{ KANA_PROLONGED_SOUND_MARKS: lbcID }
Hiragana or katakana prolonged sound marks. U+30FC ー KATAKANA-HIRAGANA PROLONGED SOUND MARK and U+FF70 ー HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK.
U+303C 〼 MASU MARK.
Note
Although this character is not kana, it is usually regarded as an abbreviation for the sequence of hiragana ま す or katakana マ ス, MA and SU.
Note
This character is classified as non-starter (NS) by [UAX14] and as Class 13 (corresponding to ID) by [JISX4051] and [JLREQ].
By default, some punctuation marks are treated as ambiguous quotation marks (QU).
Danish, Finnish, Norwegian and Swedish use 9-style or right-pointing punctuation marks (’ ” » ›) as both opening and closing quotation marks.
Some letters of the Latin, Greek and Cyrillic scripts have the ambiguous (A) East_Asian_Width property, so they are treated as wide when the eastasian_context attribute is true. When the eaw attribute is updated with the following values, those characters are always treated as narrow.
{ AMBIGUOUS_CYRILLIC: eawN }
{ AMBIGUOUS_GREEK: eawN }
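The ambiguous property itself can be inspected with Python's standard unicodedata module, independently of textseg. The sample characters below were picked for this illustration only and are not taken from the constants above.

```python
import unicodedata

# Sample letters chosen for illustration: Latin e with acute,
# Greek alpha, Cyrillic zhe, and ASCII "a" for comparison.
samples = ["\u00e9", "\u03b1", "\u0436", "a"]

# Map each character to its East_Asian_Width property value.
widths = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch in samples:
    print("U+%04X %s" % (ord(ch), widths[ch]))
```

Characters reported as A are the ones whose width depends on eastasian_context; the ASCII letter is reported as Na and is unaffected.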
On the other hand, although several characters were occasionally rendered as wide by a number of implementations for East Asian character sets, they are given the narrow (Na) East_Asian_Width property just because they have fullwidth (F) compatibility characters. When the eaw attribute is updated with the following values, those characters are treated as ambiguous, that is, wide when the eastasian_context attribute is true.