Word and grapheme breaks


The unicode.breaks vocabulary partially implements Unicode Standard Annex #29, which specifies how to segment a string along grapheme and word boundaries. In Unicode, a grapheme, the basic unit of display in text, may consist of more than one code point. For example, the string "e\u000301" (where U+0301 is a combining acute accent) contains two code points but only one grapheme, because the accent is drawn above the e. Word breaking is likewise more involved than splitting on whitespace, and the Unicode algorithm accounts for that.
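
The difference between code points and graphemes can be checked at the listener. The following is a minimal sketch, assuming the words are loaded from unicode.breaks as named in this article (some Factor releases may export them from a different vocabulary). The string is built from code points, with 769 being U+0301 in decimal, so the example does not depend on a particular string escape syntax.

```factor
USING: kernel prettyprint sequences strings unicode.breaks ;

! Build "e" followed by U+0301 (decimal 769), COMBINING ACUTE ACCENT.
{ CHAR: e 769 } >string

! The string holds two code points...
dup length .            ! prints 2

! ...but it segments into a single grapheme.
>graphemes length .     ! prints 1
```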

Operations on graphemes:
first-grapheme ( str -- i )

first-grapheme-from ( start str -- i )

last-grapheme ( str -- i )

last-grapheme-from ( end str -- i )

>graphemes ( str -- graphemes )

string-reverse ( str -- rts )
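
As an illustration of the grapheme words listed above, here is a hedged sketch. It assumes, based on the stack effect ( str -- i ), that first-grapheme returns the index at which the first grapheme ends, and that string-reverse reverses grapheme by grapheme rather than code point by code point. The sample string is an assumption chosen for the example.

```factor
USING: kernel prettyprint sequences strings unicode.breaks ;

! "e", U+0301 (decimal 769), then "x": three code points, two graphemes.
{ CHAR: e 769 CHAR: x } >string

! Index just past the first grapheme (assumed meaning of the result).
dup first-grapheme .

! Grapheme-aware reversal: the accent should stay attached to the "e".
dup string-reverse .

! Plain sequence reversal, for comparison: the accent follows the "x" instead.
reverse .
```

The comparison with reverse is the point of the example: reversing code points detaches a combining mark from its base letter, which is the case string-reverse is meant to handle.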


Operations on words:
first-word ( str -- i )

first-word-from ( start str -- i )

last-word ( str -- i )

last-word-from ( end str -- i )

>words ( str -- words )
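
A short sketch of word segmentation follows, again assuming the words are loaded from unicode.breaks. The sample sentence is an assumption made for illustration. Under the Annex #29 rules, punctuation and whitespace are expected to come out as segments of their own instead of being attached to the neighbouring word, which is where the result differs from splitting on whitespace.

```factor
USING: prettyprint sequences strings unicode.breaks ;

! Segment a sentence along word boundaries. The pieces may be slices of
! the original string, so copy each into a fresh string before printing.
"Hello, world!" >words [ >string ] map .
```

Because the segments partition the input, concatenating them again should give back the original string.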