Unicode normalization


The unicode.normalize vocabulary defines words for normalizing Unicode strings.

In Unicode, several different sequences of code points can represent exactly the same text. For example, e with an acute accent can be written in two ways: "e\u000301" (the letter e followed by the combining acute accent, U+0301) and "\u0000e9" (a single precomposed character, U+00E9, e with an acute accent). The two strings are equivalent, but they do not compare equal until both are normalized to the same form.
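The equivalence described above can be observed in any Unicode-aware language. The following is a minimal sketch in Python, using the standard unicodedata module as a stand-in for the words in this vocabulary; it is not the Factor API. Note that Python writes these escapes with four hex digits, where the strings above use six.

```python
import unicodedata

decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# The raw strings differ, both in content and in length.
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1

# After normalizing to a common form, they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(nfc_equal, nfd_equal)               # True True
```

This is why strings from different sources are usually normalized to one form before they are compared or used as keys.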

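There are four normalization forms: NFD, NFC, NFKD, and NFKC. NFD and NFKD decompose: each precomposed character is expanded into a base character followed by its combining marks. NFC and NFKC decompose and then recompose, so combining sequences are contracted into precomposed characters where possible. The K forms, NFKD and NFKC, additionally replace compatibility characters (such as ligatures) with their plain equivalents. This extra step loses information and cannot be undone, so the K forms should be used with care.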
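To make the difference between the four forms concrete, here is a small sketch in Python, again using the standard unicodedata module rather than the words of this vocabulary. The sample string is an assumption chosen for illustration: the ligature U+FB01 (a compatibility character) followed by the precomposed U+00E9.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI, then U+00E9 (e with acute accent)
sample = "\ufb01\u00e9"

forms = {name: unicodedata.normalize(name, sample)
         for name in ("NFD", "NFC", "NFKD", "NFKC")}

# Show each result as a list of code points.
for name, result in forms.items():
    print(name, [f"U+{ord(ch):04X}" for ch in result])

# The canonical forms keep the ligature; the K forms replace it with "fi".
ligature_kept = "\ufb01" in forms["NFC"] and "\ufb01" in forms["NFD"]
ligature_lost = "\ufb01" not in forms["NFKC"] and "\ufb01" not in forms["NFKD"]
```

Once the ligature has been replaced by the two letters f and i, nothing in the result records that it was ever a ligature, which is the information loss mentioned above.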

Most text is exchanged in NFC, but for many processing tasks NFD or NFKD is easier to work with, because base characters and combining marks appear as separate code points. For more information, see Unicode Standard Annex #15 and Chapter 3 of the Unicode Standard.
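One illustration of why a decomposed form can be easier to process is removing accents: after decomposition, the accents are separate combining characters that can simply be filtered out. The sketch below is in Python with the standard unicodedata module; strip_marks is a hypothetical helper written for this example, not a word in this vocabulary.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop every combining character."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(strip_marks("caf\u00e9"))    # cafe
print(strip_marks("na\u00efve"))   # naive
```

Doing the same on NFC text would require a lookup table from each precomposed character to its base letter, which is the work that decomposition already performs.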
nfc ( string -- nfc )

nfd ( string -- nfd )

nfkc ( string -- nfkc )

nfkd ( string -- nfkd )